Blueprint Separable Subsampling and Aggregate Feature Conformer-Based End-to-End Neural Diarization

Abstract: At present, a prevalent approach to speaker diarization is clustering based on speaker embeddings. However, this method encounters two primary issues: first, it cannot directly minimize the diarization error during training; second, most clustering-based methods struggle to handle speaker overlap in audio. A viable way to address these issues is to adopt end-to-end speaker diarization (EEND). Nevertheless, training an EEND system generally requires lengthy audio inputs, which must be downsampled to allow efficient model processing. In this study, we develop a novel downsampling layer that uses blueprint separable convolution (BSConv) instead of depthwise separable convolution (DSC) as the foundational convolutional unit, which more effectively preserves information from the original audio. Furthermore, we incorporate multi-scale feature aggregation (MFA) into the encoder structure to pass the features extracted by each conformer block to the output layer, thereby enhancing the expressiveness of the model's feature extraction. Lastly, we employ the conformer as the backbone network to incorporate the proposed enhancements, resulting in an EEND system named BSAC-EEND. We assess the proposed method on both simulated and real datasets. The experiments indicate that our EEND system reduces the diarization error rate (DER) by an average of 17.3% on two-speaker datasets and 12.8% on three-speaker datasets compared to the baseline.


Introduction
Speaker diarization is the task of determining "who spoke when" from a given audio input. This technique has various applications, including information retrieval from broadcast news, generation of conference transcripts, and analysis of telephone conversations [1,2]. Moreover, it is a crucial component for implementing automatic speech recognition (ASR) in multi-speaker settings such as telephone conversations [3], conferences [4], and lectures [5]. It has been demonstrated that accurate speaker diarization results can enhance ASR performance by constraining speech masks during the construction of speech-separating beamformers [6,7].
Traditional speaker diarization systems mainly adopt a clustering-based approach [8,9], which sequentially applies the following modules to the input audio to obtain the final results: voice activity detection, speech segmentation, feature extraction, and clustering. However, despite the satisfactory performance of traditional clustering-based systems, they exhibit certain limitations. First, they depend on multiple modules, each requiring individual training; consequently, clustering-based speaker diarization systems must be jointly calibrated across modules, which introduces additional training complexity. Second, although some recent work has attempted to handle multiple simultaneous speakers, the clustering-based approach implicitly assumes that each short speech segment contains only one active speaker, making it difficult to handle overlapping speech [10].
End-to-end speaker diarization presents a promising solution. Self-attentive end-to-end neural diarization (SA-EEND) [11] is an advanced method designed to model the joint speech activities of multiple speakers. Unlike clustering-based approaches, SA-EEND does not require a clustering algorithm; instead, it directly outputs the speaking probabilities of all speakers within each time period. By utilizing a multi-label classification framework, SA-EEND naturally addresses the speaker overlap problem in both training and inference, given a multi-speaker audio recording as input. Furthermore, SA-EEND follows an end-to-end training strategy, minimizing the diarization error rate, and exhibits remarkable accuracy on a dataset of two-speaker telephone conversations [12].
In this paper, we propose an end-to-end speaker diarization model based on a blueprint separable subsampling layer and an aggregate-feature conformer (BSAC-EEND), which enhances system performance by optimizing the convolutional subsampling layer and the feature encoder. First, the EEND system must process long utterances. During training, to increase efficiency, the system segments the input audio into pieces typically ranging from a few seconds to several tens of seconds. In contrast, during testing, the entire audio is assessed, leading to much longer inputs. A critical consideration is therefore the model's ability to robustly process utterances of various durations, particularly long ones. In speaker recognition, the multi-scale feature fusion conformer [13] has demonstrated proficiency in extracting robust global feature information from speech of varying lengths. This is attributed to the model's ability to capture both global and local features while integrating multi-scale feature representations; notably, it performs better on long-duration test utterances [13]. Consequently, we introduce the MFA structure into the EEND system's feature extractor. The conformer [14] merges a transformer [15] with convolutional neural networks to effectively capture global and local features, whereas the MFA structure concatenates the output features of all conformer blocks, yielding multi-scale representations. Second, to reduce redundant information in the audio frame sequence and enable the model to efficiently process long-duration audio input, downsampling is necessary. Frame-stacking subsampling is frequently employed [11,16], although [12] has shown that convolutional subsampling can notably improve EEND performance compared to frame-stacking subsampling. Motivated by this finding, we introduce BSConv [17], an advanced variant of depthwise separable convolution [18] that more effectively leverages correlations within convolution kernels to achieve efficient convolution separation [17]. We design a novel convolutional subsampling layer utilizing both the unconstrained BSConv (BSConv-U) [17] and subspace BSConv (BSConv-S) [17] structures to optimize the subsampling architecture.

This paper's primary contributions are threefold: (1) incorporation of the MFA structure into the EEND model's feature extractor, enhancing its capability for extracting robust features; (2) assessment of the performance of the convolutional subsampling layers based on BSConv-U and BSConv-S on both simulated and real data, with results demonstrating superiority over frame-stacking subsampling and DSC-based convolutional subsampling; (3) development of a novel EEND system that employs a conformer as the backbone network, integrates the MFA structure, and incorporates a BSConv-based convolutional subsampling layer. This new EEND system demonstrates significant performance improvements on both simulated and real datasets.

End-to-End Speaker Diarization
One advantage of end-to-end speaker diarization approaches, which generate results directly from audio, is that they do not need additional modules to detect unvoiced or overlapped speech. Various methods exist for obtaining speaker diarization results via speech separation [35,36]. However, these models are trained on clean speech or time-frequency masks derived from clean speech, making them unsuitable for training on complex audio datasets such as DIHARD [37,38]. In contrast, EEND-based methods, designed to output the posterior probability of multiple speakers speaking simultaneously, can be trained on real data and are better suited to practical problems. The EEND model accepts a sequence of acoustic features, such as MFCC or Fbank, X = (x_t ∈ R^F | t = 1, ..., T). The neural network produces a speaker label sequence Y = (y_t | t = 1, ..., T), where y_t = [y_{t,k} ∈ {0, 1} | k = 1, ..., K]; y_{t,k} = 1 means that speaker k is speaking at frame t, and K denotes the maximum number of speakers the network can handle. For different speakers k and k′, both y_{t,k} and y_{t,k′} may equal one, indicating that speakers k and k′ are speaking simultaneously at frame t, i.e., overlapping speech. Assuming the outputs y_{t,k} are conditionally independent, the model's training objective is to maximize the diarization posterior probability log P(Y|X) ≈ Σ_t Σ_k log P(y_{t,k}|X) of the training data. Multiple candidate reference label sequences can be generated by permuting the speaker labels in Y; the loss is therefore computed for all possible reference sequences, and the reference with the lowest loss is used for error backpropagation. This approach is influenced by the permutation-free objective used in speech separation [39]. Early EEND implementations employed bidirectional long short-term memory (BLSTM) [16], succeeded by self-attention-based EEND networks [11], demonstrating improved DER on two-speaker data such as the CALLHOME dataset (LDC2001S97) [40] and dialogue audio in the CSJ [41].
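To make the permutation-free objective concrete, the sketch below computes the minimum binary cross-entropy over all speaker-label permutations. It is a minimal PyTorch illustration under our own naming (`pit_bce_loss` is not from the paper), not the authors' exact implementation.

```python
# Minimal sketch of the permutation-free (PIT-style) diarization loss.
from itertools import permutations

import torch
import torch.nn.functional as F


def pit_bce_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits, labels: (T, K) frame-by-speaker activity; returns the minimum
    binary cross-entropy over all K! speaker-label permutations."""
    num_speakers = labels.shape[1]
    losses = []
    for perm in permutations(range(num_speakers)):
        permuted = labels[:, list(perm)]  # one candidate reference sequence
        losses.append(F.binary_cross_entropy_with_logits(logits, permuted))
    return torch.stack(losses).min()      # backpropagate the lowest-loss permutation


# Example: 2 speakers, 100 frames
loss = pit_bce_loss(torch.randn(100, 2), torch.randint(0, 2, (100, 2)).float())
```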
Drawing upon the high-performing conformer model in end-to-end continuous speech recognition, the authors of [12] devised a conformer-based EEND system (CB-EEND), incorporating SpecAugment [42] and a convolutional subsampling layer based on depthwise separable convolution, resulting in significant improvements.

BSAC-EEND
This section presents the fundamental composition and structure of the proposed BSAC-EEND. As shown in Figure 1, the model's overall structure encompasses SpecAugment, a BSConv convolutional subsampling layer, conformer blocks, an MFA structure, a layer norm, a linear layer, and a sigmoid function. The subsequent sections describe the model's critical components in detail.

Convolution Subsampling Based on BSConv
After an in-depth analysis of CNNs, the work in [17] concluded that convolution kernels generally show high redundancy along their depth direction, i.e., high intra-kernel correlation. Consequently, the authors propose BSConv, which employs a two-dimensional blueprint to represent each convolution kernel, using a weight vector distributed along the depth axis.
DSC, by contrast, assumes high inter-kernel redundancy and adopts a structure opposite to that of BSConv; however, this assumption has been shown to be less effective for separating convolutions [43].
As a result, we explore the use of BSConv as the fundamental unit of convolutional subsampling and design a novel convolutional subsampling layer to subsample the input features. BSConv encompasses two distinct structures: BSConv-U and BSConv-S. Analogous to DSC, both are composed of pointwise convolution [18] and depthwise convolution [18], but in a different order. The depthwise convolution applies a separate kernel to each channel of the input feature map, modifying the output feature map's shape without altering its channel count. In contrast, the pointwise convolution applies a 1 × 1 convolution to every feature map; it exclusively modifies the number of channels, leaving the feature map's size unaffected. Figure 2 illustrates the transformations in the shape of the input and output; their structural comparisons to DSC are also displayed in Figure 2. A code sketch contrasting the two orderings is given below.
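As a minimal sketch of the ordering difference, assuming PyTorch and illustrative kernel/channel settings (BSConv-S additionally factorizes the pointwise step through a low-rank subspace, which is omitted here):

```python
import torch.nn as nn


def dsc(in_ch: int, out_ch: int, k: int = 3) -> nn.Sequential:
    # DSC: depthwise first (groups=in_ch), then 1x1 pointwise to mix channels.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch),
        nn.Conv2d(in_ch, out_ch, 1),
    )


def bsconv_u(in_ch: int, out_ch: int, k: int = 3) -> nn.Sequential:
    # BSConv-U: 1x1 pointwise first, then depthwise over the output channels,
    # exploiting the high intra-kernel (depth-axis) correlation.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 1),
        nn.Conv2d(out_ch, out_ch, k, padding=k // 2, groups=out_ch),
    )


blocks = dsc(1, 64), bsconv_u(1, 64)  # same interface, opposite factorization order
```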

Figure 2. The structure of DSC and BSConv.
Drawing inspiration from these structures, we develop two convolutional subsampling modules, whose respective architectures are depicted in Figure 3. We refer to the convolutional subsampling based on BSConv-U as CS-BSCU and the one based on BSConv-S as CS-BSCS. The performance of these BSConv-based convolutional subsampling modules in the EEND system will be discussed in detail in Section 4. Assuming the input acoustic feature has length T and dimension F, denoted X ∈ R^(T×F), the feature undergoes SpecAugment before being passed to the convolutional subsampling layer. This novel convolutional subsampling process, embodied by BSConv, can be described as X′ = CS-BSC(X), where X ∈ R^(T×F), X′ ∈ R^(T′×F′), and T′ = T/10.
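A minimal sketch of a CS-BSCU-style layer is given below, assuming the tenfold time reduction stated above (T′ = T/10); the specific strides, kernel sizes, and channel counts are our assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class BSConvUSubsampling(nn.Module):
    """Hypothetical CS-BSCU-style subsampling: two BSConv-U stages
    (pointwise -> depthwise) reduce the time axis by 2 x 5 = 10."""

    def __init__(self, feat_dim: int = 80, out_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, out_dim, 1),                                  # pointwise
            nn.Conv2d(out_dim, out_dim, 3, stride=(2, 2), padding=1,
                      groups=out_dim),                                 # depthwise, time /2
            nn.ReLU(),
            nn.Conv2d(out_dim, out_dim, 1),                            # pointwise
            nn.Conv2d(out_dim, out_dim, 5, stride=(5, 2), padding=2,
                      groups=out_dim),                                 # depthwise, time /5
            nn.ReLU(),
        )
        self.proj = nn.Linear(out_dim * (feat_dim // 4), out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, F) acoustic features -> (B, T/10, out_dim)
        y = self.conv(x.unsqueeze(1))                  # (B, C, T/10, F/4)
        b, c, t, f = y.shape
        return self.proj(y.permute(0, 2, 1, 3).reshape(b, t, c * f))


# Example: 50 s of 10 ms frames (T = 5000), 80-dim Fbank -> (1, 500, 256)
out = BSConvUSubsampling()(torch.randn(1, 5000, 80))
```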

Conformer Block
The self-attention mechanism derived from the transformer effectively captures long-distance global context dependencies; however, it lacks local detail recognition. To model global and local features more directly and efficiently, the conformer model [14] combines CNN and transformer architectures, thereby better capturing the features necessary for speaker diarization. The principal components of the conformer encoder block are the multi-head self-attention (MHSA) and convolution modules. Following the MHSA, the convolution module consists of pointwise convolution, one-dimensional depthwise convolution, and a batch normalization layer after the convolution layer, which aids in training deep models.
The conformer encoder block's structure [14] diverges from the transformer [15] by incorporating two feedforward neural network (FFN) modules with half-step residual connections; the MHSA and convolution modules are sandwiched between these two FFNs, resembling a macaron structure. Mathematically, for the i-th conformer block with input feature E_{i−1}, the output feature E_i is calculated as follows:

Ẽ_{i−1} = E_{i−1} + (1/2) FFN(E_{i−1}),
E′_{i−1} = Ẽ_{i−1} + MHSA(Ẽ_{i−1}),
E″_{i−1} = E′_{i−1} + Conv(E′_{i−1}),
E_i = LayerNorm(E″_{i−1} + (1/2) FFN(E″_{i−1})).
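The following PyTorch sketch mirrors these four equations; it is a simplified stand-in (e.g., no relative positional encoding, standard `nn.MultiheadAttention`), not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class ConvModule(nn.Module):
    """Conformer convolution module: pointwise+GLU -> depthwise -> BN -> Swish -> pointwise."""

    def __init__(self, d: int, kernel: int = 31):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.pw1 = nn.Conv1d(d, 2 * d, 1)
        self.glu = nn.GLU(dim=1)
        self.dw = nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d)
        self.bn = nn.BatchNorm1d(d)
        self.act = nn.SiLU()
        self.pw2 = nn.Conv1d(d, d, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, d)
        y = self.norm(x).transpose(1, 2)                   # (B, d, T)
        y = self.glu(self.pw1(y))
        y = self.pw2(self.act(self.bn(self.dw(y))))
        return y.transpose(1, 2)


class ConformerBlock(nn.Module):
    def __init__(self, d: int = 256, heads: int = 4, ff: int = 1024, kernel: int = 31):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, ff), nn.SiLU(), nn.Linear(ff, d))
        self.att_norm = nn.LayerNorm(d)
        self.mhsa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.conv = ConvModule(d, kernel)
        self.ffn2 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, ff), nn.SiLU(), nn.Linear(ff, d))
        self.out_norm = nn.LayerNorm(d)

    def forward(self, e: torch.Tensor) -> torch.Tensor:   # e = E_{i-1}: (B, T, d)
        e = e + 0.5 * self.ffn1(e)                         # half-step FFN residual
        a = self.att_norm(e)
        e = e + self.mhsa(a, a, a, need_weights=False)[0]  # MHSA residual
        e = e + self.conv(e)                               # convolution residual
        e = e + 0.5 * self.ffn2(e)                         # half-step FFN residual
        return self.out_norm(e)                            # E_i


e_i = ConformerBlock()(torch.randn(2, 500, 256))           # (2, 500, 256)
```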

MFA Structure
The research in [44,45] demonstrated that low-level features can contribute to high-quality speaker feature extraction. Following this theory, the ECAPA-TDNN [46] system aggregates the output features of all SE-Res2 blocks before the final pooling layer, resulting in a significant performance improvement.
Similarly, in our approach, we concatenate the output features of each conformer block when extracting features for EEND speaker diarization and then pass them to a layer norm. Unlike [13], however, we eliminate the pooling layer after the layer norm, along with the batch norm layer and related structures, for two reasons. First, the EEND system's input audio typically contains multiple speakers, so pooling cannot yield a representative speaker feature. Second, pooling produces segment-level speaker embeddings, whereas we require frame-level speaker activity analysis, rendering pooling inapplicable. Therefore, we directly use a linear layer and a sigmoid function to output the posterior probability of frame-level speaker activity. The specific process can be described as

E = LayerNorm(Concat(E_1, ..., E_i)),    O = σ(Linear(E)),

where σ denotes the sigmoid function, E_1, ..., E_i ∈ R^(T′×d), T′ = T/10, d symbolizes the output dimension of the transformer or conformer encoder, E ∈ R^(T′×D), i represents the number of encoder blocks, D = d × i, and O is the matrix of frame-level speaker activity posteriors.
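A minimal sketch of this output head, assuming PyTorch (class and argument names are ours):

```python
import torch
import torch.nn as nn


class MFAHead(nn.Module):
    """Hypothetical MFA output head: concatenate block outputs, apply layer norm,
    then a linear layer and sigmoid for frame-level speaker posteriors."""

    def __init__(self, d: int = 256, num_blocks: int = 4, num_speakers: int = 2):
        super().__init__()
        big = d * num_blocks                      # D = d * i
        self.norm = nn.LayerNorm(big)
        self.linear = nn.Linear(big, num_speakers)

    def forward(self, block_outputs):             # list of i tensors, each (B, T', d)
        e = torch.cat(block_outputs, dim=-1)      # (B, T', D)
        return torch.sigmoid(self.linear(self.norm(e)))


probs = MFAHead()([torch.randn(1, 500, 256) for _ in range(4)])   # (1, 500, 2)
```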

Data Preparation
To train the EEND system, we employed the simulated data generation algorithm proposed in [16], which has been used in several studies [11,12,16,47,48]. First, we selected N speakers and, for each speaker, chose random speech segments. We then inserted silent intervals between these segments before splicing them together, yielding N long audio tracks. Each generated track was convolved with a room impulse response, and the N long recordings were mixed with a noise signal at a random signal-to-noise ratio. The silence duration between speech segments was drawn from an exponential distribution with parameter β; consequently, β can be used to control the speaker overlap rate of the generated audio, with a larger β resulting in lower overlap [16].
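As a small illustration of the gap-sampling step, the sketch below splices one speaker's utterances with exponentially distributed pauses (NumPy; the function name and defaults are ours, not from [16]):

```python
import numpy as np

rng = np.random.default_rng(0)


def splice_with_pauses(utterances, beta: float, sr: int = 8000):
    """Concatenate one speaker's utterances (1-D arrays), inserting silences
    whose durations (in seconds) follow Exp(beta); larger beta -> longer gaps
    -> lower overlap once the N speaker tracks are mixed."""
    pieces = []
    for utt in utterances:
        gap = int(rng.exponential(scale=beta) * sr)
        pieces.append(np.zeros(gap, dtype=utt.dtype))  # leading silence
        pieces.append(utt)
    return np.concatenate(pieces)


# Two hypothetical 1 s utterances at 8 kHz, spliced with beta = 2
track = splice_with_pauses([np.ones(8000, np.float32)] * 2, beta=2.0)
```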
The corpora used for simulation comprise Switchboard-2 (Phases I, II, and III), Switchboard Cellular (Parts 1 and 2), and the NIST Speaker Recognition Evaluations (2004, 2005, 2006, and 2008), all of which contain telephone speech sampled at 8 kHz. In total, these corpora contain 6381 speakers, split into 5743 speakers for the training set and 638 speakers for the test set. As these corpora lack temporal annotations, we followed [16] and extracted them using speech activity detection (SAD) based on a time-delay neural network (TDNN) with the Kaldi speech toolkit [49]. Noise data were obtained from 37 background noise segments of the MUSAN corpus [50], while the room impulse response (RIR) data came from the simulated RIR dataset used in [51], with 10,000 records selected. SNR values were chosen from 10, 15, and 20 dB. We generated mixed audio datasets with two and three speakers separately, using different β values to control the speaker overlap rate of the created mixtures [16]. Each speaker had 20-40 utterances; detailed information on the simulated data is presented in Table 1. We also selected audio containing two and three speakers from the real telephone conversation dataset CALLHOME to construct new datasets, referred to as CALLHOME-2spk and CALLHOME-3spk, respectively, and employed Kaldi's script to split each into two subsets. For CALLHOME-2spk, these subsets comprise a fine-tuning set of 155 recordings and a test set of 148 recordings, whereas for CALLHOME-3spk they comprise a fine-tuning set of 61 recordings and a test set of 74 recordings. Further details are available in Table 2.

Experimental Setup
For audio features, we follow the configuration of the original papers, utilizing 23-dimensional MFCCs for SA-EEND [11] and 80-dimensional Fbank features for CB-EEND [12] and our BSAC-EEND, with a frame length of 25 ms and a frame shift of 10 ms. For SpecAugment, we employed two frequency masks and two time masks, with each frequency mask covering up to two consecutive frequency channels and each time mask covering at most 120 consecutive time steps.
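A minimal sketch of this masking policy, assuming a (T, F) NumPy feature matrix (the function name and zero-fill are our assumptions; implementations sometimes mask with the mean instead):

```python
import numpy as np

rng = np.random.default_rng(0)


def apply_specaug(feats: np.ndarray, n_freq=2, max_f=2, n_time=2, max_t=120):
    """Two frequency masks of width <= max_f channels and two time masks of
    length <= max_t steps, each placed uniformly at random."""
    x = feats.copy()
    T, F = x.shape
    for _ in range(n_freq):                      # frequency masking
        w = rng.integers(0, max_f + 1)
        f0 = rng.integers(0, max(F - w, 0) + 1)
        x[:, f0:f0 + w] = 0.0
    for _ in range(n_time):                      # time masking
        w = rng.integers(0, max_t + 1)
        t0 = rng.integers(0, max(T - w, 0) + 1)
        x[t0:t0 + w, :] = 0.0
    return x


masked = apply_specaug(np.random.randn(5000, 80))
```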
For SA-EEND, we employed four transformer encoder blocks, each containing four attention heads. Each encoder produced 256-dimensional frame-by-frame embedding vectors, and the feedforward layer consisted of 1024 internal units. Positional encoding was not utilized. For CB-EEND and BSAC-EEND, we used four conformer encoder blocks, each containing four attention heads, likewise producing 256-dimensional frame-by-frame embeddings with a 1024-unit feedforward layer. Similarly, positional encoding was not used, and the convolution module's kernel size was set to 31. Furthermore, we applied the designed BSConv-based convolutional subsampling and MFA to CB-EEND for comparison.
When training on simulated data, we utilized the Adam optimizer [52] to update the network and the Noam scheduler [15] to adjust the learning rate, with 100,000 warm-up steps. The Adam optimizer with a fixed learning rate was used for adaptation on the real datasets. For efficient batching during training, the audio was divided into 50 s segments for the simulated data and 200 s segments for the adaptation set, with a training batch size of 64. During inference, the entire audio was processed by the network without segmentation.
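For reference, the Noam schedule scales the learning rate as sketched below, assuming d_model = 256 and the standard base scale (the exact scale factor used in the experiments is our assumption):

```python
def noam_lr(step: int, d_model: int = 256, warmup: int = 100_000) -> float:
    """Noam schedule: linear warm-up for `warmup` steps, then step^-0.5 decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# LR rises linearly for the first 100k steps, then decays as step^-0.5
print(noam_lr(1), noam_lr(100_000), noam_lr(400_000))
```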

We assessed performance using the diarization error rate (DER). Let T_Speech, T_MI, T_FA, and T_CF denote the total speech duration, missed speech duration, false-alarm speech duration, and speaker confusion duration, respectively; then

DER = (T_MI + T_FA + T_CF) / T_Speech. (10)

Evaluation was carried out on both the simulated datasets and the standard CALLHOME dataset, employing a collar tolerance of 0.25 s when comparing hypothesized and reference speaker boundaries. Notably, speaker overlap was not excluded from the evaluation.
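A trivial sketch of Equation (10), with durations in seconds (the values are illustrative only):

```python
def der(t_mi: float, t_fa: float, t_cf: float, t_speech: float) -> float:
    """Diarization error rate per Equation (10)."""
    return (t_mi + t_fa + t_cf) / t_speech


# Example: 3 s missed, 2 s false alarm, 1 s confusion over 60 s of speech -> 10%
print(f"DER = {der(3, 2, 1, 60):.1%}")
```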

Improved Convolutional Subsampling
To investigate the performance of the proposed enhanced convolutional subsampling layers, we incorporated the two convolutional subsampling structures depicted in Figure 3, substituting the corresponding components in CB-EEND. These improvements were assessed on simulated datasets featuring two and three speakers, as well as on the real datasets. For the simulated data evaluation, the training sets of Sim2spk and Sim3spk were employed to train the EEND model, which was subsequently tested on the corresponding test sets (outlined in Table 1). For the real dataset evaluation, the model was adapted using Part 1 and tested on Part 2.
Overall, within CB-EEND, CS-BSCS exhibits slightly better performance on the two-speaker datasets and slightly inferior results on the three-speaker datasets in comparison to the CS-BSCU layer. As discussed in Section 3.1, the convolutional subsampling based on BSConv outperforms the DSC-based approach, primarily because it handles the correlation within the convolution kernel more effectively, which facilitates the retention of more relevant information during the downsampling process.

MFA
We evaluated the performance of the MFA structure, as outlined in Section 3.3, within both the SA-EEND and CB-EEND frameworks. Employing the experimental settings described in Section 4.1, we implemented the MFA structure in CB-EEND and assessed its performance on simulated and real datasets containing two or three speakers. As in Section 5.1, for the simulated data evaluation we used the Sim2spk and Sim3spk training sets to train the EEND model and tested it on the respective test sets listed in Table 1; for the real data evaluation, we used Part 1 of the CALLHOME dataset for adaptation and tested on Part 2.
Table 4 presents the DER evaluation results. Compared with CB-EEND, incorporating the MFA structure yielded noticeable improvements on both simulated and real datasets. On the Sim2spk test set, relative improvements of 8.7% (β = 2), 8.3% (β = 3), and 4.5% (β = 5) were achieved, and a 3.3% improvement was observed on the CALLHOME-2spk set. Furthermore, the model attained relative improvements of 12.3% (β = 5), 13.3% (β = 7), and 17.3% (β = 11) on the Sim3spk test set, along with a 2.3% relative improvement on CALLHOME-3spk Part 2.

To better illustrate the impact of the MFA structure, we randomly selected a 50 s speech input from Sim2spk (β = 2) and reduced the high-dimensional features obtained before the linear layer to a two-dimensional space using principal component analysis (PCA). Figure 4 displays the visualization results for features extracted by SA-EEND, CB-EEND, and CB-EEND+MFA. Evidently, both CB-EEND and CB-EEND+MFA outperformed SA-EEND in discerning silence frames from single-speaker speech frames, with CB-EEND+MFA proving superior in differentiating overlapping speaker frames. This demonstrates that the model benefits substantially from this multi-scale representation convergence.
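The dimensionality reduction step is as simple as the sketch below (scikit-learn; the feature shape is a placeholder for the real pre-linear features):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for frame-level features of shape (T', D) collected before the
# linear layer; a 50 s input at T' = T/10 gives ~500 frames, D = 256 * 4.
feats = np.random.randn(500, 1024)
coords = PCA(n_components=2).fit_transform(feats)   # (T', 2) points for the scatter plot
```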

BSAC-EEND
In this section, we evaluate the proposed BSAC-EEND model, employing the two convolutional subsampling structures depicted in Figure 3. The evaluation covers both two- and three-speaker datasets, with simulated and real data for training and testing. The model's training and testing procedures are consistent with those described in the previous two sections.

Conclusions
In this study, we introduced a conformer-based EEND system with an MFA structure and a novel convolutional subsampling layer. Our experimental results demonstrate that the EEND system benefits from the convergence and fusion of multi-scale representations facilitated by the MFA structure and from the new convolutional subsampling layer. Although our system surpasses SA-EEND and CB-EEND in overall performance, we observed that even after fine-tuning on real datasets, the performance of the new EEND system on the CALLHOME dataset remains relatively limited, a shortfall potentially attributable to the domain mismatch between simulated training data and real test data [12,53]. As future work, we plan to incorporate transfer learning techniques into the EEND training process to improve performance on real test data. Regarding system architecture, the literature [54] has suggested that quantum convolutional neural networks may enhance speech signal processing by extracting more representative speech features; employing them in future downsampling processes is therefore a potential direction. Moreover, given the importance of privacy protection, particularly in audio information modeling [55], this emerges as another area warranting future research.

Figure 4. Visualization of the features extracted by each model.
¹ Mixtures: total number of audios.

Table 4. The evaluation results of the MFA structure (DER).