In our experiments, we adopted LibriMix [14], one of the popular benchmark datasets for speech separation tasks. LibriMix is generated by mixing audio files from LibriSpeech [23]. Among the LibriSpeech subsets, we used the 8 kHz versions of train-clean-360, dev-clean, and test-clean to generate the training, validation, and test sets, respectively. The original script [14] only provides “2mix” and “3mix” subsets, where two and three are the numbers of speakers. We modified the script to generate 5mix and 10mix subsets as well.
4.1. Training and Evaluation Details
We used 256 convolution filters with a kernel size of 16 and a stride of 8 samples. The masker network processes chunks of size 100 with 50% overlap. The stacking sizes for the SepFormer blocks and for the IntraT and InterT modules (listed in Table 1) total 48 transformer encoders. The expansion ratio is 2, generating 512 hidden channels in the convolution transformers. We used a batch size of 6, gradient clipping limiting the norm of the gradients to 5, and automatic mixed precision provided by the PyTorch 2.0 Automatic Mixed Precision package (The Linux Foundation, San Francisco, CA, USA). We trained the model with a varied number of speakers using dynamic mixing [24]: we randomly chose 2 to 5 sources from LibriSpeech audio files to generate 2mix, 3mix, 4mix, and 5mix samples, so a single model is trained on audio samples whose number of talkers varies from 2 to 5. When adding the audio files, their scales were adjusted so that the SNRs were uniformly distributed in [0 dB, 5 dB], and random speed perturbation was applied with factors drawn uniformly from [0.95, 1.05]. All actual values of the hyperparameters are listed in Table 1.
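As a rough illustration of this dynamic mixing pipeline, the sketch below draws 2 to 5 LibriSpeech sources per training sample, applies random speed perturbation in [0.95, 1.05], and rescales the sources so their relative levels span the [0 dB, 5 dB] range. The function name and the exact normalization are our assumptions, not the authors' code.

```python
import random

import torch
import torchaudio

def dynamic_mix(source_files, seg_len=4 * 8000):
    """Generate one n-mix training sample on the fly (n drawn from 2..5)."""
    n_spk = random.randint(2, 5)                     # 2mix ... 5mix
    picked = random.sample(source_files, n_spk)

    sources = []
    for path in picked:
        wav, sr = torchaudio.load(path)
        wav = wav[0, :seg_len]                       # mono, fixed-length segment
        # Random speed perturbation drawn uniformly from [0.95, 1.05].
        factor = random.uniform(0.95, 1.05)
        wav, _ = torchaudio.functional.speed(wav, sr, factor)
        # Rescale so that the per-source levels span [0 dB, 5 dB] uniformly.
        snr_db = random.uniform(0.0, 5.0)
        gain = 10.0 ** (-snr_db / 20.0)
        sources.append(gain * wav / (wav.norm() + 1e-8))

    length = min(s.shape[-1] for s in sources)       # align lengths after perturbation
    sources = torch.stack([s[:length] for s in sources])
    mixture = sources.sum(dim=0)
    return mixture, sources, n_spk
```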
For evaluation, we generated 3000 samples for each of the 2mix, 3mix, 5mix, and 10mix cases. The separation experiments are carried out under two conditions, denoted as a known or unknown number of speakers. In the known condition, the true speaker count is given with each evaluation sample. In the unknown condition, the true speaker count is not given, and the model must cope with such cases on its own. For models that only support a fixed speaker count, only known-condition performances are reported. For known-condition cases in the proposed method, the sequential extraction step is repeated as many times as the given number of speakers, whereas for unknown-condition cases, the extraction step is repeated until the stopping criterion in Algorithm 1 terminates the iteration. Both threshold values are set to the same value, listed in Table 1.
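For concreteness, a minimal sketch of this recursive extraction loop is given below, assuming a model that extracts one source per call and two scalar thresholds for the extracted speech and the residual; the names (`eps_speech`, `eps_residual`) are illustrative, and the stop rule follows Algorithm 1 only loosely.

```python
import torch

@torch.no_grad()
def separate_unknown(model, mixture, eps_speech, eps_residual, max_steps=12):
    """Recursive one-at-a-time extraction with a residual-based stop rule."""
    residual = mixture
    sources = []
    for _ in range(max_steps):
        est = model(residual)                   # extract one source
        if est.pow(2).mean() < eps_speech:      # extracted signal is (near) silence
            break
        sources.append(est)
        residual = residual - est               # deflate the mixture
        if residual.pow(2).mean() < eps_residual:
            break                               # nothing meaningful remains
    return sources
```

In the known condition, the loop simply runs for the given number of speakers instead of checking the thresholds.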
Figure 6 shows the trajectory of the residual signal powers. The x-axis is the number of steps in the recursive source extraction, and the y-axis is the mean squared power of the residual signals. There are four lines, for the 2mix, 3mix, 4mix, and 5mix LibriMix samples. The mean squared power of the residual signal is computed as in line 9 of Algorithm 1, and the plotted values are averaged over the test samples. For 2/3/5mix, the power falls below the threshold at steps 2, 3, and 5, respectively. For 10mix, it stays above the threshold, because 10mix mixtures are not included in the training samples; however, it comes close to the threshold, so most of the samples are still predicted successfully.
Most speech separation methods do not guarantee an optimal permutation. Therefore, the SI-SDR improvements are measured by matching the indexes of the extracted sources with those of the true sources in the evaluation set. In the sequential extraction methods, the source extracted at the current step is excluded from the next step. In unknown conditions, we penalize under- and over-extraction by scoring each wrongly estimated signal with an SI-SDR improvement of 0, where under-extraction means that the model generates fewer signals than the target and over-extraction means the opposite.
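A sketch of this evaluation protocol, under our assumptions about the matching step, is shown below: estimated and true sources are matched with the Hungarian algorithm on negative SI-SDR, and each signal left unmatched by under- or over-extraction contributes an improvement of 0 dB.

```python
import torch
from scipy.optimize import linear_sum_assignment

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB for 1-D tensors."""
    ref = ref - ref.mean()
    est = est - est.mean()
    alpha = torch.dot(est, ref) / (ref.pow(2).sum() + eps)
    target = alpha * ref
    noise = est - target
    return 10.0 * torch.log10(target.pow(2).sum() / (noise.pow(2).sum() + eps))

def matched_si_sdr_improvement(ests, refs, mixture):
    """Best-permutation SI-SDR improvement; count mismatches score 0 dB."""
    cost = torch.zeros(len(ests), len(refs))
    for i, e in enumerate(ests):
        for j, r in enumerate(refs):
            cost[i, j] = -si_sdr(e, r)          # maximize SI-SDR via min cost
    rows, cols = linear_sum_assignment(cost.numpy())
    scores = [
        (si_sdr(ests[i], refs[j]) - si_sdr(mixture, refs[j])).item()
        for i, j in zip(rows, cols)
    ]
    # Under-/over-extraction: every unmatched signal contributes 0 dB.
    scores += [0.0] * (max(len(ests), len(refs)) - len(rows))
    return sum(scores) / max(len(scores), 1)
```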
4.2. Speech Separation Results
Table 2 shows the evaluation results with a known number of speakers. We compared the proposed method with various conventional speech separation methods: the fully convolutional time-domain audio separation network (Conv-TasNet) [5,14], the separation transformer (SepFormer) [7], SepFormer with a pretrained diffusion model (SepFormer + DiffWave) [25], optimal permutation-invariant training (HungarianPIT) [26], iterative separation (SepIt) [27], the dual-path recurrent neural network (DPRNN) [17], and one-and-rest permutation-invariant training (OR-PIT) [12]. The SI-SDR improvements in Table 2 are taken from the original papers on the same LibriMix dataset. In the second column, single task means that the number of speakers is fixed in the model, so the model must be trained with a fixed number of speakers as well. The training column shows the training data types required by each method. The separation performances are measured by SI-SDR improvements in dB on the Libri-nMix datasets with a varying number of speakers; the number of speakers is given in both the training and evaluation phases. Conv-TasNet and SepFormer are trained on 2mix and 3mix, so experimental results are available for 2mix and 3mix only. SepFormer + DiffWave shows the best 2mix result among all methods, but there are no 3mix, 5mix, or 10mix results because the model is configured to support 2mix only. HungarianPIT and SepIt provide 5mix and 10mix results, and DPRNN has 2/3/5mix results. Among the single-task methods, the best 2mix result is 21.5 dB for SepFormer + DiffWave, the best 3mix result is 18.7 dB for SepFormer, and the best 5mix and 10mix results are 13.7 dB and 8.2 dB for SepIt. No single method is applicable across all speaker-count conditions. The multi-task separation methods can deal with varying numbers of speakers using a single model. We implemented OR-PIT [12] with minimal modifications to support a varying number of mixtures. It is trained on the 2/3/4/5mix dataset and provides separation results for the 2/3/5/10mix evaluation sets. Although no 10mix training set is given, the trained model can still be applied to the 10mix evaluation set. Compared to the single-task methods, its SI-SDR results are 4.3 dB and 2.2 dB lower for 2mix and 3mix and 0.4 dB and 0.2 dB higher for 5mix and 10mix. The proposed method, the deflationary extraction transformer (DExFormer in Table 2), also supports multiple tasks. A single DExFormer model is trained on the 2/3/4/5mix dataset, and separation results for the 2/3/5/10mix evaluation sets are obtained. For the 2mix and 3mix cases, the SI-SDR improvements are 3.2 dB and 1.0 dB lower than those of SepFormer and SepFormer + DiffWave because the proposed method is optimized for more general and realistic cases. In contrast, the proposed method outperforms SepIt for the 5mix and 10mix cases by 2.2 dB and 1.5 dB, respectively. Given the number of speakers, the proposed method mostly outperforms the conventional methods, especially when there are more than three speakers. OR-PIT and the proposed method extract speakers one by one, so extraction errors unavoidably accumulate; therefore, their performance gradually drops as the number of speakers grows. In contrast, the other single-task methods use different models for different numbers of speakers, so no error accumulation is expected. Comparing the 5mix SI-SDR improvements, HungarianPIT, SepIt, and DPRNN show 12.7 dB, 13.7 dB, and 8.7 dB, respectively, while the extraction-based methods, OR-PIT and the proposed method, show 14.1 dB and 15.9 dB, respectively. Similar observations can be made for the 10mix cases, although the improvements are smaller than in the 5mix cases. These improvements can be explained as follows:
The multi-task methods share the same separation/extraction module for all numbers of speakers, and the shared module acts like a pretrained model when training with mixtures containing a higher number of speakers.
The benefit of model sharing far outweighs the error accumulation, as shown by the comparisons with the single-task methods in the 5mix and 10mix cases.
Moreover, the performance achieved in the 10mix case is impressive since 10mix samples are not included in the training dataset.
Table 3 shows the evaluation results with an unknown number of speakers. We chose gated LSTM [15] and gated LSTM with a pretrained diffusion model [25] as the state-of-the-art methods for multiple-speaker separation. Since the number of speakers is unknown, only multi-task models are applicable. The separation performances are measured by SI-SDR improvements in dB on the Libri-nMix evaluation datasets with a varying number of speakers; the number of speakers is given in the training phase but not in the evaluation phase. We added a sequence termination criterion module to OR-PIT (OR-PIT + STC) and to the proposed DExFormer (DExFormer + STC). A single model each for OR-PIT + STC and DExFormer + STC is trained on the 2/3/4/5mix dataset, and separation results for the 2/3/5/10mix evaluation sets are obtained. In terms of SI-SDR improvements, DExFormer outperforms OR-PIT with the SepFormer block by 1.0 dB, 1.3 dB, 1.8 dB, and 1.4 dB for the 2/3/5/10mix evaluation sets, respectively.
Based on the known- and unknown-condition experimental results, the proposed DExFormer architecture is shown to improve the separation performance by 1.0 dB to 1.8 dB over SepFormer with OR-PIT. Comparing the results of DExFormer across varying numbers of speakers in Table 3, the proposed DExFormer shows very little performance degradation except for 10mix. The evaluation results of the gated LSTM models are provided for 5mix and 10mix only in their original papers. The SI-SDR improvements of the proposed method are 1.5 dB higher for the 5mix evaluation and 0.1 dB lower for 10mix than those of gated LSTM + DiffWave. Even without a given number of speakers, the proposed method still outperforms the conventional state-of-the-art methods. Comparing Table 2 and Table 3, the performance drop is at most 0.2 dB for 2/3/5mix. For 10mix, it is 0.8 dB, because more errors are expected with a larger number of speakers in the mixed recordings. Overcoming this degradation in many-speaker speech separation could be a future research direction.
4.3. Analysis of the Proposed Sequence Termination Criterion
For unknown conditions, predicting the true number of speakers is crucial to successful speaker separation. We carried out comprehensive analyses of the prediction results of the proposed STC. For each evaluation sample, the predicted speaker count is obtained as the number of iterations performed before the extraction sequence terminates, and it is then compared to the true count.
Table 4 shows a confusion matrix for the prediction of speaker counts. Each column lists the number of evaluation samples with the true speaker counts given by the digits 2 to 5, and each row lists the number of samples with the predicted speaker counts given by the digits 2 to 6. The diagonal elements are the numbers of samples with correct predictions, and the off-diagonal ones are the numbers of samples with incorrect predictions. There are no samples with a true count of 6 because only the 2/3/4/5mix datasets are used in training, so the elements in row 6 are all incorrect.
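A confusion matrix of this kind can be assembled directly from the per-sample true and predicted counts, e.g., as in the following sketch (variable names are ours):

```python
import numpy as np

def speaker_count_confusion(true_counts, pred_counts, lo=2, hi=6):
    """Rows: predicted counts (2..6), columns: true counts (2..6)."""
    n = hi - lo + 1
    cm = np.zeros((n, n), dtype=int)
    for t, p in zip(true_counts, pred_counts):
        p = min(max(p, lo), hi)                  # clip out-of-range predictions
        cm[p - lo, t - lo] += 1
    return cm
```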
From the confusion matrix, we computed the precision and recall rates for each case. The precision rate is calculated by $N_{TP}/(N_{TP}+N_{FP})$, where $N_{TP}$ and $N_{FP}$ stand for the numbers of true positives and false positives, respectively. The computed precision rates are shown in the last column of Table 4. From 2 to 5 speakers, they are 99.8%, 98.0%, 95.6%, and 98.2%, respectively. In Table 4, 200 samples from five speakers are predicted as six speakers, and they are all false positives; the number of true positives is 0 because there is no sample with six speakers. Therefore, the precision for six speakers is $0/(0+200) = 0\%$. This shows that the proposed STC algorithm predicts speaker counts effectively in most cases. Although there is a drop of more than 2% in the precision rate at four speakers, the error rate remains below 5%. The recall rate is defined by $N_{TP}/(N_{TP}+N_{FN})$, where $N_{FN}$ stands for the number of false negatives. The computed recall rates are, from 2 to 5 speakers, 99.7%, 99.1%, 96.8%, and 89.4%, respectively. There are no samples with a true speaker count of six, so both $N_{TP}$ and $N_{FN}$ are 0, and recall is undefined in this case. F1 scores are computed by $2 \cdot \mathrm{precision} \cdot \mathrm{recall}/(\mathrm{precision} + \mathrm{recall})$, and they are listed in the bottom row. The highest score of 99.7% is achieved for two speakers, and the lowest score of 93.6% is achieved for five speakers.
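These per-class rates follow directly from the confusion matrix; a small sketch (with rows as predicted counts and columns as true counts, matching Table 4) might look as follows:

```python
import numpy as np

def prf_from_confusion(cm):
    """Per-class precision, recall, and F1 from a square confusion matrix."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=1) - tp                     # predicted class k, true class != k
    fn = cm.sum(axis=0) - tp                     # true class k, predicted class != k
    precision = tp / np.maximum(tp + fp, 1.0)
    recall = tp / np.maximum(tp + fn, 1.0)       # guard returns 0 where undefined
    f1 = 2.0 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision, recall, f1
```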
We compared the proposed STC with other methods. Table 5 shows the accuracy of binary decisions for speech or noise, together with the multi-class classification accuracy values for speaker counts. The results are provided by OR-PIT [12]. The binary classifier is an AlexNet-like convolutional neural network [28]. If the classifier's output is noise, this implies that no speech remains, and OR-PIT stops the recursion in Equation (3). The multi-class classifier is also implemented with AlexNet, and it determines how many speakers are mixed. Its accuracy is much lower than that of the binary classifier. The role of the binary classifier is identical to that of the STC in the proposed method. Its confusion matrix is not given in [12], but its accuracy on the WSJ0-2mix and WSJ0-3mix datasets [29] is reported. Although a one-to-one comparison is not feasible, the proposed STC shows F1 scores of 93.6% to 99.7% on the LibriMix dataset, as shown in Table 4, so its performance is on par with that of the binary classifier of OR-PIT.
Table 6 shows the results of speaker count detection with DPRNN [17] and Gated LSTM [15]. The speaker count detection experiments are carried out on the WSJ0-2/3/4/5mix datasets [15,29]. The experimental results are given as confusion matrices in percentages only, and sample counts are not available. For a mixed sample, the average power of each output channel is computed and checked against a predefined, fixed threshold; the speaker count is determined as the number of channels whose power is above the threshold. The confusion matrices in Table 6 are obtained by comparing the true number of speakers with the predicted speaker count for each sample. The diagonal elements of the confusion matrices are the ratios of correctly counted samples to the total number of samples, which are identical to recall rates. The recall rates of DPRNN and Gated LSTM in Table 6 are much lower than the 99.7%, 99.1%, 96.8%, and 89.4% of the proposed DExFormer shown in Table 4. Therefore, the prediction accuracy of the proposed method is much higher than those of DPRNN and Gated LSTM. In Table 6, there are more errors in the lower triangular part, where the predicted speaker counts are larger than the actual number of speakers; for example, a substantial percentage of samples with a true speaker count of four are predicted as five by both DPRNN and Gated LSTM. The separation modules are trained such that if the predicted speaker count is higher than the actual number of speakers, silent outputs are generated for the extra channels [15]. Thus, the low prediction accuracies do not have a significant effect on the output SI-SDR.
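For reference, the fixed-threshold counting rule described above can be sketched as follows; the threshold value here is an arbitrary placeholder, not the one used in [15] or [17].

```python
import torch

def count_speakers_by_power(outputs, threshold=1e-3):
    """Count output channels whose average power exceeds a fixed threshold."""
    channel_power = outputs.pow(2).mean(dim=-1)  # shape: (channels,)
    return int((channel_power > threshold).sum())
```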
In summary, the proposed STC achieves very high accuracy in predicting the number of speakers compared with other methods. It requires two learnable threshold values, for the speech and residual signals, which are trained together with the speech separation parameters. OR-PIT [12] uses a binary classifier implemented with convolutional neural networks [28], and our STC achieves similar prediction performance to that of the more complex OR-PIT classifier. Gated LSTM [15] also adopts a simple thresholding detector, but without learning, and our proposed method shows much higher prediction accuracies. In spite of the huge difference in prediction accuracy, the difference in the SI-SDR improvements shown in Table 3 is relatively small, because Gated LSTM does not require the exact number of speakers when separating multiple speech signals. However, for applications where the exact speaker count is required, the proposed STC should be beneficial.