Article

SuperFormer: Enhanced Multi-Speaker Speech Separation Network Combining Channel and Spatial Adaptability

1 School of Software, Liaoning Technical University, Huludao 125105, China
2 Suzhou Automotive Research Institute, Tsinghua University, Suzhou 215100, China
3 Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(15), 7650; https://doi.org/10.3390/app12157650
Submission received: 21 April 2022 / Revised: 26 July 2022 / Accepted: 26 July 2022 / Published: 29 July 2022
(This article belongs to the Special Issue Novel Methods and Technologies for Intelligent Vehicles)

Abstract

Speech separation is a hot topic in multi-speaker speech recognition. Modelling the long-term autocorrelation of speech signal sequences is essential for speech separation. The keys are learning the intra-speaker autocorrelation of each speaker's speech effectively, modelling the local (intra-block) and global (intra- and inter-block) dependence features of the speech sequence, and separating in real time with as few parameters as possible. In this paper, the local and global dependence features of the speech sequence are extracted by different Transformer structures. A forward adaptive module of channel and spatial autocorrelation is proposed to give the separation model good channel adaptability (channel-adaptive modelling) and spatial adaptability (spatial-adaptive modelling). In addition, at the back end of the separation model, a speaker enhancement module is introduced to further enhance or suppress the speech of different speakers by taking advantage of the mutual suppression characteristics of the source signals. Experiments on the public corpus WSJ0-2mix show that the proposed separation network achieves better scale-invariant signal-to-noise ratio improvement (SI-SNRi) than the baseline models. The proposed method can provide a solution for speech separation and speech recognition in multi-speaker scenarios.

1. Introduction

In some multi-speaker scenes, people can struggle to distinguish and recognize an individual voice within a mixture of multiple voices. Separating the mixed speech of multiple speakers with as little mutual interference as possible through speech separation technology has long been a focus of research in the field of speech signal processing. Although many separation methods have been proposed, existing speech separation models still leave much to be desired [1,2,3].
The same speaker’s speech features have a certain autocorrelation in time, particularly over long speech sequence vectors. Therefore, effectively modeling this autocorrelation in the mixed speech signal is essential for speech separation technology.
In recent years, data-driven deep learning methods have achieved success in the speech separation task [1,2,3,4]. In the Conv-TasNet model proposed by Luo Y et al. [2], one-dimensional dilated convolution blocks are stacked to form a temporal convolutional network (TCN) [5] as the speech separation model, which focuses on the long-term intra-autocorrelation of speech features.
Luo Y et al. [3] later proposed the DPRNN model, which leverages a dual-path RNN to address a shortcoming of the TasNet family: it is difficult to model the overall features of speech when the receptive field of a convolutional network is smaller than the sequence length [1]. The SepFormer model proposed by Subakan C et al. [4] is a variant of DPRNN. Its main advantage is that the self-attention in multi-head attention can model long-range dependencies while supporting parallel computation. In reference [4], the SepFormer speech separation model adopts a dual-path Transformer structure. The Transformer [6] structure performs well in extracting long-term autocorrelation features of speech and also shows good adaptability to the data space. However, this paper argues that the Transformer does not pay attention to the speaker-related characteristics contained in the channel features.
The purpose of this paper is to test the above hypothesis. To address the above problems, this paper puts forward a SUPERior transFORMER neural network (SuperFormer), which makes the following three main contributions.
  • In mixed speech, the features of the same speaker’s speech exhibit autocorrelation and similarity, while the speech of different speakers is mutually suppressive. To better utilize the autocorrelation of mixed speech, this paper designs a forward adaptive module to extract channel and spatial adaptive information.
  • Combining the dependence of channel and space, intra-SuperFormer and inter-SuperFormer are designed based on the Transformer module to fully learn the local and global features of mixed speech.
  • A speaker enhancement module is added to the back end of SuperFormer to take full advantage of the heterogeneous information between different speakers. The speech separation network enhances the speaker’s speech according to these features to further improve the performance of the model.
In order to prove the effectiveness of the proposed method, we performed experimental verification on the public corpus WSJ0-2mix and analyzed the results. The rest of this paper is organized as follows. The details of SuperFormer are described in Section 2. The analysis and visualization of the experiments are presented in Section 3, and our conclusions are given in the final section.

2. Methods

2.1. Problem Definition

Assume that the observed mixed speech X(t) = [X1(t), X2(t), …, Xn(t)]T is composed of n source signals S(t) = [S1(t), S2(t), …, Sn(t)]T, where T denotes the matrix transpose. The speech separation task attempts to find W to recover the source signals, as shown in Formula (1):
X(t) = WS(t),
where W can be regarded as a filter like a weight matrix between the mixed signal and the real label.
This paper proposes a neural network separation framework to construct the prediction Ŝ(t) of each source signal S(t) from a given X(t), where Ŝ(t) is the output of the separation network, as shown in Formula (2):
Ŝ(t) = VX(t),
Here, the network model is parameterized by the matrix V. The closer V converges to W−1, the more accurate the model output becomes, and the closer the predicted speech Ŝ(t) obtained from X(t) is to the real speech S(t).
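As a toy illustration of this formulation, the following sketch uses hypothetical sinusoidal stand-ins for the source signals to show how a single-channel mixture X(t) arises from S(t) and what the separation network is asked to recover:

# Toy illustration of the mixing model; s1 and s2 are hypothetical stand-ins
# for two speakers' waveforms, not real speech data.
import numpy as np

t = np.arange(0, 1.0, 1.0 / 8000)              # 1 s of samples at 8 kHz
s1 = 0.5 * np.sin(2 * np.pi * 220 * t)         # stand-in for speaker 1
s2 = 0.5 * np.sin(2 * np.pi * 330 * t)         # stand-in for speaker 2
S = np.stack([s1, s2])                         # S(t), shape (n, T_samples)

W = np.array([[1.0, 1.0]])                     # single-channel mixing weights
X = W @ S                                      # observed mixture X(t), shape (1, T_samples)

# The separation network plays the role of V ≈ W^-1: given only X,
# it must produce estimates whose rows are close to those of S.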

2.2. Model Design

The proposed separation model includes three parts: encoder, separation network and decoder. The block diagram of the speech separation system is shown in Figure 1.

2.2.1. Encoder

The encoder consists of a 1-D convolution and a ReLU activation function. The encoder takes the mixed speech x ∈ ℝ^(1×L) as input and maps it to a high-dimensional feature space. The feature e of the mixed speech x is then obtained through the ReLU activation function, as expressed in Formula (3):
e = ReLU(conv1d(x)).
where conv1d(·) represents 1-D convolution.
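A minimal PyTorch sketch of such an encoder is given below; the channel count, kernel size and stride follow the settings reported in Section 3.2.1, but the remaining implementation details are assumptions:

import torch
import torch.nn as nn

class Encoder(nn.Module):
    # 1-D convolution followed by ReLU, mapping a waveform (batch, 1, L)
    # to a feature map e of shape (batch, 256, L').
    def __init__(self, out_channels=256, kernel_size=8, stride=4):
        super().__init__()
        self.conv = nn.Conv1d(1, out_channels, kernel_size, stride=stride, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv(x))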

2.2.2. Separation Network

The separation network is the vital part of the model. It separates the features of the mixed speech and estimates the predicted values {e1, …, eN} of the N speakers in the mixed speech, where N is the number of known speakers. The system flowchart of the separation network is shown in Figure 1.
1. Separation Network
The feature e output by the encoder is input into the separation network. First, normalization is performed: the original data are linearly transformed and scaled to the interval [0, 1], which speeds up gradient descent and the search for the optimal solution. Then, a 1-D convolution is applied to reduce the channel dimension and integrate the features of different data channels. Our research has also found that the computational complexity of the self-attention module is proportional to the square of the input sequence length.
As mentioned in reference [4], processing smaller blocks in a dual-path framework mitigates the quadratic complexity of Transformers. In order to speed up the calculation of the attention mechanism module, we also segmented the input speech data and employed the dual-path framework. Assuming the length of the speech sequence is L, it is divided into S blocks of length K with a sliding step of K//2 between blocks, and the resulting feature sequence is expressed as e(1).
When the length of the last block of the sequence is less than K, padding is used to fill it. This reduces the cost of the matrix operations from the original O(L²) to O(K²) and O(S²).
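The segmentation step can be sketched as follows; the padding rule shown here is an assumption chosen so that every block has length K, and it may differ from the authors' exact implementation:

import torch
import torch.nn.functional as F

def segment(e, K=250):
    # e: (batch, channels, L) -> overlapping blocks of shape (batch, channels, K, S)
    B, C, L = e.shape
    hop = K // 2
    pad = (K - L) if L < K else (hop - (L - K) % hop) % hop   # pad the tail block
    e = F.pad(e, (0, pad))
    blocks = e.unfold(dimension=-1, size=K, step=hop)         # (B, C, S, K)
    return blocks.permute(0, 1, 3, 2).contiguous()            # (B, C, K, S)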
The SuperFormer Block applies modules with different structures to model the dependence along the block length K (intra-block) and the block number S (inter-block) after fusing the channel and spatial autocorrelation information of the speech sequence blocks. The module first extracts channel features of the speech sequence blocks and then uses a Sigmoid classification function to calculate the probability that each feature belongs to a given speaker.
This classification probability is called the autocorrelation weight; the input before feature extraction is multiplied by this weight, and the result is called the channel autocorrelation information. In the same way, the spatial features of the speech sequence blocks are further extracted from the channel autocorrelation information.
Then, the spatial autocorrelation information is obtained by classification, and the fused channel and spatial adaptive information is obtained. In this paper, the intra-block dependency information is referred to as the local dependency, which is then fused with the inter-block dependency information to form the global dependency. Overall, the model combines the local and global adaptive characteristics of the channel and space of the speech blocks to improve separation performance.
After the e(2) output by the SuperFormer Block is activated by PReLU, a 2D convolution is applied to obtain e(3), and the S speech blocks of length K are then merged back to the original length to obtain e(4). This feature is input into the forward feedback wise network (FFW), where different activation functions are combined to control the feature output. The calculation process is shown in Formulas (4)–(7), and the internal design of the FFW is shown in Figure 1.
At the end of the separation network, a speaker enhancement module is designed, which applies a classification function to the speaker features to distinguish different speakers. We believe that this classification probability is precisely derived from the autocorrelation information of the speaker features. Thus, the module can enhance or suppress the features of each speaker and finally obtains the mask eN (N = 1, 2, …) of each estimated source speech signal. The details of the speaker enhancement module are described later in this section.
e(3) = Conv2d(PReLU(e(2))),
e(4) = OverlapAdd(e(3)),
where OverlapAdd(·) denotes the process of adding small blocks of overlap back into the sequence.
The FFW is shown in Formula (6):
e(4) = Conv1d(Tanh(Conv1d (e(4))) + Sigmoid(Conv1d (e(4)))),
Finally, we split e(4) into N speaker features spk1, spk2, …, spkN, as Formula (7):
(spk1, spk2, …, spkN) = (e(4)[0], e(4)[1], …, e(4)[N−1]).
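A rough PyTorch sketch of the gated FFW of Formula (6) and the speaker split of Formula (7) is shown below; interpreting the split as a chunking of the channel dimension is our assumption, and all layer sizes are placeholders:

import torch
import torch.nn as nn

class FFW(nn.Module):
    def __init__(self, channels=128, num_speakers=2):
        super().__init__()
        self.branch_tanh = nn.Conv1d(channels, channels, 1)
        self.branch_sig = nn.Conv1d(channels, channels, 1)
        self.out = nn.Conv1d(channels, channels * num_speakers, 1)
        self.num_speakers = num_speakers

    def forward(self, e4):                                   # e4: (batch, channels, L)
        # Formula (6): Tanh and Sigmoid branches are summed, then projected.
        gated = torch.tanh(self.branch_tanh(e4)) + torch.sigmoid(self.branch_sig(e4))
        e4 = self.out(gated)                                 # (batch, channels * N, L)
        return torch.chunk(e4, self.num_speakers, dim=1)     # (spk1, ..., spkN), Formula (7)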
2. SuperFormer Block
Figure 2 shows the detailed structure of the SuperFormer Block, which includes the Intra-SuperFormer and the Inter-SuperFormer. After segmentation, the local and global dependencies of the fused channel and spatial adaptive features are extracted using a dual-path framework. The Intra-SuperFormer module models the local channel dependence along the K-dimension of e(1); its output is then added to the original input e(1) to form a residual connection, which is fed to the Inter-SuperFormer module. The Inter-SuperFormer module then models the spatial global dependence across the sequence blocks along the S-dimension of z(1). The overall definition of the SuperFormer Block is given in Formula (8):
e(2) = finter(P(fintra(e(1)) + e(1))) + z,
where fintra(·) and finter(·) represent the Intra-SuperFormer and Inter-SuperFormer, respectively. P is the permute function used to rearrange the data dimensions, that is, to exchange the K and S dimensions.
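In pseudocode, and consistent with Formulas (12)–(19) below, the block can be read as the following sketch, where f_intra and f_inter stand for the Intra- and Inter-SuperFormer:

def superformer_block(e1, f_intra, f_inter):
    # e1: (batch, channels, K, S)
    z = f_intra(e1)                            # local modelling along K (Formula (12))
    z1 = z + e1                                # first residual connection (Formula (15))
    z2 = f_inter(z1.permute(0, 1, 3, 2))       # P(.): swap K and S, then model along S
    e2 = z + z2.permute(0, 1, 3, 2)            # second residual connection (Formula (19))
    return e2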
  • Forward Adaptive Module
In order to enhance the feature learning ability of the separation network, a module is designed to improve the adaptability of the model: the Forward Adaptive Module, which has the same structure in the Intra-SuperFormer and Inter-SuperFormer, as shown in Figure 3. The adaptive module connects an MLP with Depth-wise Convolution (Dw-Conv2d) and Depth-wise Group Convolution (Dw-G-Conv2d), which helps the model acquire corresponding knowledge from the transformed data structure [7,8]; as an expression of the local and global features of the speaker’s speech, it improves the adaptability of the model. The specific calculation process is shown in Formulas (9)–(11).
e1(1) = Sigmoid(MLP(e(1))) · e(1),
e2(1) = Sigmoid(Norm(Dw-Conv2d(PReLU(Norm(Dw-Conv2d(e1(1))))))) · e1(1),
e(1)_out = (MLP(Norm(Dw-G-Conv2d (e2(1)))))N,
where · denotes the element-wise (dot) product.
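The following PyTorch sketch illustrates one plausible realisation of Formulas (9)–(11); the kernel sizes, normalisation layers and the group count of the depth-wise group convolution are assumptions, not the authors' exact design:

import torch
import torch.nn as nn

class ForwardAdaptive(nn.Module):
    def __init__(self, channels=128, groups=4):
        super().__init__()
        self.mlp_in = nn.Sequential(nn.Linear(channels, channels), nn.ReLU(),
                                    nn.Linear(channels, channels))
        self.dw1 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)   # depth-wise conv
        self.dw2 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)   # depth-wise conv
        self.dwg = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)     # depth-wise group conv
        self.norm1, self.norm2, self.norm3 = (nn.GroupNorm(1, channels) for _ in range(3))
        self.prelu = nn.PReLU()
        self.mlp_out = nn.Linear(channels, channels)

    def forward(self, e):                                   # e: (batch, channels, K, S)
        # Channel adaptation (Formula (9)): a channel-wise gate produced by an MLP.
        w = torch.sigmoid(self.mlp_in(e.permute(0, 2, 3, 1))).permute(0, 3, 1, 2)
        e1 = w * e
        # Spatial adaptation (Formula (10)): depth-wise convolutions produce a spatial gate.
        s = self.norm2(self.dw2(self.prelu(self.norm1(self.dw1(e1)))))
        e2 = torch.sigmoid(s) * e1
        # Output projection (Formula (11)); in the paper this stack is repeated several times.
        out = self.mlp_out(self.norm3(self.dwg(e2)).permute(0, 2, 3, 1))
        return out.permute(0, 3, 1, 2)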
  • Intra-Transformer and Inter-Transformer
The main difference between the Intra-SuperFormer and Inter-SuperFormer lies in the Feed-Forward Network (FFN) of the Intra-Transformer and Inter-Transformer. In the Intra-Transformer, a 1-D convolution and a temporal convolutional network (TCN) module are designed to extract the autocorrelation features within a block; the internal structure of the TCN module is illustrated in Figure 4. In the Inter-Transformer, a 1-D convolution and a Bi-directional Long Short-Term Memory (Bi-LSTM) network are used to extract long-distance global features.
e(1) is input into the Intra-SuperFormer module for channel autocorrelation calculation. The Forward Adaptive Module is applied with a residual connection, and the Intra-Transformer (IntraT) module is repeated M (M > 0) times to deeply extract the features inside each sequence block. Finally, the channel feature z is obtained by layer normalization, as shown in Formula (12):
z = Norm(IntraT(FA (e(1)) + e(1))M),
where FA (·) denotes the forward adaptive module.
The specific definition of IntraT is shown in Formula (13):
IntraT = Dropout (FFNintra (Norm (MHA (Norm (·)) + ·))),
where · represents the original input information of the module. MHA (·) is Multi-Head Attention. Norm (·) represents the layer norm. The specific definition of FFNintra is Formula (14):
FFNintra = TCN(ReLU(Conv1d(·))).
The calculation process of the Inter-SuperFormer is similar to that of the Intra-SuperFormer. When calculating the spatial autocorrelation, the channel feature e(1) of the mixed speech is also integrated, as shown in Formula (15):
z(1) = z + e(1),
The calculation process is shown in Formulas (16)–(19):
z(2) = Norm(InterT (FA (z(1)) + z(1))M),
InterT = Dropout(FFNinter(Norm(MHA(Norm(·)) + ·))),
FFNinter = Bi-LSTM(ReLU(Conv1d(·))),
e(2) = z + z(2).
As shown in Formulas (15) and (19), two residual connections are designed in the whole SuperFormer Block to retain the original input information and improve the back-propagation calculation of the model.
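A simplified sketch of the shared Transformer layer of Formulas (13) and (17), together with the two FFN variants of Formulas (14) and (18), is given below; the TCN is reduced to a small dilated convolution stack and all sizes are assumptions:

import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    # Norm -> MHA -> residual -> Norm -> FFN -> Dropout, as in Formulas (13) and (17).
    def __init__(self, ffn, d_model=128, n_heads=8, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = ffn
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        h = self.norm1(x)
        h, _ = self.mha(h, h, h)
        h = self.norm2(h + x)
        return self.drop(self.ffn(h))

class FFNIntra(nn.Module):
    # FFN_intra (Formula (14)): 1-D convolution, ReLU, then a simplified dilated TCN stack.
    def __init__(self, d_model=128):
        super().__init__()
        self.conv_in = nn.Conv1d(d_model, d_model, 1)
        self.tcn = nn.Sequential(*[nn.Conv1d(d_model, d_model, 3, padding=2 ** i, dilation=2 ** i)
                                   for i in range(4)])

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        h = torch.relu(self.conv_in(x.transpose(1, 2)))
        return self.tcn(h).transpose(1, 2)

class FFNInter(nn.Module):
    # FFN_inter (Formula (18)): 1-D convolution, ReLU, then a bidirectional LSTM.
    def __init__(self, d_model=128):
        super().__init__()
        self.conv_in = nn.Conv1d(d_model, d_model, 1)
        self.lstm = nn.LSTM(d_model, d_model // 2, bidirectional=True, batch_first=True)

    def forward(self, x):
        h = torch.relu(self.conv_in(x.transpose(1, 2))).transpose(1, 2)
        out, _ = self.lstm(h)
        return out

# Example wiring: the intra-block path uses the TCN FFN, the inter-block path the Bi-LSTM FFN.
intra_layer = TransformerLayer(FFNIntra())
inter_layer = TransformerLayer(FFNInter())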
3. Speaker Enhancement Module
At the end of the separation network, a post-processing operation on the separated speech is added: the speaker enhancement module. This module takes the separated features from the convolutional layers of the separation network as input and further extracts long-term features using the weight-sharing property of the CNN. The probability that the speech features belong to a particular speaker is then calculated with a Sigmoid activation for two classes or a Softmax activation for more than two classes.
Thus, the powerful feature extraction and weight-sharing properties of the CNN are utilized to enhance the speech features of the same speaker. The module structure is depicted in Figure 5. Taking the separation of two speakers’ mixed speech as an example, the fused features of the two speakers are input into a deep 2D convolution to calculate a mask, which enhances the speech features belonging to one speaker while suppressing the speech features of the other speaker, as shown in Formula (20).
(spk1, spk2) = (spk1 · Mask(spk1, spk2), spk2 · Mask(spk1, spk2)),
where Mask(·) represents the collection of Concatenation, Conv2d, Normalization and Sigmoid activation operations, and · represents the dot product.
spk1 and spk2 are stacked together to form the final output eN, and N represents the Nth speaker as shown in Formula (21).
(e1, e2) = (spk1, spk2).
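One possible reading of Formula (20) in PyTorch is sketched below; the 4-D feature shape, kernel size and normalisation layer are assumptions on top of the Concatenation, Conv2d, Normalization and Sigmoid pipeline described above:

import torch
import torch.nn as nn

class SpeakerEnhancement(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        # Mask(.) = Concatenation -> Conv2d -> Normalization -> Sigmoid
        self.mask_net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.GroupNorm(1, channels),
            nn.Sigmoid(),
        )

    def forward(self, spk1, spk2):                 # each assumed (batch, channels, K, S)
        mask = self.mask_net(torch.cat([spk1, spk2], dim=1))
        # The shared mask enhances each speaker's own features and suppresses the other's.
        return spk1 * mask, spk2 * mask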

2.2.3. Decoder

The last part of the separation system is the decoder, which takes the output of the encoder and the output of the separation network together as its input. It adopts a 1-D transposed convolution to turn the features of the different channels back into single-channel features and reconstruct the separated speech signal. The stride and kernel size of the decoder are consistent with those of the encoder, and the speech is restored to its length before being input to the encoder. The input of the decoder is the dot product of the output eN of the separation network and the output e of the encoder. The speech signal of the separated Nth speaker is defined by Formula (22):
ŝN = conv1d-transpose (eN · e).
where conv1d-transpose (·) denotes 1-D transpose convolution.
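A minimal sketch of the decoder, mirroring the encoder settings (kernel size 8, stride 4) and taking the element-wise product of the mask eN and the encoder output e, might look as follows:

import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, in_channels=256, kernel_size=8, stride=4):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(in_channels, 1, kernel_size, stride=stride, bias=False)

    def forward(self, e_n, e):            # e_n: mask for speaker N, e: encoder output
        return self.deconv(e_n * e)       # (batch, 1, L): reconstructed waveform for speaker N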

3. Experiments and Results

3.1. Dataset

The WSJ0-2mix dataset is commonly used in speech separation research [9,10]. The speech of different speakers is randomly selected from the WSJ0 corpus and mixed at Signal-to-Noise Ratios (SNRs) randomly drawn between 0 and 5 dB to generate multi-speaker mixed speech data. The dataset includes a 30 h 22 min training set, a 7 h 40 min validation set and a 4 h 49 min test set.

3.2. Network Parameters and Experiment Configurations

3.2.1. Network Configurations

Network parameters are critical factors that affect the performance of the model. The reference settings are as follows. The encoder utilizes a Conv1d with 256 filters and the ReLU activation function; the kernel size is 8 and the stride is 4. The channel dimension in the separation network is 128, the block size K is 250, the sliding step between blocks is K//2, the SuperFormer Block is cycled T = 2 times, the forward adaptive module is repeated four times, the Intra/Inter-Transformer is repeated M = 8 times, and the speaker enhancement process is executed once. The configuration of the decoder convolution layer is consistent with that of the encoder.
The training objective for the network is scale-invariant signal-to-noise ratio (SI-SNR). Utterance-level permutation invariant training (uPIT) is applied during training to address the source permutation problem [11]. SI-SNR is defined as Formula (23):
S_T = ((Ŝ · S) S) / ‖S‖²,
S_E = Ŝ − S_T,
SI-SNR = 10 log₁₀(‖S_T‖² / ‖S_E‖²),
where Ŝ ∈ ℝ^(1×T) and S ∈ ℝ^(1×T) are the model output and the original clean source, respectively.
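Formula (23) can be implemented directly, for example as the following PyTorch function operating on 1-D tensors (mean removal is applied here as is common practice):

import torch

def si_snr(s_hat, s, eps=1e-8):
    # Scale-invariant SNR between an estimate s_hat and a clean source s.
    s_hat = s_hat - s_hat.mean()
    s = s - s.mean()
    s_target = (torch.dot(s_hat, s) / (torch.dot(s, s) + eps)) * s   # S_T
    e_noise = s_hat - s_target                                       # S_E
    return 10 * torch.log10(torch.sum(s_target ** 2) / (torch.sum(e_noise ** 2) + eps))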

3.2.2. Experiment Configurations

During training, SI-SNRi is employed as the loss function, and the batch size is 1. AdamW is used as the optimizer, with a weight decay of 1 × 10−4. The initial learning rate is 2.5 × 10−4; if the loss on the validation set does not improve over the best value for two consecutive iteration cycles, the learning rate is halved. If the performance of the model does not improve for 10 consecutive iteration cycles, training is stopped. At the same time, in order to prevent overfitting, L2 regularization is added, set to 5 in the initial stage, 20 in the middle stage and 100 in the later stage of training, while the dropout rate gradually increases from 0 to 0.1 and then 0.5.
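The optimiser and learning-rate schedule described above can be sketched as follows; model, max_epochs and train_one_epoch are hypothetical placeholders for the actual training loop:

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-4, weight_decay=1e-4)
# Halve the learning rate when the validation loss has not improved for two epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                       factor=0.5, patience=2)

for epoch in range(max_epochs):
    val_loss = train_one_epoch(model, optimizer)   # hypothetical helper returning validation loss
    scheduler.step(val_loss)
    # Early stopping after 10 epochs without improvement would be handled here.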

3.3. Introduce Previous Methods and Evaluation Metrics

At present, there are two kinds of mainstream methods for the speech separation task: one separates in the frequency domain, and the other separates directly in the time domain.
  • On the basis of the deep clustering framework, DPCL++ [12] introduced better regularization, a larger temporal context, and a deeper architecture.
  • The DPCL model was extended to realize end-to-end training on signal reconstruction quality for the first time.
  • uPIT-BLSTM-ST [11] added an utterance-level cost function to Permutation Invariant Training (PIT) technology to extend it, eliminating the additional permutation problem that frame level PIT needs to solve in the inference process.
  • DANet [13] created attractor points by finding the centroids of the sources in the embedded space, and then utilizes these centroids to determine the similarity between each bin in the mixture and each source, achieving end-to-end real-time separation on different numbers of mixed sources.
  • ADANet [14] solved the problem of DANet’s mismatch between training and testing, providing a flexible solution and a generalized Expectation–Maximization strategy to determine attractors assigned from estimated speakers. This method maintains the flexibility of attractor formation at the discourse level and can be extended to variable signal conditions.
  • Conv-TasNet [2] proposed a fully convolutional, end-to-end time-domain speech separation network; it is a classic network structure with outstanding performance.
  • DPRNN [3] proved that the deep structure of RNN layers organized by dual paths can better model extremely long sequences to improve separation performance.
  • SepFormer [4] proved the effectiveness of learning short-term and long-term dependencies using a multi-scale method with dual-path transformers.
In this paper, scale-invariant signal-to-noise ratio improvement (SI-SNRi) and signal-to-distortion ratio improvement (SDRi) are reported as objective measures of separation accuracy. In addition to these distortion indicators, the quality of the separated mixtures is assessed using the perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI) and mean opinion score-listening quality objective (MOS-LQO).

3.4. Comparison with Previous Methods

The separation performance of the proposed method is compared with that of previous methods. Table 1 shows the performance of classical separation models on the WSJ0-2mix dataset in recent years. Missing values in the table indicate that the numbers were not reported in the corresponding study. The evaluation criteria include SI-SNRi, SDRi, the number of model parameters and the stride size. The table shows that SuperFormer employs only 12.8 M parameters, with an SI-SNRi of 17.6 and an SDRi of 17.8. Compared with the DPCL++ model, which has almost the same number of parameters, the performance is greatly improved.
In addition, we conducted comparative experiments within our own system. As shown in Table 1, we use the improved dual-path Transformer structure as the baseline system for SuperFormer. The baseline system can still produce 15.2 dB separation performance with half the number of parameters. In this paper, the forward adaptive module (FA module) is proposed to address the Transformer structure’s neglect of channel adaptability modeling. After the FA module is added, the performance of the baseline system is improved by 1.6 dB.
In addition, the speaker enhancement module (SE module) proposed in this paper improves performance by a further 0.8 dB when added on top of the FA module in the baseline system. Overall, the proposed modules improve the separation performance of the baseline system by 2.4 dB. In conclusion, the proposed modules effectively improve the model’s ability to model speaker autocorrelation.
Among the previous methods, some have a large number of parameters yet still deliver unsatisfactory separation performance, some have relatively few parameters but also fall short in separation performance, and some achieve better separation performance only with more parameters than the model in this paper. Therefore, considering both the number of parameters and the separation performance, the method presented in this paper is superior to the other methods.
Table 1 also compares the separation results of common methods and includes an ablation experiment to verify the effect of each module. Without the FA and SE modules, the system still obtains a separation result of 15.2 dB. After the FA module is added, the result improves to 16.8 dB, indicating that the module promotes the channel adaptability of the model.
Furthermore, after the SE module is added, the model performance improves by another 0.8 dB, while the parameter count increases to 12.8 M. In general, the experiment verifies the separation performance of the model, and the proposed model achieves the expected results. Compared with other models, the proposed model has the advantages of excellent separation performance and fewer model parameters.
The full model achieved higher performance than the SuperFormer baseline system on the open corpus WSJ0-2mix. The forward adaptive module helps the Transformer model extract the channel and spatial features of speech and further improves the SI-SNRi and SDRi values of speech separation. The speaker enhancement module also improves the subjective evaluation score of the separated speech. Comparable separation performance is obtained with fewer parameters than other models.
Table 2 compares the PESQ scores of SuperFormer and the previous methods. PESQ values generally lie in the range of −0.5 to 4.5, and the higher the score, the better the speech quality. In Table 2, the PESQ score of SuperFormer is 3.43, which is 76% of the Clean score and, expressed as percentages of the Clean score, 17 points higher than DANet-Kmeans, 18 points higher than DANet-Fixed, 13 points higher than ADANet-6-do and 4 points higher than Conv-TasNet. The results show that SuperFormer achieves a certain improvement in speech quality over the previous methods, although a large gap from Clean remains.
In addition to SI-SNRi and SDRi, the PESQ, STOI and MOS-LQO indicators are also considered to evaluate speech quality. Table 3 shows the STOI and MOS-LQO scores of SuperFormer. STOI reflects an objective evaluation of human auditory perception of speech intelligibility; its value lies between 0 and 1, and the larger the value, the higher the speech intelligibility and the more complete and clearer the speech. In order to evaluate the subjective listening quality, we employed an objective measurement technique, namely MOS-LQO. Table 3 shows that although the model proposed in this paper fails to surpass Conv-TasNet in MOS value (substituted by MOS-LQO), it still maintains good speech clarity and integrity.
In conclusion, the experimental results show that the separation model proposed in this paper achieves excellent SI-SNRi and SDRi on the WSJ0-2mix dataset, together with good PESQ, STOI and MOS-LQO scores. The experiments also show that the proposed FA module effectively improves the channel and spatial adaptability of the model and thus the separation effect, while the SE module learns speaker information from the autocorrelation of each speaker’s speech in the mixture for further separation enhancement. The proposed method can provide a solution for speech separation and speech recognition in multi-speaker scenarios and has the advantages of a better separation effect and fewer model parameters.

3.5. Visual Results

In this paper, an enhanced multi-speaker speech separation network combining channel and spatial adaptability, SuperFormer, has been proposed. The overall visual structure of the SuperFormer is illustrated in Figure 6. Firstly, a forward adaptive module is constructed in the separation network. The module combines MLP with deep convolution structure to enhance the perception of speaker speech features.
The autocorrelation of speaker speech signal sequence is an important basis for speech separation. Effectively learning the autocorrelation of speech sequences is a difficult task in speech separation. This autocorrelation includes the channel and spatial features of speech. Based on the feature autocorrelation of the same speaker, the model enhances the speaker’s speech features in the mixed speech to, thus, improve the performance of speech separation.
As shown in previous studies, the Transformer’s multi-head self-attention mechanism can adaptively model long-distance speech sequences. Using the learning ability of the Transformer module, the local and global features of speech are extracted, and the Intra-SuperFormer and Inter-SuperFormer modules are designed to establish the channel and spatial relationships of the speech data, thus making up for the lack of spatial adaptability of the self-attention mechanism. Finally, Dw-Conv2d and normalization are used to build the speaker enhancement module, which exploits the autocorrelation information learned by the separation network to further enhance or suppress the separated speech.
To observe the effect of each stage of the model during separation, spectrograms are used to visualize the processing of the encoder, separation network and decoder. As illustrated in Figure 6, the numbers in the figure represent the changes in the channel dimension. The features of the encoded mixed speech are sent to the separation network for channel and spatial autocorrelation modeling, and the features are separated. After the separation result passes through the speaker enhancement module, the features are finally reconstructed into a speech signal by the decoder.
In order to verify the separation accuracy of the model on the whole dataset, the labels of the separated speech data are compared with the real speech data, as illustrated in Figure 7. The scatter diagram shows the results of comparing the model separated speech with the real speaker tag, which can also be used as a way to evaluate the effect of model separation. The more the predicted value of the separation model overlaps with the tag value in the scatter diagram, the closer the separated speech is to the original speech. In other words, the stronger the separation performance of the model is.

4. Conclusions

This paper proposed a SUPERior transFORMER model (SuperFormer) that integrates the channel autocorrelation and spatial autocorrelation of speech data. In this model, the input long speech sequence was divided into small blocks for processing, a forward adaptive module was designed to improve adaptability, and different Transformer structures were designed to extract the local and global features of the speech sequence, taking advantage of the multi-head self-attention mechanism to establish the correlations of long speech sequences.
At the end of the separation model, the speaker enhancement module was added to further enhance or suppress the speech of different speakers using the mutual suppression characteristics of each source in the mixed speech. The experimental results show that the model outperformed other methods in speech separation tasks and had the advantage of fewer parameters.
In addition, mixed-in noise will affect the accuracy of the model. In future research, we hope to explore speech separation with multiple microphones, noise interference and an uncertain number of speakers.

Author Contributions

Conceptualization, Y.J. and Y.Q.; software, Y.Q.; resources, Y.J., X.S. and C.S.; writing—original draft preparation, Y.J. and Y.Q.; writing—review and editing, Y.J., Y.Q., X.S., C.S. and H.L.; supervision, Y.J., X.S., C.S. and H.L.; visualization, Y.Q.; funding acquisition, Y.J., X.S., C.S. and H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant numbers 52002215 and 12104153; the Jiangsu Science and Technology Project, grant number BE2021011-4; the Research Project of Hubei Provincial Department of Education, grant numbers D20202902 and D20212901; the Hubei Science and Technology Project, grant numbers 2021BEC005 and 2021BLB225; the Hong Kong Scholars Program, grant number XJ2021028; the Research Project of Liaoning Provincial Department of Education, grant numbers LJKZ0338 and LJ2020FWL001; and the China Postdoctoral Science Foundation, grant number 2021M701963.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are openly available in CSR-I (WSJ0) Complete at https://doi.org/10.35111/ewkm-cg47 (accessed on 11 December 2021), reference number [9].

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Luo, Y.; Mesgarani, N. TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 696–700.
  2. Luo, Y.; Mesgarani, N. Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1256–1266.
  3. Luo, Y.; Chen, Z.; Yoshioka, T. Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 4–8 May 2020; pp. 46–50.
  4. Subakan, C.; Ravanelli, M.; Cornell, S.; Bronzi, M.; Zhong, J. Attention is All You Need in Speech Separation. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, ON, Canada, 6–11 June 2021.
  5. Lea, C.; Vidal, R.; Reiter, A.; Hager, G.D. Temporal Convolutional Networks: A Unified Approach to Action Segmentation. In Proceedings of the Computer Vision–ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10, 15–16 October 2016; pp. 47–54.
  6. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
  7. Liu, Y.; Shao, Z.; Hoffmann, N. Global Attention Mechanism: Retain Information to Enhance Channel-Spatial Interactions. arXiv 2021, arXiv:2112.05561.
  8. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. arXiv 2022, arXiv:2201.03545.
  9. Garofolo, J.S.; Graff, D.; Paul, D.; Pallett, D. CSR-I (WSJ0) Complete LDC93S6A. Web Download; Linguistic Data Consortium: Philadelphia, PA, USA, 1993.
  10. Hershey, J.R.; Chen, Z.; Le Roux, J.; Watanabe, S. Deep clustering: Discriminative embeddings for segmentation and separation. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 31–35.
  11. Kolbaek, M.; Yu, D.; Tan, Z.H.; Jensen, J. Multi-talker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1901–1913.
  12. Isik, Y.; Roux, J.L.; Chen, Z.; Watanabe, S.; Hershey, J.R. Single-channel multi-speaker separation using deep clustering. In Proceedings of the Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016; pp. 545–549.
  13. Chen, Z.; Luo, Y.; Mesgarani, N. Deep attractor network for single microphone speaker separation. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 246–250.
  14. Luo, Y.; Chen, Z.; Mesgarani, N. Speaker-independent Speech Separation with Deep Attractor Network. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 787–796.
  15. Xu, C.L.; Rao, W.; Xiao, X.; Chng, E.S.; Li, H.Z. Single Channel Speech Separation with Constrained Utterance Level Permutation Invariant Training Using Grid LSTM. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018.
  16. Li, C.X.; Zhu, L.; Xu, S.; Gao, P.; Xu, B. CBLDNN-Based Speaker-Independent Speech Separation Via Generative Adversarial Training. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 711–715.
  17. Wang, Z.Q.; Roux, J.L.; Hershey, J.R. Alternative objective functions for deep clustering. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 686–690.
  18. Wang, Z.Q.; Le Roux, J.; Wang, D.L.; Hershey, J. End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018.
  19. Luo, Y.; Mesgarani, N. Real-time Single-channel Dereverberation and Separation with Time-domain Audio Separation Network. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018.
Figure 1. Overall technical route of speech separation system.
Figure 2. Structure design of the SuperFormer Block, including the Intra-SuperFormer and Inter-SuperFormer with different structures. The difference between the two modules lies in the structure of the FFN module in the Transformer. The Intra-Transformer uses a TCN layer to learn the intra-block data correlation of speech, and the Inter-Transformer uses an RNN network to learn the inter-block data correlation of speech.
Figure 3. The Forward Adaptive Module structure is designed to enhance the adaptability of the model to features.
Figure 4. Structural design of the TCN module, which is designed for feature extraction of intra-block speech.
Figure 5. The internal structure of the speaker enhancement module, as the post-processing module of the model, uses the temporal correlation characteristics of the same speaker’s speech to improve the separation performance of the model.
Figure 6. Visualization of speech separation process. Taking the mixing of two speakers’ speech as an example, the speaker’s speech is displayed in red and blue respectively. The number in the figure represents the change of channel dimension, in which the encoder adopts 1 × 256 convolution coding. Through the autocorrelation modeling of SuperFormer, the separated speech feature map is obtained. It can be clearly seen that the speaker enhancement module improves the separation effect. Finally, the decoder applies 256 × 1 transpose convolution to decode and obtain the separated speech waveform.
Figure 7. Comparison diagram of the predicted speaker speech and real speaker tag. (a) Comparison of speaker 1’s separation results and labels; (b) Comparison of speaker 2’s separation results and labels. The coincidence degree between two-speaker speech separated by the model and the original speech data tag is ideal, which verifies the generalization effect of the model.
Table 1. Comparison with other methods on the WSJ0-2mix dataset.
Model                 SI-SNRi   SDRi   #Param    Stride
uPIT-BLSTM-ST [11]    -         10.0   92.7 M    -
cuPIT-Grid-RD [15]    -         10.2   47.2 M    -
ADANet [14]           10.4      10.8   9.1 M     -
DANet [13]            10.5      -      9.1 M     -
DPCL++ [12]           10.8      -      13.6 M    -
CBLDNN-GAT [16]       -         11.0   39.5 M    -
TasNet [1]            10.8      11.1   n.a.      20
Chimera++ [17]        11.5      12.0   32.9 M    -
WA-MISI-5 [18]        12.6      13.1   32.9 M    -
BLSTM-TasNet [19]     13.2      13.6   23.6 M    -
Conv-TasNet [2]       15.3      15.6   5.1 M     10
SuperFormer           15.2      15.3   8.1 M     4
+FA                   16.8      16.9   10.6 M    4
+FA+SE                17.6      17.8   12.8 M    4
+FA refers to employing the Forward Adaptive Module. +SE denotes the Speaker Enhancement Module.
Table 2. PESQ scores on the WSJ0-2mix dataset.
Dataset      PESQ
             DANet-Kmeans   DANet-Fixed   ADANet-6-do   Conv-TasNet   SuperFormer   Clean
WSJ0-2mix    2.64           2.57          2.82          3.24          3.43          4.5
Table 3. STOI and MOS-LQO scores on the WSJ0-2mix dataset.
Dataset      SuperFormer           Conv-TasNet-gLN
             STOI     MOS-LQO      MOS      Clean
WSJ0-2mix    0.97     3.45         4.03     4.23
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

