Article

Iteratively Refined Multi-Channel Speech Separation

Institute of Speech and Audio Information Processing, School of Information Science and Technology, Beijing University of Technology, Beijing 100124, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(14), 6375; https://doi.org/10.3390/app14146375
Submission received: 20 May 2024 / Revised: 9 July 2024 / Accepted: 20 July 2024 / Published: 22 July 2024
(This article belongs to the Special Issue Advanced Technology in Speech and Acoustic Signal Processing)

Abstract

The combination of neural networks and beamforming has proven very effective in multi-channel speech separation, but its performance faces a challenge in complex environments. In this paper, an iteratively refined multi-channel speech separation method is proposed to meet this challenge. The proposed method is composed of initial separation and iterative separation. In the initial separation, a time–frequency domain dual-path recurrent neural network (TFDPRNN), minimum variance distortionless response (MVDR) beamformer, and post-separation are cascaded to obtain the first additional input in the iterative separation process. In iterative separation, the MVDR beamformer and post-separation are iteratively used, where the output of the MVDR beamformer is used as an additional input to the post-separation network and the final output comes from the post-separation module. This iteration of the beamformer and post-separation is fully employed for promoting their optimization, which ultimately improves the overall performance. Experiments on the spatialized version of the WSJ0-2mix corpus showed that our proposed method achieved a signal-to-distortion ratio (SDR) improvement of 24.17 dB, which was significantly better than the current popular methods. In addition, the method also achieved an SDR of 20.2 dB on joint separation and dereverberation tasks. These results indicate our method’s effectiveness and significance in the multi-channel speech separation field.

1. Introduction

Currently, speech separation technology plays a crucial role in human–computer interaction, audio processing, and communication systems [1,2]. With the advancement of technology, especially deep learning, significant progress has been made in efficient speech separation, particularly in single-channel speech separation [3,4,5,6]. Although single-channel methods perform well in some environments, their effectiveness is limited in more complex acoustic environments. Therefore, to overcome these limitations and further improve separation performance, multi-channel speech separation methods [7] have been explored. Traditional multi-channel methods, such as delay-and-sum beamforming [8], work well in certain situations. However, in reverberant environments and as the number of sound sources increases, these traditional methods often struggle to separate speech signals effectively. The main difficulty is that traditional beamforming relies on relatively simple signal processing strategies, which are not well suited to dynamic changes in source positions or to complex and variable acoustic environments.
Due to the limitations of conventional multi-channel methods, particular attention has been paid to neural beamforming, which combines the powerful nonlinear modeling capability of neural networks with the spatial filtering of a beamformer. In classical neural beamforming, a neural network first produces an initial speech separation. This initially separated speech and the original mixture are then used by the beamformer to compute spatial covariance matrices (SCMs) [9]. In addition, a post-filter cascaded after the beamformer is often incorporated to further improve the quality and intelligibility of the separated speech [10]. Compared with conventional beamforming, neural beamforming has shown significant advantages in complex acoustic environments, which has made it a mainstream approach in recent research and applications [9,11,12,13].
For example, a masking-based neural beamforming method was developed in [11,13], in which multiple single-channel long short-term memory (LSTM) networks first estimate the masks of the speakers, and these masks are then used to estimate the SCMs of speech and noise for the MVDR beamformer. This kind of method has shown a significant improvement over conventional beamforming. In addition, a signal-based neural beamforming method was proposed in [9], in which a time-domain audio separation network (TasNet) pre-separates the speech and the separated speech is used to compute the SCMs for the MVDR beamformer; this achieved better performance than using an ideal ratio mask in the MVDR beamformer [9]. The work in [14] indicated that reverberation has a significant impact on separation when TasNet is used. This observation inspired us to explore a separation method with robustness to reverberation, aiming at a more accurate SCM for the MVDR beamformer and thus better beamforming performance. Consequently, in our previous work [15], a time–frequency domain dual-path recurrent neural network (TFDPRNN) was proposed to achieve better separation in reverberant environments, and a significant improvement was obtained by combining the MVDR beamformer with the TFDPRNN (called Beam-TFDPRNN). Although these neural beamforming methods perform well, they are still restricted to linear filtering operations and their performance is limited. Therefore, in this paper, we explore other ways to improve the performance of neural beamforming.
In recent years, a notable trend in speech separation has been the use of iteratively refined structures. In single-channel speech separation, results in [16,17,18] have shown that separation accuracy can be significantly improved through iterative optimization. In multi-channel speech separation, iterative structures have also demonstrated great potential: an MVDR beamformer and TasNet were iterated in [19], a time-domain real-valued generalized Wiener filter (TD-GWF) and TasNet were iterated in [14], and a time-domain dilated convolutional neural network (TDCN) and multi-channel Wiener filter (MCWF) were iterated in [20]. These methods improved considerably over their non-iterative versions. Therefore, in this paper, we also adopt an iterative structure to improve the performance of neural beamforming. We substantially extend our previous work [15] into an iterative version and propose a new neural beamforming structure, the improved Beam-TFDPRNN (iBeam-TFDPRNN). The main contributions of this paper are summarized as follows:
  • The structure of the original neural beamforming is revised. Specifically, two main changes are made. First, the number of time–frequency domain path scanning blocks in the neural network is reduced from six to three. This simplification improves the training efficiency and inference speed of the model while reducing its complexity and resource consumption. Second, an iteratively refined separation method is proposed, which uses the initially separated speech together with the original mixed signal as an auxiliary input to the iterative network. By repeating this process over N iteration stages, the MVDR beamformer and the post-separation network mutually promote each other, and the separation results are effectively improved;
  • We evaluate not only each stage of the multi-stage iterative process but also use a broader set of evaluation metrics for a more comprehensive assessment. The experimental results show that the proposed method worked well on the spatialized version of the WSJ0-2mix corpus and greatly outperformed the current popular methods. In addition, the proposed method also performed well on the joint separation and dereverberation task.
The rest of this paper is organized as follows: Section 2 presents the details of the proposed method. Section 3 describes the experimental setup. Section 4 presents the results and analysis, Section 5 provides a discussion, and Section 6 concludes the paper.

2. Proposed Method

In this section, the proposed iBeam-TFDPRNN is introduced. First, the signal model that provides a foundation for the subsequent discussions is described. Then, the architecture of iBeam-TFDPRNN and the loss function used in the proposed method are given.

2.1. Signal Model

In this paper, a far-field signal model with Q speakers in the time domain is considered as follows:
x_c = \sum_{q=1}^{Q} y_{q,c} = \sum_{q=1}^{Q} s_q \ast h_{q,c}
where x_c denotes the signal received by the cth microphone, 1 ≤ c ≤ C, C denotes the total number of microphones, y_{q,c} denotes the signal captured by the cth microphone corresponding to the qth speaker, s_q denotes the original source signal of the qth speaker, h_{q,c} denotes the room impulse response (RIR) from the qth speaker to the cth microphone, and \ast denotes the convolution operation.
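As a minimal illustration of this signal model (not part of the original paper), the following NumPy sketch builds the multi-channel mixture by convolving each dry source with its RIRs and summing over speakers; the function name, the shape conventions, and the use of SciPy's fftconvolve are assumptions made for this example.

```python
import numpy as np
from scipy.signal import fftconvolve

def mix_far_field(sources, rirs):
    """Build the multi-channel mixture x_c = sum_q s_q * h_{q,c}.

    sources: list of Q 1-D arrays, the dry source signals s_q.
    rirs:    array of shape (Q, C, L) holding the RIRs h_{q,c}.
    Returns the mixture x of shape (C, T) and the source images y of shape (Q, C, T).
    """
    Q, C, L = rirs.shape
    T = max(len(s) for s in sources) + L - 1
    y = np.zeros((Q, C, T))
    for q, s in enumerate(sources):
        for c in range(C):
            conv = fftconvolve(s, rirs[q, c])   # s_q * h_{q,c}
            y[q, c, :len(conv)] = conv
    x = y.sum(axis=0)                           # sum over the Q speakers
    return x, y
```

Here y[q] corresponds to the source images y_{q,c} that later serve as the reference signals for the separation task.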

2.2. Initial Separation

In Figure 1, the structure of the initial separation is shown. It is composed of three parts: the TFDPRNN used for pre-separation with one input (the mixed signal), the MVDR beamformer, and the post-separation module consisting of the same TFDPRNN with two inputs (the mixed signal and the output of the MVDR beamformer). In this structure, the mixed signals from the multiple microphones are first fed into the pre-separation network. Then, the pre-separated signals and the mixed signals are used to obtain the statistical information of the MVDR beamformer. Finally, the post-separation network is placed at the back end of the beamformer to obtain the refined output of the initial separation. The initial separation is described in detail below, and a code sketch of its main network components is given after the lettered items.
During the pre-separation, the signals collected by each microphone are individually fed into the respective TFDPRNN modules, where each network adopts a classical encoder–separator–decoder structure for the processing.
(a)
In the encoder section, firstly, the mixed signal xc is transformed into the time–frequency representation Yc by short-time Fourier transform (STFT). Then, this representation Yc is applied to the dynamic range compression (DRC) module to obtain YDRC. Subsequently, the local features YConv are extracted from YDRC through a 2D convolutional (Conv2D) layer. Finally, these features YConv are passed through a rectified linear unit (ReLU) activation function to obtain the encoded feature Ec. The whole encoder section can be expressed as:
E_c = \mathrm{Encoder}\{x_c\}_c
where \mathrm{Encoder}\{\cdot\}_c corresponds to the encoder of the cth microphone and E_c denotes the encoded representation of the cth microphone signal.
(b)
In the separator section, the encoded feature E_c is first normalized by layer normalization (LN) and then passed through a Conv2D layer to obtain the feature \hat{Y}_{\mathrm{Conv}}. Subsequently, the feature \hat{Y}_{\mathrm{Conv}} is sent to N time–frequency domain scanning blocks that use a time–frequency scanning mechanism [21,22]. Each scanning block consists of two recurrent modules: the first applies a bi-directional LSTM (BLSTM) layer along the frequency axis and the second applies a BLSTM along the time axis. Both modules include reshaping, LN, and fully connected (FC) operations. Finally, after these modules, the features are further refined by a Conv2D layer and a ReLU activation function, resulting in the separated mask \hat{M}_{q,c}. The whole separator section can be expressed as:
\hat{M}_{q,c} = \mathrm{Separator}\{E_c\}_c
where \mathrm{Separator}\{\cdot\}_c denotes the separator corresponding to the signal of the cth microphone and \hat{M}_{q,c} denotes the mask of the qth speaker at the cth microphone.
Thus, the separated masks are multiplied element-wise with the encoded feature E_c to obtain the separated feature representation \hat{E}_{q,c}:
\hat{E}_{q,c} = \hat{M}_{q,c} \odot E_c
where E_c denotes the encoded feature representation at the cth microphone and \odot denotes the Hadamard product.
(c)
In the decoder section, the separated feature \hat{E}_{q,c} passes through a Conv2D layer, inverse DRC (IDRC), and inverse STFT (ISTFT) to obtain the finally separated waveforms \hat{y}_{q,c}^{(0)}, where the superscript (0) denotes stage 0 (the pre-separation stage). The whole decoder section can be expressed as:
\hat{y}_{q,c}^{(0)} = \mathrm{Decoder}\{\hat{E}_{q,c}\}_q
where \mathrm{Decoder}\{\cdot\}_q denotes the decoder of the qth speaker and \hat{y}_{q,c}^{(0)} denotes the separated waveform of the qth speaker at the cth microphone.
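To make the encoder and the time–frequency scanning block of (a) and (b) more concrete, the following PyTorch sketch is provided. It is an illustrative reconstruction rather than the authors' implementation: the exact form of the DRC is not specified above, so a power-law compression of the STFT magnitude (exponent 0.5) is assumed, and the feature dimension, window, and class names are chosen for this example only.

```python
import torch
import torch.nn as nn

class TFEncoder(nn.Module):
    """Encoder sketch: STFT -> dynamic range compression -> Conv2D -> ReLU.
    A power-law magnitude compression is assumed for the DRC, and the
    compressed real/imaginary parts are stacked as two input channels."""
    def __init__(self, n_fft=256, hop=128, feat_dim=64, p=0.5):
        super().__init__()
        self.n_fft, self.hop, self.p = n_fft, hop, p
        self.conv = nn.Conv2d(2, feat_dim, kernel_size=7, padding=3)

    def forward(self, x):                                   # x: (batch, samples)
        win = torch.hann_window(self.n_fft, device=x.device)
        Y = torch.stft(x, self.n_fft, self.hop, window=win, return_complex=True)
        Y_drc = torch.polar(Y.abs() ** self.p, torch.angle(Y))   # compress dynamics
        feat = torch.stack([Y_drc.real, Y_drc.imag], dim=1)      # (batch, 2, F, T)
        return torch.relu(self.conv(feat))                       # encoded feature E_c

class ScanBlock(nn.Module):
    """One time-frequency scanning block (sketch): a BLSTM scans the frequency
    axis, then a second BLSTM scans the time axis; each scan is followed by a
    fully connected projection, a residual connection, and layer normalization."""
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.freq_rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.freq_fc, self.freq_norm = nn.Linear(2 * hidden, feat_dim), nn.LayerNorm(feat_dim)
        self.time_rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.time_fc, self.time_norm = nn.Linear(2 * hidden, feat_dim), nn.LayerNorm(feat_dim)

    def _scan(self, x, rnn, fc, norm):                      # x: (N, L, D)
        out, _ = rnn(x)
        return norm(x + fc(out))                            # residual + LN

    def forward(self, E):                                   # E: (B, D, F, T)
        B, D, F, T = E.shape
        x = E.permute(0, 3, 2, 1).reshape(B * T, F, D)      # sequences along frequency
        x = self._scan(x, self.freq_rnn, self.freq_fc, self.freq_norm)
        x = x.reshape(B, T, F, D).permute(0, 2, 1, 3).reshape(B * F, T, D)  # along time
        x = self._scan(x, self.time_rnn, self.time_fc, self.time_norm)
        return x.reshape(B, F, T, D).permute(0, 3, 1, 2)    # back to (B, D, F, T)
```

Stacking three such blocks and applying a Conv2D layer with a ReLU activation yields the mask \hat{M}_{q,c}; the decoder mirrors the encoder with a Conv2D layer, the inverse of the compression, and an inverse STFT.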
During the MVDR beamforming, these separated waveforms are employed in the computation of the relevant SCM in the MVDR beamformer, i.e.,
\hat{\Phi}_f^{\mathrm{Target}} = \frac{1}{T} \sum_{t=1}^{T} \hat{Y}_{q,t,f} \hat{Y}_{q,t,f}^{H}
\hat{\Phi}_f^{\mathrm{Interfer}} = \frac{1}{T} \sum_{t=1}^{T} (Y_{t,f} - \hat{Y}_{q,t,f})(Y_{t,f} - \hat{Y}_{q,t,f})^{H}
where \hat{\Phi}_f^{\mathrm{Target}} \in \mathbb{C}^{C \times C} and \hat{\Phi}_f^{\mathrm{Interfer}} \in \mathbb{C}^{C \times C} represent the SCMs of the speech and interference sources, respectively. \hat{Y}_{q,t,f} \in \mathbb{C}^{C \times 1} is the estimated clean signal vector composed of the STFT coefficients of the C microphones at each time–frequency bin, computed from the output signals of the multiple TFDPRNN modules for the qth speaker. Y_{t,f} \in \mathbb{C}^{C \times 1} is the multi-channel mixture signal vector, which also consists of the STFT coefficients of the C microphones at each time–frequency bin. The notation H denotes the Hermitian transpose. Based on these SCMs, the MVDR beamformer's weights are obtained as follows:
w_f = \frac{(\hat{\Phi}_f^{\mathrm{Interfer}})^{-1} \hat{\Phi}_f^{\mathrm{Target}}}{\mathrm{Tr}\{(\hat{\Phi}_f^{\mathrm{Interfer}})^{-1} \hat{\Phi}_f^{\mathrm{Target}}\}} u
where (\cdot)^{-1} denotes the matrix inverse, \mathrm{Tr}\{\cdot\} denotes the trace of a matrix (the sum of its diagonal elements), and u denotes a one-hot vector selecting the reference microphone.
Subsequently, the signal separated by the MVDR beamformer is given as:
\hat{z}_q^{(1)} = \mathrm{ISTFT}\{w_f^{H} Y_{t,f}\}
where \hat{z}_q^{(1)} denotes the estimated signal of the qth speaker in the initial separation and \mathrm{ISTFT}\{\cdot\} denotes the inverse STFT operation.
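The following NumPy sketch illustrates the SCM estimation and the trace-normalized MVDR weights described above, applied frequency by frequency. The shape conventions, the function name, and the small diagonal loading term (added for numerical stability) are assumptions of this example and are not part of the formulation above.

```python
import numpy as np

def mvdr_separate(Y, Y_hat_q, ref_mic=0, eps=1e-8):
    """MVDR beamforming sketch for one speaker q.

    Y:        mixture STFT of all microphones, shape (C, T, F), complex.
    Y_hat_q:  estimated STFT of speaker q on all microphones, shape (C, T, F).
    Returns the beamformed STFT of speaker q, shape (T, F).
    """
    C, T, F = Y.shape
    Z = np.zeros((T, F), dtype=complex)
    u = np.zeros(C)
    u[ref_mic] = 1.0                                    # one-hot reference-microphone vector
    for f in range(F):
        S = Y_hat_q[:, :, f]                            # (C, T) target estimate
        N = Y[:, :, f] - S                              # interference = mixture - target
        phi_t = (S @ S.conj().T) / T                    # target SCM
        phi_i = (N @ N.conj().T) / T + eps * np.eye(C)  # interference SCM (regularized)
        num = np.linalg.solve(phi_i, phi_t)             # (Phi_Interfer)^{-1} Phi_Target
        w = (num / (np.trace(num) + eps)) @ u           # trace-normalized MVDR weights
        Z[:, f] = w.conj() @ Y[:, :, f]                 # w_f^H Y_{t,f} for every frame
    return Z
```

Applying an inverse STFT to Z gives the time-domain beamformer output \hat{z}_q^{(1)}.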
During the post-separation process, the output \hat{z}_q^{(1)} is used as an additional input along with the original mixed signal x_c and fed into the post-separation network to obtain the initial separation output \hat{y}_{q,c}^{(1)}. Importantly, the post-separation network has the same structure as the pre-separation network.

2.3. Iterative Separation

The overall structure of iBeam-TFDPRNN is shown in Figure 2. This new structure is divided into two parts, i.e., initial separation and iterative separation. The initial separation is described in Section 2.2. The iterative separation contains an MVDR beamformer and a post-separation network. For convenience, the pre-separation prior to the initial separation is called stage 0, and the first combination of the MVDR beamformer and post-separation after stage 0 is called stage 1. In the subsequent iterative separation, the first iteration is called stage 2, the second iteration is called stage 3, and so on.
Specifically, in the first iteration, the initially separated signals \{\hat{y}_{q,c}^{(1)}\}_{q=1,c=1}^{Q,C} of the Q speakers on all C microphones and the multi-channel mixed signals \{x_c\}_{c=1}^{C} are sent to the MVDR beamformer to obtain the outputs \{\hat{z}_q^{(2)}\}_{q=1}^{Q}. These outputs and the original mixed signals \{x_c\}_{c=1}^{C} are then sent to the post-separation network to obtain the outputs \{\hat{y}_{q,c}^{(2)}\}_{q=1,c=1}^{Q,C}. In the second iteration, the outputs \{\hat{y}_{q,c}^{(2)}\}_{q=1,c=1}^{Q,C} and the mixed signals \{x_c\}_{c=1}^{C} are sent to the MVDR beamformer of the next stage to obtain the outputs \{\hat{z}_q^{(3)}\}_{q=1}^{Q}, which, together with the original mixed signals \{x_c\}_{c=1}^{C}, are sent to the post-separation network of the next stage to obtain the outputs \{\hat{y}_{q,c}^{(3)}\}_{q=1,c=1}^{Q,C}; this process is repeated N times. In this iterative separation, the post-separation output serves as an additional input for the MVDR beamformer of the next stage. Through this iterative loop, the results of the MVDR beamformer and the post-separation network are fully employed to promote each other's optimization and ultimately improve the overall performance. A structural sketch of this loop is given below.
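The overall control flow can be summarized by the following Python sketch. The three callables (pre_separate, beamform, post_separate) are placeholders for the pre-separation TFDPRNN, the per-speaker MVDR step, and the two-input post-separation TFDPRNN; they are not defined here, and the function and argument names are illustrative only.

```python
def iterative_separation(x, pre_separate, beamform, post_separate, n_iter=1):
    """Structural sketch of iBeam-TFDPRNN.

    x: multi-channel mixture {x_c}.  With n_iter = 1, the loop runs stage 1
    (initial separation) and stage 2 (first iteration), matching the stage
    numbering used in the text.
    """
    y = pre_separate(x)                    # stage 0: {y_hat^(0)}
    z = None
    for stage in range(1, n_iter + 2):     # stage 1, then stages 2, 3, ...
        z = beamform(x, y)                 # MVDR outputs {z_hat^(stage)}
        y = post_separate(x, z)            # refined outputs {y_hat^(stage)}
    return y, z                            # the post-separation output is the final result
```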

2.4. Loss Function

The loss function computes the signal-to-distortion ratio (SDR) between the separated signal and the reference signal at the pre-separation stage, the initial separation stage, and the first iteration, and the negatives of these SDRs are summed to form the total loss. The joint loss function can be expressed as follows:
\mathrm{Loss} = -\mathrm{SDR}(\hat{y}_{q,c}^{(0)}, y_{q,c}) - \mathrm{SDR}(\hat{y}_{q,c}^{(1)}, y_{q,c}) - \mathrm{SDR}(\hat{y}_{q,c}^{(2)}, y_{q,c})
where
\mathrm{SDR}(y, s) = 10 \log_{10}\left(\frac{\|s\|^2}{\|s - y\|^2}\right)
denotes the SDR calculation operator, s and y denote the reference and the separated speech signals, respectively, y_{q,c} denotes the original clean signal, \hat{y}_{q,c}^{(0)} denotes the pre-separated signal, \hat{y}_{q,c}^{(1)} denotes the initially separated signal, and \hat{y}_{q,c}^{(2)} denotes the post-separation output signal after the first iteration.
Since each iteration can improve on the previous result, the same loss function could, in principle, be applied at every iteration. Here, a fixed set of loss terms is used to reduce the complexity of the training process and make the model easier to train; therefore, in this paper, only a three-stage loss function is used.
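A minimal PyTorch sketch of this three-stage loss is given below; the function names and tensor shapes are assumptions, and the sign convention follows the loss expression above (the SDR terms are negated so that minimizing the loss maximizes the SDR).

```python
import torch

def sdr(est, ref, eps=1e-8):
    """SDR(y, s) = 10 * log10(||s||^2 / ||s - y||^2), computed over the last axis."""
    num = (ref ** 2).sum(dim=-1)
    den = ((ref - est) ** 2).sum(dim=-1) + eps
    return 10 * torch.log10(num / den + eps)

def three_stage_loss(estimates, ref):
    """Negative SDR summed over the pre-separation, initial-separation, and
    first-iteration outputs.  estimates: [y_hat^(0), y_hat^(1), y_hat^(2)],
    each shaped like the reference signal ref."""
    return -sum(sdr(est, ref).mean() for est in estimates)
```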

3. Experimental Setup

3.1. Datasets and Microphone Structure

The effectiveness of the proposed method was evaluated using the spatialized version of the WSJ0-2mix dataset [23]. This dataset contains 20,000 training utterances (about 30 h), 5000 validation utterances (about 10 h), and 3000 test utterances (about 5 h), as shown in Table 1. All utterances in the training and validation sets were either extended or truncated to four seconds, and the sampling rate of all audio data was 8 kHz. The dataset comprises "min" and "max" versions: in the "min" version, each mixture is truncated to the duration of the shorter utterance, while in the "max" version, it is extended to the duration of the longer utterance. The "min" version was used for training and validation, while the "max" version was used for testing, to maintain consistency with the baseline methods. When mixing speech from two speakers, the signal-to-interference ratio (SIR) of the speech signals varied from −5 dB to +5 dB. These adjusted speech signals were then convolved with RIRs to simulate the reverberation of real environments. The RIRs were simulated by the image method proposed by Allen and Berkley in 1979 [24]. In the simulation, the length and width of the room varied from 5 m to 10 m, the height varied from 3 m to 4 m, the reverberation time varied from 0.2 s to 0.6 s, and the positions of the microphones and speakers were all randomly selected. The microphone array consisted of eight omnidirectional microphones placed inside a virtual sphere. The center of this sphere was located roughly at the center of the room, and the radius of the sphere was randomly selected from 7.5 cm to 12.5 cm. The first four microphones were used for training and validation: two were symmetrically positioned on the surface of the sphere, while the other two were randomly positioned inside the sphere. The last four microphones were used for testing; they were randomly positioned within the area defined by the first two microphones, to evaluate the performance of the model under unseen microphone configurations.

3.2. Model Configuration

The basic model configuration for the experiments is as follows; most settings were the same as the original ones [22]. The STFT used a 32 ms frame length and a 16 ms hop size. In the encoder section, the kernel size of the Conv2D layer was set to (7, 7) in order to extract local features, while in the other sections, the kernel size of the Conv2D layers was set to (1, 1). In the separator section, the number of time–frequency domain path scanning blocks was reduced from six to three to simplify the model. Each block contained two BLSTM layers, with each BLSTM layer consisting of 128 hidden units. In addition, a parameter-sharing strategy was adopted, meaning that the same parameters were reused during the iterative separation. This strategy reduced the total number of parameters in the model and thus decreased the computational requirements.
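For reference, at the 8 kHz sampling rate these settings correspond to the values collected in the illustrative configuration below (the dictionary layout and key names are hypothetical; only the numerical values come from the text).

```python
SAMPLE_RATE = 8000                            # Hz
STFT_CONFIG = {
    "n_fft": int(0.032 * SAMPLE_RATE),        # 32 ms frame length -> 256 samples
    "hop_length": int(0.016 * SAMPLE_RATE),   # 16 ms hop size -> 128 samples
}
MODEL_CONFIG = {
    "encoder_kernel": (7, 7),                 # Conv2D kernel in the encoder
    "other_kernel": (1, 1),                   # Conv2D kernels elsewhere
    "num_scan_blocks": 3,                     # reduced from six to three
    "blstm_hidden": 128,                      # hidden units per BLSTM layer
    "share_params": True,                     # same weights reused across iteration stages
}
```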

3.3. Training Configuration

In the model training, the batch size was set to 1. Utterance-level permutation invariant training (uPIT) was applied to address the source permutation problem. The Adam optimizer was used with a learning rate of 1 × 10⁻³. Additionally, the maximum norm for gradient clipping was set to 5. The proposed networks and the comparison networks were trained for 150 epochs to ensure the fairness of the experiments.
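The uPIT criterion can be sketched as follows, assuming estimates and references of shape (batch, Q, samples); the function name and the use of a negative-SDR objective inside the permutation search are assumptions of this example.

```python
import torch
from itertools import permutations

def upit_neg_sdr(est, ref, eps=1e-8):
    """Utterance-level PIT (sketch): evaluate a negative-SDR loss for every
    speaker permutation and keep the best permutation per utterance."""
    def neg_sdr(e, r):                                   # (batch, Q) negative SDR
        num = (r ** 2).sum(dim=-1)
        den = ((r - e) ** 2).sum(dim=-1) + eps
        return -10 * torch.log10(num / den + eps)
    Q = est.shape[1]
    per_perm = [neg_sdr(est[:, list(p), :], ref).mean(dim=1)   # (batch,)
                for p in permutations(range(Q))]
    return torch.stack(per_perm, dim=1).min(dim=1).values.mean()
```

In training, such a loss would be minimized with torch.optim.Adam at a learning rate of 1e-3 and with torch.nn.utils.clip_grad_norm_ applied with a maximum norm of 5, matching the settings above.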

3.4. Evaluation Metrics

The SDR from the blind source separation evaluation (BSS-Eval) toolkit [25] and the scale-invariant signal-to-distortion ratio (SI-SDR) were chosen as the main objective measures of separation accuracy. Furthermore, the perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and SIR were used to further evaluate the separated speech. It is worth noting that, during the evaluation, the first microphone was selected as the reference by default.
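For completeness, a common formulation of the SI-SDR metric is sketched below (mean-removed signals and a scale factor obtained by projecting the estimate onto the reference); this is a standard definition and not code from the paper.

```python
import torch

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR: rescale the reference by the projection of the
    estimate onto it, then compare the scaled target with the residual."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    alpha = (est * ref).sum(dim=-1, keepdim=True) / ((ref ** 2).sum(dim=-1, keepdim=True) + eps)
    target = alpha * ref                     # scaled reference (the "target" component)
    noise = est - target                     # residual distortion
    return 10 * torch.log10((target ** 2).sum(dim=-1) / ((noise ** 2).sum(dim=-1) + eps) + eps)
```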

4. Results

4.1. Analysis of Iterative Results

The separation results of the proposed method at different stages are shown in Table 2. Here, \hat{y}_{q,1}^{(n)} denotes the speech signal of each speaker at the first microphone after the TFDPRNN module at the nth stage, and \hat{z}_q^{(n)} denotes the speech signal of each speaker after the MVDR beamformer at the nth stage.
From Table 2, we can see that after the first iteration, the SDR of \hat{z}_q^{(n)} increased by 17.46% and its SI-SDR increased by 22.66%. This shows that the first iteration brought substantial gains to the model. After the second iteration, although some improvement was still observed at each iteration, the gains of \hat{z}_q^{(n)} in SDR and SI-SDR became very small, with increases of less than 1%. This indicates that, as the number of iterations increased, the estimation of the SCMs became more accurate and gradually approached the inherent upper performance limit of MVDR beamforming. On the other hand, \hat{y}_{q,1}^{(1)} improved by about 51% over \hat{y}_{q,1}^{(0)} in SDR and SI-SDR, and \hat{y}_{q,1}^{(2)} improved by about 11% over \hat{y}_{q,1}^{(1)} on these two metrics. In the subsequent stages, although the performance continued to improve, the rate of improvement was less than 2%. The SIR and PESQ of both \hat{y}_{q,1}^{(n)} and \hat{z}_q^{(n)} improved after each iteration, while their STOI remained at about 0.99. Overall, the performance of the model clearly improved during the iteration process.
To illustrate the results graphically, Figure 3 shows a comparison of the spectrograms for a single speaker, including the original clean speech, the original reverberant speech, the mixed speech, the output signals of the neural network at different stages, and the output signals of the beamformer at different stages. The colors in the spectrograms correspond to the energy levels of the spectral components. Observing the neural network outputs in the left column, it is clear that as the processing stage increased, the clarity and quality of the spectrograms gradually improved. For example, compared with the spectrogram of the original reverberant signal, certain spectral components were attenuated in the spectrogram of the first neural network output (e.g., the red circles). After iterative processing, these components were gradually recovered in the spectrogram of the second neural network output. In the spectrograms in the right column, which show the beamformer outputs at different stages, no significant changes or improvements can be observed. Comparing the left and right columns, the outputs processed by the neural network exhibit relatively higher clarity than those processed by the beamformer. This comparison reveals the potential advantages of neural networks in processing speech signals.
In general, the iteration of the MVDR beamformer and post-separation is fully employed to promote their individual optimization, which ultimately improves the overall performance. In addition, the outputs of the MVDR beamformer and post-separation can both be used as the final output of the model. However, according to Table 2, the post-separation output performed better than the MVDR beamformer output, so the former was used as the final output of the model. Moreover, each additional processing stage increases the real-time factor (RTF) of the model and thereby the processing time. Considering this, the post-separation output after the first iteration was used when comparing our model with other methods.

4.2. Comparison with Reference Methods

At present, there are two kinds of mainstream methods for multi-channel speech separation: frequency-domain methods and time-domain methods. The most popular multi-channel speech separation methods are listed as follows:
(a)
The filter-and-sum network (FaSNet) [26] is a time-domain method that uses a neural network to implement beamforming. This method utilizes deep learning to automatically learn and optimize the weights and parameters of the beamformer. Its core advantage is adaptability, allowing the network to adjust to the complexity and diversity of the speech signal;
(b)
Narrow-band (NB) BLSTM [27] is a frequency-domain method using the BLSTM network, which is specially focused on narrow-band frequency processing and is trained by full-band methods to improve its performance. By processing each narrow-band frequency component separately in the frequency domain, this method can effectively identify and separate individual speakers in overlapped speech;
(c)
Beam-TasNet [9] is a classical speech separation method that combines time-domain and frequency-domain approaches. First, the time-domain neural network is used for pre-separation. Subsequently, these pre-separated speech signals are used to calculate the SCM of the beamformer. Finally, the separated signal is obtained by the beamformer;
(d)
Beam-guided TasNet [19] is a two-stage speech separation method that also combines both time-domain and frequency-domain approaches. In the first stage, the initial speech separation is performed using Beam-TasNet. In the second stage, the network structure remains the same as Beam-TasNet, but the input includes the output from the first stage. This iterative process helps to further refine the separation of the initial speech.
(e)
Beam-TFDPRNN [15] is our previously proposed time–frequency speech separation method, which, like Beam-TasNet, also uses a neural beamforming structure. This method is more advantageous in reverberant environments, because it uses a time–frequency domain network with stronger robustness to reverberation for the pre-separation.
The experimental results of the proposed method and the current popular methods on the spatialized version of the WSJ0-2mix dataset are shown in Table 3. It should be emphasized that the results of Beam-TasNet are cited directly from the original paper, while the results for beam-guided TasNet were obtained from our replication, which differed from the originally reported results by about 0.2 dB.
From Table 3, we can see that the performance of FaSNet and NB-BLSTM was not satisfactory. In comparison, Beam-TasNet and Beam-TFDPRNN demonstrated good separation performance, and the beam-guided TasNet further improved on Beam-TasNet by employing an iteratively refined structure. The proposed iBeam-TFDPRNN significantly outperformed all the other methods. In addition, the proposed model contains only 2.8 M parameters, fewer than most of the other methods. In conclusion, the method proposed in this paper demonstrated excellent performance compared with the reference methods.

4.3. Performance on the Joint Separation and Dereverberation Tasks

In the previous section, we discussed the performance of the proposed method on the separation task under reverberant conditions. In order to evaluate the proposed method more comprehensively and explore its behavior on different tasks, this section examines its performance on the joint separation and dereverberation task, where the goal is to separate reverberant mixtures and produce anechoic speech signals.
The experimental results in Table 4 show the performance of our proposed method and the reference methods on the dereverberation task. We can see that our proposed method demonstrated a significant advantage over the reference methods. Specifically, the proposed method achieved an SDR of 20.2 dB, much better than Beam-TasNet, and exceeded the beam-guided TasNet by 2.1 dB. Additionally, it exceeded the oracle mask-based MVDR by 8.2 dB and narrowed the gap to just 0.9 dB compared with the oracle signal-based MVDR. These results highlight the effectiveness of our proposed method in joint separation and dereverberation tasks, demonstrating its potential for real-world applications.

5. Discussion

The experimental results demonstrated the effectiveness of the proposed method in both the multi-channel speech separation and dereverberation tasks. The iterative results in Table 2 show that SDR and SI-SDR improved by 17.46% and 22.66% after the first iteration, indicating that the initial iteration effectively enhanced the model's performance. The spectrograms of the separated speech at different stages further demonstrate the effectiveness of the method: as the processing stages increased, the quality of the neural network's spectrographic output gradually improved, showing the advantages of neural networks in processing speech signals.
When our method was compared with other popular multi-channel speech separation methods, such as FaSNet, NB-BLSTM, Beam-TasNet, beam-guided TasNet and Beam-TFDPRNN, the proposed iBeam-TFDPRNN demonstrated superior performance. As shown in Table 3, our method significantly outperformed these methods in terms of SDR, SI-SDR, PESQ, and SIR. Notably, our model included only 2.8 M parameters, which was smaller in parameter size than most other methods.
In the joint separation and dereverberation tasks, our method achieved significant improvements over the reference methods. As shown in Table 4, our method achieved an SDR of 20.2 dB, achieving much better results than the oracle mask-based MVDR, Beam-TasNet, and beam-guided TasNet. In addition, this result was also very close to the performance of the oracle signal-based MVDR. These results highlight the potential of our method for real-world applications.
Although our method achieved good experimental results, some limitations remain. For example, the dataset used in the experiments did not contain noise, so it may not fully represent all real-world scenarios. Furthermore, each additional processing stage increases the RTF of the model, which may limit its practicality for real-time applications. In future work, we will explore more diverse datasets and optimize the computational efficiency of the model to address these limitations.

6. Conclusions and Future Work

In this paper, an iteratively refined multi-channel speech separation method is proposed to improve separation performance in complex environments. Benefiting from the strength of neural beamforming and the multi-stage iteratively refined structure, the proposed method achieved outstanding performance. The experiments on the spatialized version of the WSJ0-2mix corpus showed that the proposed method achieved significant improvements. Specifically, our method achieved an SDR improvement of 24.17 dB, demonstrating that it not only provides good separation performance in reverberant environments but also has significant advantages over current popular speech separation methods. In addition, the method also achieved an SDR of 20.2 dB on the joint separation and dereverberation task, further showing its promising capability. However, the dataset in this paper did not contain noise components. Therefore, exploring speech separation in noisy environments will be our future research direction, and the effectiveness of this method in realistic environments will be further validated by using noisy datasets such as LibriCSS [28] and WHAMR! [29].

Author Contributions

Conceptualization, X.Z. and C.B.; methodology, X.Z.; software, X.Z.; validation, X.Z., X.Y. and J.Z.; formal analysis, X.Z., X.Y. and J.Z.; investigation, X.Z.; resources, X.Z.; data curation, X.Z.; writing—original draft preparation, X.Z.; writing—review and editing, X.Z., C.B., X.Y. and J.Z.; visualization, X.Z.; supervision, C.B.; project administration, C.B.; funding acquisition, C.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61831019.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

The authors are grateful to the reviewers for their thorough comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, Z.; Li, J.; Xiao, X.; Yoshioka, T.; Wang, H.; Wang, Z.; Gong, Y. Cracking the Cocktail Party Problem by Multi-Beam Deep Attractor Network. In Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan, 16–20 December 2017; pp. 437–444.
  2. Qian, Y.; Weng, C.; Chang, X.; Wang, S.; Yu, D. Past Review, Current Progress, and Challenges Ahead on the Cocktail Party Problem. Front. Inf. Technol. Electron. Eng. 2018, 19, 40–63.
  3. Chen, J.; Mao, Q.; Liu, D. Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation. arXiv 2020, arXiv:2007.13975.
  4. Luo, Y.; Mesgarani, N. Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1256–1266.
  5. Subakan, C.; Ravanelli, M.; Cornell, S.; Bronzi, M.; Zhong, J. Attention Is All You Need in Speech Separation. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6 June 2021; pp. 21–25.
  6. Zhao, S.; Ma, Y.; Ni, C.; Zhang, C.; Wang, H.; Nguyen, T.H.; Zhou, K.; Yip, J.; Ng, D.; Ma, B. MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation. arXiv 2023, arXiv:2312.11825.
  7. Gannot, S.; Vincent, E.; Markovich-Golan, S.; Ozerov, A. A Consolidated Perspective on Multimicrophone Speech Enhancement and Source Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 692–730.
  8. Anguera, X.; Wooters, C.; Hernando, J. Acoustic Beamforming for Speaker Diarization of Meetings. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 2011–2022.
  9. Ochiai, T.; Delcroix, M.; Ikeshita, R.; Kinoshita, K.; Nakatani, T.; Araki, S. Beam-TasNet: Time-Domain Audio Separation Network Meets Frequency-Domain Beamformer. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6384–6388.
  10. Zhang, X.; Wang, Z.-Q.; Wang, D. A Speech Enhancement Algorithm by Iterating Single- and Multi-Microphone Processing and Its Application to Robust ASR. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 276–280.
  11. Erdogan, H.; Hershey, J.R.; Watanabe, S.; Mandel, M.I.; Roux, J.L. Improved MVDR Beamforming Using Single-Channel Mask Prediction Networks. In Proceedings of the Interspeech 2016, ISCA, San Francisco, CA, USA, 8 September 2016; pp. 1981–1985.
  12. Gu, R.; Zhang, S.-X.; Zou, Y.; Yu, D. Towards Unified All-Neural Beamforming for Time and Frequency Domain Speech Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 849–862.
  13. Xiao, X.; Zhao, S.; Jones, D.L.; Chng, E.S.; Li, H. On Time-Frequency Mask Estimation for MVDR Beamforming with Application in Robust Speech Recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 3246–3250.
  14. Luo, Y. A Time-Domain Real-Valued Generalized Wiener Filter for Multi-Channel Neural Separation Systems. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 3008–3019.
  15. Zhang, X.; Bao, C.; Zhou, J.; Yang, X. A Beam-TFDPRNN Based Speech Separation Method in Reverberant Environments. In Proceedings of the 2023 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Zhengzhou, China, 14 November 2023; pp. 1–5.
  16. Kavalerov, I.; Wisdom, S.; Erdogan, H.; Patton, B.; Wilson, K.; Roux, J.L.; Hershey, J.R. Universal Sound Separation. arXiv 2019, arXiv:1905.03330.
  17. Tzinis, E.; Wisdom, S.; Hershey, J.R.; Jansen, A.; Ellis, D.P.W. Improving Universal Sound Separation Using Sound Classification. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 96–100.
  18. Shi, Z.; Liu, R.; Han, J. LaFurca: Iterative Refined Speech Separation Based on Context-Aware Dual-Path Parallel Bi-LSTM. arXiv 2020, arXiv:2001.08998.
  19. Chen, H.; Yi, Y.; Feng, D.; Zhang, P. Beam-Guided TasNet: An Iterative Speech Separation Framework with Multi-Channel Output. arXiv 2022, arXiv:2102.02998.
  20. Wang, Z.-Q.; Erdogan, H.; Wisdom, S.; Wilson, K.; Raj, D.; Watanabe, S.; Chen, Z.; Hershey, J.R. Sequential Multi-Frame Neural Beamforming for Speech Separation and Enhancement. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19 January 2021; pp. 905–911.
  21. Yang, L.; Liu, W.; Wang, W. TFPSNet: Time-Frequency Domain Path Scanning Network for Speech Separation. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022.
  22. Yang, X.; Bao, C.; Zhang, X.; Chen, X. Monaural Speech Separation Method Based on Recurrent Attention with Parallel Branches. In Proceedings of the INTERSPEECH 2023, ISCA, Dublin, Ireland, 20 August 2023; pp. 3794–3798.
  23. Wang, Z.-Q.; Le Roux, J.; Hershey, J.R. Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 1–5.
  24. Allen, J.B.; Berkley, D.A. Image Method for Efficiently Simulating Small-Room Acoustics. J. Acoust. Soc. Am. 1979, 65, 943–950.
  25. Févotte, C.; Gribonval, R.; Vincent, E. BSS_EVAL Toolbox User Guide Revision 2.0; IRISA: Rennes, France, 2011; p. 22.
  26. Luo, Y.; Ceolini, E.; Han, C.; Liu, S.-C.; Mesgarani, N. FaSNet: Low-Latency Adaptive Beamforming for Multi-Microphone Audio Processing. arXiv 2019, arXiv:1909.13387.
  27. Quan, C.; Li, X. Multi-Channel Narrow-Band Deep Speech Separation with Full-Band Permutation Invariant Training. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23 May 2022; pp. 541–545.
  28. Chen, Z.; Yoshioka, T.; Lu, L.; Zhou, T.; Meng, Z.; Luo, Y.; Wu, J.; Xiao, X.; Li, J. Continuous Speech Separation: Dataset and Analysis. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020.
  29. Maciejewski, M.; Wichern, G.; McQuinn, E.; Roux, J.L. WHAMR!: Noisy and Reverberant Single-Channel Speech Separation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 696–700.
Figure 1. The overall structure of the initial separation.
Figure 2. The structure of iBeam-TFDPRNN.
Figure 3. Spectrographic comparison of the separated speech at different stages.
Table 1. Summary of data used for training, development, and testing.

Dataset        Number of Samples    Total Duration (h)
Training       20,000               30
Development    5000                 10
Testing        3000                 5
Table 2. Comparison of speech separation results at different stages. For each metric, the two values correspond to the beamformer output \hat{z}_q^{(n)} and the post-separation output \hat{y}_{q,1}^{(n)}, respectively; SDR, SI-SDR, and SIR are in dB.

Stage   RTF     SDR             SI-SDR          SIR             PESQ          STOI
0       0.024   - / 14.41       - / 13.91       - / 30.67       - / 4.13      - / 0.98
1       0.049   18.79 / 21.84   17.08 / 21.13   27.21 / 30.67   3.93 / 4.13   0.99 / 0.98
2       0.074   22.07 / 24.17   20.95 / 23.48   33.21 / 33.21   3.93 / 4.23   0.99 / 0.99
3       0.104   22.10 / 24.48   21.07 / 23.84   33.23 / 33.93   3.99 / 4.25   0.99 / 0.99
4       0.125   22.21 / 24.91   21.20 / 24.26   33.80 / 34.36   3.98 / 4.26   0.99 / 0.99
5       0.148   22.31 / 24.60   21.34 / 23.98   34.05 / 34.24   4.00 / 4.25   0.99 / 0.99
Table 3. Comparison with reference methods on the spatialized version of the WSJ0-2mix dataset (SDR, SI-SDR, and SIR in dB).

Method               Param    SDR      SI-SDR   PESQ    SIR      STOI
FaSNet               2.8 M    11.96    11.69    3.16    18.97    0.93
NB-BLSTM             1.2 M    8.22     6.90     2.44    12.13    0.83
Beam-TasNet          5.4 M    17.40    -        -       -        -
Beam-guided TasNet   5.5 M    20.52    19.49    3.88    27.49    0.98
Beam-TFDPRNN         2.7 M    17.20    16.80    3.68    26.77    0.96
iBeam-TFDPRNN        2.8 M    24.17    23.48    4.23    33.21    0.99
Table 4. Performance on the joint separation and dereverberation task (SDR in dB). The two columns give the post-separation output \hat{y}_{q,1}^{(n)} and the beamformer output \hat{z}_q^{(n)}.

Method                     SDR (\hat{y}_{q,1}^{(n)})   SDR (\hat{z}_q^{(n)})
Beam-TasNet                10.8                        14.6
Beam-guided TasNet         16.5                        17.1
iBeam-TFDPRNN              20.2                        19.7
Oracle mask-based MVDR     11.4                        12.0
Oracle signal-based MVDR   -                           21.1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
