Article

Dual-Channel Cosine Function Based ITD Estimation for Robust Speech Separation

Department of Electronic Engineering/Graduate School at Shenzhen, Tsinghua University, Beijing 100084, China
* Author to whom correspondence should be addressed.
Sensors 2017, 17(6), 1447; https://doi.org/10.3390/s17061447
Submission received: 11 April 2017 / Revised: 2 June 2017 / Accepted: 6 June 2017 / Published: 20 June 2017

Abstract

In speech separation tasks, many separation methods have the limitation that the microphones must be closely spaced, which means that these methods cannot cope with phase wrap-around. In this paper, we present a novel two-microphone speech separation scheme that does not have this restriction. The technique utilizes the estimation of interaural time difference (ITD) statistics and a binary time-frequency mask for the separation of mixed speech sources. The novelties of the paper are: (1) the extended application of delay-and-sum beamforming (DSB) and a cosine function for ITD estimation; and (2) the clarification of the connection between the ideal binary mask and the DSB amplitude ratio. Our objective quality evaluation experiments demonstrate the effectiveness of the proposed method.

1. Introduction

A common example of the well-known 'cocktail party' problem is the situation in which the voices of two speakers overlap. Enabling machines to solve the 'cocktail party' problem and recover an enhanced voice of a particular speaker has attracted considerable attention from researchers.
As for single-channel speech separation, independent component analysis (ICA) [1] and nonnegative matrix factorization (NMF) [2] are the conventional methods. However, ICA's assumption that the signals are statistically independent and the linearity of the NMF model limit their applications. Moreover, NMF generally requires a large amount of computation to determine the speaker-independent basis. Recently, in [3], the authors proposed an online adaptive process that is independent of parameter initialization, with noise reduction as a pre-processing step. Using adaptive parameters computed frame by frame, that work constructs a time-frequency (TF) mask for the separation process. In [4], the authors proposed a pseudo-stereo mixture model by reformulating a binaural blind speech separation algorithm for the monaural speech separation problem. The algorithm estimates the source characteristics and constructs the masks with parameters estimated through a weighted complex 2D histogram.
Normally, multichannel sources are separated by measuring the differences in arrival time and sound intensity between microphones [5,6], which are referred to as interaural time differences (ITD) and interaural intensity differences (IID). Interaural phase differences (IPD) have been used in [7,8]; in [7], the authors proposed a speech enhancement algorithm that utilizes phase-error based filters depending only on the phase of the signals. The performance of the above systems depends on how the ITD (or IPD) threshold is selected. Instead of a fixed threshold, in [9], the authors employed statistical modeling of angle distributions together with channel weighting to determine which signal components belong to the target signal and which components are part of the background. In [10], the authors proposed a method based on a prediction of the coherence function, and then estimated the signal-to-noise ratio (SNR) to generate a Wiener filter. In [11], the authors presented a method based on independent component analysis (ICA) and binary time-frequency masking. In [12], the authors proposed that a rough estimate of the channel level difference (CLD) threshold yielding the best Signal-to-Distortion Ratio (SDR) could be obtained by cross-correlating the separated sounds. In addition, a combination of nonnegative matrix factorization (NMF) with spatial localization via the generalized cross correlation (GCC) is applied to two-channel speech separation in [13]. For two-channel convolutive source separation, as the number of parameters in NMF2D grows exponentially and the number of frequency bases increases linearly, the issues of model-order fitness, initialization and parameter estimation become even more critical. In [14], the authors proposed a Gaussian Expectation Maximization and Multiplicative Update (GEM-MU) algorithm to compute the NMF2D with an adaptive sparsity model, and utilized a Gamma-Exponential process to estimate the number of components and the number of convolutive parameters in NMF2D.
The goal of this paper is to cope with competing-talker scenarios using dual-channel mixtures. In this study, we use DSB to generate a cosine function that evaluates the ITD from several frames of the short-time Fourier transform (STFT) and under which the target and competing signals exhibit the same characteristics. Then, we utilize a binary time-frequency mask to extract the target source. There are two contributions in this paper:
(1)
we extend delay-and-sum beamforming (DSB) [15] in a novel way to estimate the ITD; and
(2)
for the first time, we clarify the connection between the ideal binary mask and the DSB amplitude ratio. The framework of our approach is illustrated in Figure 1. Moreover, our proposed algorithm can handle the problem of phase wrap-around.
The remainder of this paper is organized as follows: Section 2 provides an overview of the time difference model. Our proposed approach, including the system overview and algorithm, is discussed in Section 3. Section 4 introduces the source separation. Then, Section 5 presents our evaluations of the system. Finally, Section 6 puts forward the main conclusions of the work.

2. Time Difference Model

We suppose that there are $I$ ($I = 2$) sources (subscript 1 denotes the target and subscript 2 the noise) in an acoustic environment. The signals received at the two microphones are defined, respectively, as:
$x_L(t) = \sum_{i=1}^{I} a_i^L s_i(t), \quad x_R(t) = \sum_{i=1}^{I} a_i^R s_i(t - \tau_i),$ (1)
where $a_i^L$ and $a_i^R$ denote the weighting coefficients of the left and right microphone recordings of the $i$-th source, respectively, and $\tau_i$ is the time difference of arrival (TDOA) of the $i$-th source between the two microphones. Equation (1) can be simplified as:
$x_L(t) = \sum_{i=1}^{I} s_i(t), \quad x_R(t) = \sum_{i=1}^{I} b_i s_i(t - \tau_i),$ (2)
where $b_i$ is the ratio of $a_i^R$ to $a_i^L$. By the short-time Fourier transform (STFT), the signals can be expressed as:
$X_L[m,k] = \sum_{i=1}^{I} S_i[m,k], \quad X_R[m,k] = \sum_{i=1}^{I} b_i S_i[m,k]\, e^{-j\omega_k \tau_i},$ (3)
where $m$ is the frame index and $\omega_k = 2\pi k/K$; $k$ and $K$ are the frequency index and the total window length, respectively. Under the assumption of W-disjoint orthogonality [16], Equation (3) can be rewritten as:
$X_L[m,k] \approx S_i[m,k], \quad X_R[m,k] \approx b_i S_i[m,k]\, e^{-j\omega_k \tau_i}.$ (4)
Thus, once the TDOA is obtained, we can make a simple binary decision as to whether the time-frequency bin $[m,k]$ is likely to belong to the target speaker or not.
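To make the model concrete, the following minimal numpy sketch (our own, not from the paper; the signal content, delays, and attenuation values are purely illustrative) builds the two-microphone mixture of Equation (2), applying the fractional delays $\tau_i$ in the frequency domain:

```python
import numpy as np

def delay(signal, tau_samples):
    """Delay a signal by a (possibly fractional) number of samples
    via a linear-phase shift in the frequency domain."""
    n = len(signal)
    bins = np.fft.rfftfreq(n)  # normalized frequency in cycles/sample
    spectrum = np.fft.rfft(signal) * np.exp(-2j * np.pi * bins * tau_samples)
    return np.fft.irfft(spectrum, n)

fs = 16000                          # sample rate used in the experiments
t = np.arange(2 * fs) / fs          # 2 s of signal, as in Section 5.2
s1 = np.sin(2 * np.pi * 220 * t)    # stand-in for the target source
s2 = np.sin(2 * np.pi * 331 * t)    # stand-in for the competing source
b1, b2 = 1.0, 1.0                   # attenuation ratios b_i of Equation (2)
tau1, tau2 = 2.4, -1.9              # ITDs in samples (tau_i * fs)

x_L = s1 + s2                                        # left microphone, Equation (2)
x_R = b1 * delay(s1, tau1) + b2 * delay(s2, tau2)    # right microphone, Equation (2)
```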

3. Proposed Approach

Delay-and-sum beamforming (DSB) is an effective means of speech enhancement. Our method is based on DSB in the time-frequency domain under the anechoic condition. In DSB, the enhanced speech signals in the time-frequency domain are modeled as:
$Y_1[m,k] = \frac{X_L[m,k] + X_R[m,k]\, e^{j\omega_k \hat{\tau}_1}}{2}, \quad Y_2[m,k] = \frac{X_L[m,k] + X_R[m,k]\, e^{j\omega_k \hat{\tau}_2}}{2},$ (5)
where $Y_1[m,k]$ and $Y_2[m,k]$ are the enhanced speech of the target and the interferer, respectively.
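As an illustration of Equation (5), here is a minimal numpy sketch (the helper names are our own assumptions, not the paper's implementation); the 1024-sample window and 75% overlap follow the experimental setup in Section 5.1:

```python
import numpy as np

WIN_LEN, HOP = 1024, 256   # 1024-sample window, 75% overlap (Section 5.1)

def stft(x, win_len=WIN_LEN, hop=HOP):
    """Hann-windowed STFT; returns an array of shape (frames m, bins k)."""
    w = np.hanning(win_len)
    frames = [x[i:i + win_len] * w
              for i in range(0, len(x) - win_len + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def dsb(X_L, X_R, tau_hat, win_len=WIN_LEN):
    """Delay-and-sum beamformer of Equation (5), steered to delay
    tau_hat (in samples): Y = (X_L + X_R * exp(j*omega_k*tau_hat)) / 2."""
    omega_k = 2 * np.pi * np.arange(X_L.shape[1]) / win_len  # omega_k = 2*pi*k/K
    return 0.5 * (X_L + X_R * np.exp(1j * omega_k * tau_hat))

X_L, X_R = stft(x_L), stft(x_R)   # x_L, x_R from the previous sketch
Y1 = dsb(X_L, X_R, tau1)          # steered to the target:     Y_1[m, k]
Y2 = dsb(X_L, X_R, tau2)          # steered to the interferer: Y_2[m, k]
```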
Theoretically, once correct estimates of $\tau_1$ and $\tau_2$ are obtained, Equation (5) can be written as:
$\frac{Y_1[m,k]}{Y_2[m,k]} = \begin{cases} \dfrac{1+b_1}{1+b_1 e^{j\omega_k(\tau_2-\tau_1)}}, & \text{if } [m,k] \in s_1, \\[2mm] \dfrac{1+b_2 e^{j\omega_k(\tau_1-\tau_2)}}{1+b_2}, & \text{if } [m,k] \in s_2. \end{cases}$ (6)
We define $g[k]$ as:
$g[k] = \frac{1}{M}\sum_{m=1}^{M}\left|\frac{Y_1[m,k]}{Y_2[m,k]}\right|^{\operatorname{sgn}\left(1-\left|\frac{Y_1[m,k]}{Y_2[m,k]}\right|\right)},$ (7)
where
$\operatorname{sgn}(x) = \begin{cases} 1, & x \ge 0, \\ -1, & x < 0. \end{cases}$ (8)
According to Equations (6) and (7), we denote by $g_{The}[k]$ the theoretical value of $g[k]$. Under the far-field assumption ($b_1 \approx b_2$), $g_{The}[k]$ is simplified to:
$g_{The}[k] \approx \frac{\left|1 + b_1 e^{j\omega_k(\tau_2 - \tau_1)}\right|}{1 + b_1}.$ (9)
We may obtain
$g_{The}[k] \approx \sqrt{1 - \frac{2 b_1 \left(1 - \cos(\omega_k (\tau_2 - \tau_1))\right)}{(1+b_1)^2}},$ (10)
where $g_{The}[k]$ is a cosine-shaped function. In particular, if $b_1$ equals 1, we have:
$g_{The}[k] \approx \left|\cos\left(\frac{\omega_k (\tau_2 - \tau_1)}{2}\right)\right|.$ (11)
Obviously, the maximum of $g_{The}[k]$ is 1. Furthermore, we let $g_{real}[k]$ be the value of $g[k]$ computed from the observed ratios in Equation (6). To ensure that the maximum of $g_{real}[k]$ is 1, we rectify $g_{real}[k]$ as:
$g_{real\_r}[k] = g_{real}[k] + 1 - \max_k g_{real}[k].$ (12)
We denote the minimum of $g_{real\_r}[k]$ by $g_{min}$. Under correct estimates of $\tau_1$ and $\tau_2$, $g_{real\_r}[k]$ approximately equals $g_{The}[k]$. According to Equation (10), $b_1$ can be estimated as:
$\hat{b}_1 = \frac{1 - g_{min}}{1 + g_{min}}.$ (13)
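Equations (7), (12) and (13) translate into a few lines of numpy. The sketch below assumes our reading of Equation (7), in which the sgn term acts as an exponent that folds magnitude ratios above one back below one; the function names are our own:

```python
import numpy as np

def g_of_k(Y1, Y2, eps=1e-12):
    """Equation (7): frame-averaged DSB magnitude ratio, with ratios above 1
    inverted (the sgn(1 - r) exponent), so that g[k] <= 1 for every band k."""
    r = np.abs(Y1) / (np.abs(Y2) + eps)
    folded = np.where(r > 1.0, 1.0 / r, r)
    return folded.mean(axis=0)             # average over the M frames

def rectify(g_real):
    """Equation (12): shift the measured curve so that its maximum is 1."""
    return g_real + 1.0 - g_real.max()

def estimate_b1(g_real_r):
    """Equation (13): attenuation ratio from the minimum of the cosine curve."""
    g_min = g_real_r.min()
    return (1.0 - g_min) / (1.0 + g_min)
```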
Figure 2 demonstrates the process of ITD estimation. Figure 3 gives an example of the cosine functions under different ITD estimates.
We define the criterion function as:
$J = 1 - \sum_{k=1}^{K} \left| g_{real\_r}[k] - g_{The}[k] \right|.$ (14)
Because of the periodicity of trigonometric functions, we restrict $|\omega_k (\tau_1 - \tau_2)| < \pi$. We use the summation over all frequency bands to avoid the phase wrap-around problem. Then, we have:
$\left(\hat{\tau}_1^{opt}, \hat{\tau}_2^{opt}\right) = \arg\max_{\hat{\tau}_1, \hat{\tau}_2} J.$ (15)
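A brute-force realization of the search in Equation (15), reusing the helper functions sketched above (the grid of candidate delays, in samples, is an assumption of ours, not a prescription of the paper):

```python
import numpy as np
from itertools import combinations

def g_theory(tau1, tau2, b1, n_bins, win_len=WIN_LEN):
    """Equation (9): theoretical curve for a candidate delay pair (in samples)."""
    omega_k = 2 * np.pi * np.arange(n_bins) / win_len
    return np.abs(1 + b1 * np.exp(1j * omega_k * (tau2 - tau1))) / (1 + b1)

def search_itd(X_L, X_R, tau_grid):
    """Maximize the criterion J of Equation (14) over all delay pairs.
    Only tau1 < tau2 is scanned (the lower triangle of Figure 6)."""
    best, best_J = (None, None), -np.inf
    for tau1, tau2 in combinations(tau_grid, 2):
        Y1, Y2 = dsb(X_L, X_R, tau1), dsb(X_L, X_R, tau2)
        g_real_r = rectify(g_of_k(Y1, Y2))
        g_the = g_theory(tau1, tau2, estimate_b1(g_real_r), X_L.shape[1])
        J = 1.0 - np.abs(g_real_r - g_the).sum()
        if J > best_J:
            best, best_J = (tau1, tau2), J
    return best
```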

4. Source Separation

After obtaining the ITDs and attenuation coefficients (namely $b_1$ and $b_2$), we adopt the masking method to separate the target and competing sources. First, we illustrate the effects of the attenuation coefficients. Then, we utilize the time-frequency mask based on the DSB ratio.

4.1. The Effects of Weighted Coefficients

In Equation (10), we assume $b_1 \approx b_2$, but in practice the experimental settings may not satisfy this hypothesis strictly. In this section, we set different values of $b_1$ and $b_2$ artificially to demonstrate the effectiveness of the criterion function in Equation (14). We verify the effects of $b_1$ and $b_2$ with a simple example. Assume that:
$x_1(t) = s_1(t) + s_2(t), \quad x_2(t) = b_1 s_1(t - 6.1) + b_2 s_2(t - 1.9).$ (16)
The details are shown in Figure 4. We can observe that even when the experimental settings do not strictly meet the assumption $b_1 \approx b_2$, the ITD can still be estimated accurately. Moreover, although the estimates $\hat{b}_1$ and $\hat{b}_2$ are rough, the binary mask is unaffected by the attenuation coefficients, since the DSB-based mask relies only on ITD information.
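As an illustrative check of the example in Equation (16), one can synthesize the mixture with $b_1 = 0.7$ and $b_2 = 1.5$ and run the search sketched in Section 3 (the grid range and resolution are our assumptions; the delay pair is recovered up to ordering):

```python
# Mixture of Equation (16): delays of 6.1 and 1.9 samples, unequal attenuations.
# s1, s2, delay, stft and search_itd come from the earlier sketches.
x1 = s1 + s2
x2 = 0.7 * delay(s1, 6.1) + 1.5 * delay(s2, 1.9)

tau_grid = np.arange(0.0, 8.0, 0.1)
tau_a, tau_b = search_itd(stft(x1), stft(x2), tau_grid)   # ~ (1.9, 6.1)
```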

4.2. Mask Based on DSB Ratio

Under the assumption of W-disjoint orthogonality, the ideal ratio mask is defined using the a priori energy ratio $R_{SNR}[m,k]$ [17]:
$R_{SNR}[m,k] = \frac{|Y_1[m,k]|^2}{|Y_1[m,k]|^2 + |Y_2[m,k]|^2}.$ (17)
In addition, the ideal binary mask is of the form:
$B[m,k] = \begin{cases} 1, & R_{SNR}[m,k] \ge \lambda, \\ 0, & R_{SNR}[m,k] < \lambda, \end{cases}$ (18)
where $\lambda$ is set to a value in $[0.2, 0.8]$.
In our theoretical framework, $\frac{1+b_1}{\left|1+b_1 e^{j\omega_k(\tau_2-\tau_1)}\right|}$ is greater than 1 according to Equation (6), while $\frac{\left|1+b_2 e^{j\omega_k(\tau_1-\tau_2)}\right|}{1+b_2}$ is always less than 1. Then, the DSB ratio is of the form:
$R_{DSB}[m,k] = \left|\frac{Y_1[m,k]}{Y_2[m,k]}\right| \begin{cases} \ge 1, & \text{if } [m,k] \in s_1, \\ < 1, & \text{if } [m,k] \in s_2. \end{cases}$ (19)
Comparing $R_{DSB}[m,k]$ to 1, the binary time-frequency mask is obtained as:
$M[m,k] = \begin{cases} 1, & \text{if } R_{DSB}[m,k] \ge 1, \\ 0, & \text{otherwise.} \end{cases}$ (20)
It is easy to see that when $\lambda$ is set to 0.5, $B[m,k]$ is equivalent to $M[m,k]$. Equations (6) and (20) demonstrate why $\lambda = 0.5$ provides the best performance under the assumption of W-disjoint orthogonality. Then, the speech can be separated as:
$\hat{S}_1[m,k] = M[m,k]\, X_1[m,k], \quad \hat{S}_2[m,k] = (1 - M[m,k])\, X_2[m,k],$ (21)
where $X_i[m,k]$ is defined as:
$X_i[m,k] = \frac{1}{2}\left[\mathrm{DFT}\left(x_L(t)\right) + \mathrm{DFT}\left(x_R(t + \tau_i)\right)\right].$ (22)
Finally, we can obtain the separated speech waveforms using the Inverse Fast Fourier Transform (IFFT) and OverLapping and Adding (OLA).
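The masking and resynthesis stage of Equations (19)-(22) can be sketched as follows; the istft helper with overlap-add (OLA) is our own construction rather than the paper's implementation, and it uses the same Hann window as the stft sketch above:

```python
import numpy as np

def istft(S, win_len=WIN_LEN, hop=HOP):
    """Inverse STFT with windowed overlap-add (OLA) resynthesis."""
    frames = np.fft.irfft(S, n=win_len, axis=1)
    w = np.hanning(win_len)
    out = np.zeros(hop * (len(frames) - 1) + win_len)
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + win_len] += frame * w
        norm[i * hop:i * hop + win_len] += w ** 2
    return out / np.maximum(norm, 1e-12)

def separate(X_L, X_R, tau1_hat, tau2_hat):
    """Mask from the DSB ratio (Equation (20)) applied to the steered
    outputs X_i of Equation (22), then resynthesis (Equation (21))."""
    Y1 = dsb(X_L, X_R, tau1_hat)   # X_1[m, k] in Equation (22)
    Y2 = dsb(X_L, X_R, tau2_hat)   # X_2[m, k] in Equation (22)
    mask = (np.abs(Y1) >= np.abs(Y2)).astype(float)   # R_DSB >= 1
    return istft(mask * Y1), istft((1.0 - mask) * Y2)

# Using the true delays from the first sketch for illustration:
s1_hat, s2_hat = separate(X_L, X_R, tau1, tau2)
```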

5. Experimental Evaluations

In this section, we first describe the experimental data and evaluation criteria that we used, and then present experimental results.

5.1. Experimental Setup

Figure 5 depicts the simulated experimental set-up. The sources are selected from the TIMIT database [18]. The sample rate of the audio files is 16,000 Hz. For simulated data, we evaluate the target speech separation performance using the Perceptual Evaluation of Speech Quality (PESQ), $C_{sig}$, $C_{bak}$ and $C_{ovl}$ [19]. These composite measures show moderate advantages over the existing objective measures [19]. To meet the SiSEC 2010 campaign's evaluation criteria, we adopt the standard Source-to-Interference Ratio (SIR) [20] for the SiSEC 2010 test data. For all of these objective measures, higher values mean better performance.
The window length is 1024 samples with an overlap of 75%. Voiced frames can be detected with a voice activity detector (VAD) [21] to avoid the degenerate case $Y_2[m,k] = 0$. In practice, $Y_2[m,k] = 0$ rarely occurs, so we do not apply this step in our experiments. Whenever the amplitude of $Y_2[m,k]$ is nonzero, we treat it as belonging to one of the speakers.

5.2. Simulated Data

We generate data for the setup in Figure 5 with source signals of duration 2 s. Reverberation is simulated using the open-source Room Impulse Response (RIR) package [22], which is based on the image method. We generate 100 mixed sentences for each experimental set. Table 1 and Table 2 show the ITD estimation results in terms of mean square error. In our experiments, ITD values are expressed in samples, i.e., $\tau \times f_s$. We compare our approach with the existing DUET [23], Messl [24], and Izumi [25] methods. Unlike the algorithms based on coherence, our method consolidates the estimation of $\tau_1$ and $\tau_2$ into a single cosine function, and it achieves better ITD estimates. Table 3 shows the relation between microphone distance and ITD estimation results. The real ITD is proportional to the distance, and the ITDs estimated by our method follow this rule. For all of the distances in our experiment, the proposed method provides better ITD estimates, which in turn influence the separation results. Figure 6 shows the ITD estimation details. Although our method does not take reverberation into consideration, the results demonstrate that it is effective under low-reverberation ($RT_{60}$ = 150 ms) conditions. Figure 7 shows the target source separation performance and illustrates that our method performs comparably. Figure 8 shows the target source separation performance for different microphone distances; compared with the other methods, the proposed method yields better results at all of the distances.

5.3. SiSEC 2010 Test Data

The D2-2 set of the Signal Separation Evaluation Campaign (SiSEC) [26] consists of two-microphone real-world recordings. We applied the proposed method to set1 for both room1 and room2. We only compare our method with the classical FastICA [27], since the results of other methods can be found online. Figure 9 shows the ITD estimation details. Table 1 and Table 2 illustrate that our method can achieve competitive results.
In Figure 10, we show how the mean SIR varies with $\lambda$ for room1 and room2. The mean SIR is symmetric about $\lambda = 0.5$, where it achieves its best value. These characteristics are consistent with our analysis.
Table 4 shows the separation performance for both room1 and room2.

6. Conclusions

In this paper, we have proposed a novel DSB-based method for dual-channel source separation. Our method, for the first time, employs an extension of DSB to estimate the interaural time difference (ITD), and it clarifies the connection between the ideal binary mask and the DSB amplitude ratio. Our method remains valid in the presence of phase wrap-around. Although it is based on the assumption of an anechoic environment, the results illustrate its effectiveness in low-reverberation environments ($RT_{60}$ = 150 ms). Objective evaluations demonstrate the effectiveness of the proposed method.
In this paper, we have focused on the estimation of the interaural time differences (ITD). In fact, the construction of an effective masking model is also critical. We could attempt to replace our time-frequency masking with an NMF2D model as proposed in [14], and adopt the GEM-MU algorithm and Gamma-Exponential process to separate the sound sources. Moreover, in the presence of background noise, the noise-reduction idea of [3] is also valuable for our dual-channel speech separation.

Author Contributions

Xuliang Li performed the experiments and analyzed the data; Zhaogui Ding designed the experiments and analyzed the data; Weifeng Li and Qingmin Liao helped to discuss the results and revise the paper. All authors have read and approved the submission of the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Kouchaki, S.; Sanei, S. Supervised single channel source separation of EEG signals. In Proceedings of the 2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Southampton, UK, 22–25 September 2013; pp. 1–5.
2. Gao, B.; Woo, W.; Dlay, S. Single-channel source separation using EMD-subband variable regularized sparse features. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 961–976.
3. Tengtrairat, N.; Woo, W.L.; Dlay, S.S.; Gao, B. Online noisy single-channel source separation using adaptive spectrum amplitude estimator and masking. IEEE Trans. Signal Process. 2016, 64, 1881–1895.
4. Tengtrairat, N.; Gao, B.; Woo, W.L.; Dlay, S.S. Single-channel blind separation using pseudo-stereo mixture and complex 2D histogram. IEEE Trans. Neural Netw. Learn. Syst. 2013, 24, 1722–1735.
5. Clark, B.; Flint, J.A. Acoustical direction finding with time-modulated arrays. Sensors 2016, 16, 2107.
6. Velasco, J.; Pizarro, D.; Macias-Guarasa, J. Source localization with acoustic sensor arrays using generative model based fitting with sparse constraints. Sensors 2012, 12, 13781–13812.
7. Aarabi, P.; Shi, G. Phase-based dual-microphone robust speech enhancement. IEEE Trans. Syst. Man Cybern. Part B Cybern. 2004, 34, 1763–1773.
8. Kim, C.; Stern, R.M.; Eom, K.; Lee, J. Automatic selection of thresholds for signal separation algorithms based on interaural delay. In Proceedings of the INTERSPEECH 2010, Chiba, Japan, 26–30 September 2010; pp. 729–732.
9. Kim, C.; Khawand, C.; Stern, R.M. Two-microphone source separation algorithm based on statistical modeling of angle distributions. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 4629–4632.
10. Yousefian, N.; Loizou, P.C. A dual-microphone algorithm that can cope with competing-talker scenarios. IEEE Trans. Audio Speech Lang. Process. 2013, 21, 145–155.
11. Pedersen, M.S.; Wang, D.; Larsen, J.; Kjems, U. Two-microphone separation of speech mixtures. IEEE Trans. Neural Netw. 2008, 19, 475–492.
12. Nishiguchi, M.; Morikawa, A.; Watanabe, K.; Abe, K.; Takane, S. Sound source separation and synthesis for audio enhancement based on spectral amplitudes of two-channel stereo signals. J. Acoust. Soc. Am. 2016, 140, 3428.
13. Wood, S.; Rouat, J. Blind speech separation with GCC-NMF. In Proceedings of the INTERSPEECH 2016, San Francisco, CA, USA, 8–12 September 2016.
14. Al-Tmeme, A.; Woo, W.L.; Dlay, S.S.; Gao, B. Underdetermined convolutive source separation using GEM-MU with variational approximated optimum model order NMF2D. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 35–49.
15. Brandstein, M.; Ward, D. Microphone Arrays: Signal Processing Techniques and Applications; Springer: Berlin, Germany, 2001.
16. Yilmaz, O.; Rickard, S. Blind separation of speech mixtures via time-frequency masking. IEEE Trans. Signal Process. 2004, 52, 1830–1847.
17. Srinivasan, S.; Roman, N.; Wang, D. Binary and ratio time-frequency masks for robust speech recognition. Speech Commun. 2006, 48, 1486–1501.
18. Zue, V.; Seneff, S.; Glass, J. Speech database development at MIT: TIMIT and beyond. Speech Commun. 1990, 9, 351–356.
19. Hu, Y.; Loizou, P.C. Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio Speech Lang. Process. 2008, 16, 229–238.
20. Vincent, E.; Sawada, H.; Bofill, P.; Makino, S.; Rosca, J.P. First stereo audio source separation evaluation campaign: Data, algorithms and results. In Independent Component Analysis and Signal Separation; Springer: Heidelberg, Germany, 2007; pp. 552–559.
21. Cho, Y.D.; Kondoz, A. Analysis and improvement of a statistical model-based voice activity detector. IEEE Signal Process. Lett. 2001, 8, 276–278.
22. Allen, J.B.; Berkley, D.A. Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am. 1979, 65, 943–950.
23. Wang, Y.; Yılmaz, Ö.; Zhou, Z. Phase aliasing correction for robust blind source separation using DUET. Appl. Comput. Harmonic Anal. 2013, 35, 341–349.
24. Mandel, M.; Weiss, R.J.; Ellis, D.P. Model-based expectation-maximization source separation and localization. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 382–394.
25. Izumi, Y.; Ono, N.; Sagayama, S. Sparseness-based 2ch BSS using the EM algorithm in reverberant environment. In Proceedings of the 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, 21–24 October 2007; pp. 147–150.
26. Araki, S.; Theis, F.; Nolte, G.; Lutter, D.; Ozerov, A.; Gowreesunker, V.; Sawada, H.; Duong, N.Q.K. The 2010 Signal Separation Evaluation Campaign (SiSEC 2010): Audio source separation. Lect. Notes Comput. Sci. 2010, 6365, 414–422.
27. Koldovsky, Z.; Tichavsky, P.; Oja, E. Efficient variant of algorithm FastICA for independent component analysis attaining the Cramér–Rao lower bound. IEEE Trans. Neural Netw. 2006, 17, 1265–1277.
Figure 1. Block diagram of the proposed approach. STFT: Short Time Fourier Transform, DSB: Delay-and-Sum Beamforming, ITD: Interaural Time Difference, IFFT: Inverse Fast Fourier Transform, OLA: OverLapping and Adding.
Figure 2. Flow chart of ITD estimation. $\hat{\tau}_1$ and $\hat{\tau}_2$ are the estimates of $\tau_1$ and $\tau_2$. If correct estimates of $\tau_1$ and $\tau_2$ are obtained, the cosine characteristic of $g_{The}[k]$ is identical to that of $g_{real}[k]$. Even though $g_{real}[k]$ exhibits no cosine characteristic under incorrect estimates, we can still use the cosine model to compute $g_{The}[k]$; in this situation, $g_{The}[k]$ clearly differs from $g_{real}[k]$. We find the true values of $\hat{\tau}_1$ and $\hat{\tau}_2$ iteratively: $\hat{\tau}_1$ and $\hat{\tau}_2$ are updated until $g_{The}[k]$ is identical to $g_{real}[k]$.
Figure 3. Cosine functions with different ITD estimates. $g_{The}[k]$ is identical to $g_{real\_r}[k]$ under correct ITD estimation, while $g_{The}[k]$ differs from $g_{real\_r}[k]$ under incorrect ITD estimation.
Figure 4. Source localization with different $b_1$ and $b_2$. Source localization is conducted in four different settings: (1) $b_1 = 1$, $b_2 = 1$; (2) $b_1 = 0.7$, $b_2 = 1$; (3) $b_1 = 1$, $b_2 = 1.5$; and (4) $b_1 = 0.7$, $b_2 = 1.5$. The ITD estimation is valid in all of the settings.
Figure 5. Placement of the microphones and sound sources. $S_1$ is the target source. $S_2^1$ and $S_2^2$ are the competing sources in the two different environments, respectively.
Figure 6. ITD estimation results in different environments. The horizontal coordinate corresponds to $\hat{\tau}_1$ and the vertical coordinate to $\hat{\tau}_2$. In fact, we need only process the lower triangular part because the estimates have a symmetric property.
Figure 7. The target speech separation performance of different methods in terms of Perceptual Evaluation of Speech Quality (PESQ), $C_{sig}$, $C_{bak}$ and $C_{ovl}$.
Figure 8. The target speech separation performance for different microphone distances in terms of Perceptual Evaluation of Speech Quality (PESQ), $C_{sig}$, $C_{bak}$ and $C_{ovl}$.
Figure 9. ITD estimation results and experimental set-up in room1 and room2. The horizontal coordinate corresponds to $\hat{\tau}_1$ and the vertical coordinate to $\hat{\tau}_2$. The distance between the two microphones is 8 cm.
Figure 10. Average Signal-to-Interference Ratio (SIR) for different $\lambda$. We compute the mean SIR for each $\lambda$. The result demonstrates that $\lambda = 0.5$ provides the best performance, which is consistent with our theoretical analysis. Furthermore, the separation results are symmetric in $\lambda$ when the signal-to-noise ratio based on $Y_1[m,k]$ and $Y_2[m,k]$ is used to generate the ideal binary mask.
Table 1. ITD estimation (in samples) on $S_1$ and $S_2^1$.

Method | Anechoic: $S_1$ | Anechoic: $S_2^1$ | $RT_{60}$ = 150 ms: $S_1$ | $RT_{60}$ = 150 ms: $S_2^1$
Real ITD | 0.000 | 2.373 | 0.000 | 2.373
DUET | 0.058 | 2.370 | 0.520 | 2.560
PHAT | 0.017 | 2.502 | 0.217 | 2.500
Izumi | 0.093 | 2.502 | 0.337 | 2.946
Proposed | 0.024 | 2.402 | 0.179 | 2.428
Table 2. Interaural Time Difference (ITD) estimation (in samples) on $S_1$ and $S_2^2$.

Method | Anechoic: $S_1$ | Anechoic: $S_2^2$ | $RT_{60}$ = 150 ms: $S_1$ | $RT_{60}$ = 150 ms: $S_2^2$
Real ITD | 0.000 | 4.060 | 0.000 | 4.060
DUET | 0.020 | 3.963 | 1.844 | 3.448
PHAT | 0.055 | 4.009 | 0.117 | 4.122
Izumi | 0.045 | 4.018 | 0.043 | 4.067
Proposed | 0.012 | 4.039 | 0.042 | 4.045
Table 3. ITD estimation (in samples) at $RT_{60}$ = 150 ms with different microphone distances.

Method | 5 cm: $S_1$ | 5 cm: $S_2^1$ | 10 cm: $S_1$ | 10 cm: $S_2^1$ | 15 cm: $S_1$ | 15 cm: $S_2^1$
Real ITD | 0.000 | 1.187 | 0.000 | 2.373 | 0.000 | 3.560
DUET | 0.271 | 1.069 | 0.520 | 2.560 | 1.678 | 3.135
PHAT | 0.163 | 1.296 | 0.217 | 2.500 | 0.126 | 3.652
Izumi | 0.234 | 1.334 | 0.337 | 2.946 | 0.031 | 3.891
Proposed | 0.112 | 1.125 | 0.179 | 2.428 | 0.041 | 3.527
Table 4. Signal-to-Interference Ratio (SIR, in dB) evaluations for room1 and room2. ICA: Independent Component Analysis.

Room1 | x1 | x2 | x3 | x4 | x5 | x6
Proposed $S_1$ | 11.8 | 7.8 | 14.7 | 26.4 | 4.9 | -0.9
Proposed $S_2$ | 10.5 | 12.2 | -9.2 | -2.7 | 14.0 | 21.2
ICA $S_1$ | 0.3 | -1.3 | 10.2 | 18.6 | -2.6 | -7.8
ICA $S_2$ | 3.3 | 4.8 | -8.34 | -7.6 | 10.0 | 18.3

Room2 | x1 | x2 | x3 | x4 | x5 | x6
Proposed $S_1$ | 3.3 | 6.2 | 12.3 | 27.5 | 3.2 | 1.0
Proposed $S_2$ | 12.8 | 11.1 | -10.0 | -1.3 | 15.8 | 22.5
ICA $S_1$ | 3.2 | -1.3 | 6.6 | 19.6 | -4.3 | -9.1
ICA $S_2$ | 6.2 | 4.8 | -7.3 | -8.5 | 12.0 | 19.4
