Binaural Heterophasic Superdirective Beamforming

The superdirective beamformer, while attractive for processing broadband acoustic signals, often suffers from white noise amplification. Its application therefore requires well-designed acoustic arrays with sensors of extremely low self-noise level, which is difficult if not impossible to attain. In this paper, a new binaural superdirective beamformer is proposed, which is composed of two sub-beamformers. Motivated by studies and facts in psychoacoustics, the two filters are designed to be orthogonal to each other so that the white noise components in the binaural beamforming outputs are incoherent, while the output interaural coherence of the diffuse noise is maximized, which is important for the brain to localize the sound source of interest. As a result, the signal of interest in the binaural superdirective beamformer's outputs is in phase while the white noise components are in random phase, so the human auditory system can better separate the acoustic signal of interest from white noise by listening to the outputs of the proposed approach. Experimental results show that the derived binaural superdirective beamformer is superior to its conventional monaural counterpart.


Introduction
Microphone arrays combined with proper beamforming methods have been used in a wide range of applications, such as hearing aids, smart headphones, smart speakers, voice communication, automatic speech recognition (ASR), human-machine interfaces, etc., to extract signals of interest from noisy observations. Many beamformers have been developed over the last few decades [1-17], among which the so-called superdirective beamformer [9] is particularly attractive. It is derived by maximizing the directivity factor (DF), which is equivalent to maximizing the gain in signal-to-noise ratio (SNR) in diffuse noise, subject to the distortionless constraint at the endfire direction. So, this beamformer is more efficient than other fixed beamformers for suppressing noise, interference, and reflections incident from different directions in noisy and reverberant environments [14,18]. It also has a frequency-invariant beampattern if the sensor spacing is small, which is essential for acquiring high-fidelity broadband acoustic and speech signals.
However, there is one major drawback with the existing superdirective beamforming approach: white noise amplification is very serious at low frequencies. As a consequence, the application of this beamformer requires well-designed arrays with high-quality microphones of extremely low self-noise level, e.g., at least below 0 dB(A) for second- and higher-order superdirective beamformers, which is difficult if not impossible to attain. This white noise amplification problem considerably limits the use of the superdirective beamformer in practice [18,19], and dealing with it has attracted a significant amount of research attention. A number of methods have subsequently been developed in the literature, including the so-called robust superdirective beamformer [9,20], the combined superdirective beamformer [21], the optimized superdirective beamformer [22], the subspace superdirective beamformer [11,23], and the reduced-rank superdirective beamformer [24]. While the problem is approached from different perspectives, the fundamental principle underlying these methods stays the same, i.e., making a compromise between the DF and the level of white noise gain (WNG) [10,20,22,25,26]. In other words, all these methods attempt to circumvent the white noise amplification problem by sacrificing the DF; as a consequence, the resulting beamformer may no longer be superdirective.
In this work, we take a different avenue. Instead of sacrificing the DF to improve the WNG, we design a superdirective beamformer that consists of two sub-beamformers, each generating an output. The two sub-beamformers are designed to be orthogonal to each other so that the acoustic signal of interest in the binaural outputs is in phase while the (amplified) white noise is in random phase. This design is strongly motivated by studies and facts in psychoacoustics, which show that the location (or direction) information of signals has a significant impact on speech intelligibility in the human auditory system.
Many experiments have been conducted to study the influence of the direction information of speech and white noise (especially at frequencies below 1 kHz) on speech intelligibility. Briefly, the impact of the source direction on the perception of speech in the human binaural auditory system can be classified into two scenarios: in phase and out of phase, while the perception of noise can be divided into three scenarios: in phase, random phase, and out of phase, where in phase means that, at every frequency, the binaural (two-channel) signals have the same phase, and out of phase means that, at every frequency, the phase difference between the binaural signals is exactly 180°. An illustration of the impact of the binaural signal and noise phase on speech intelligibility, inspired by [49], is shown in Figure 1, where the leftmost column indicates the phase relationship of the white noise at the left and right channels, and the first row represents the phase relationship of the binaural speech signals. Each combination indicates the influence of the interaural phase relations on the localization of speech and white noise in space. The circle shape indicates that the signal is concentrated in a limited area, in front of the head for the case of in phase or on either side of the ears for the case of random phase. The rectangle shape indicates that the signal spreads in the area behind both ears in the case of out of phase.
Listening tests have been conducted to study the intelligibility of different phase combinations. It is confirmed that the phase combination of the desired speech signal and white noise has a significant influence on intelligibility. A list of the six most common scenarios, summarized in [49] and presented in Table 1, can be divided into three categories: antiphasic, heterophasic, and homophasic. Listening tests showed that the antiphasic category corresponds to the highest intelligibility, which can be 25% higher than that of the homophasic category in low-SNR scenarios. The intelligibility of the heterophasic category is also higher than that of the homophasic case, though lower than that of the antiphasic scenario. Inspired by this, it is desirable to design beamformers that make the binaural outputs correspond to the antiphasic or heterophasic cases.

Figure 1. Illustration of different phase scenarios and the influence of the interaural phase relations on the localization of speech and white noise in space. The circle shape means that the signal is concentrated in a limited area, in front of the head for the case of in phase or on either side of the ears for the case of random phase. The rectangle shape indicates that the signal spreads in the area behind the ears, which is related to the case of out of phase. The red dotted circle indicates that the speech signal and the white noise are both in phase, which is related to the monaural superdirective beamformer. The blue circle indicates that the speech signal is in phase while the white noise is in random phase, which is related to the binaural superdirective beamformer developed in this work. This figure is a modified version of the results in [49].

Table 1. Different scenarios for the intelligibility study based on the phase relationship between speech and noise [49].

Scenario   Speech         Noise          Class
1          Out of phase   In phase       Antiphasic
2          In phase       Out of phase   Antiphasic
3          In phase       Random phase   Heterophasic
4          Out of phase   Random phase   Heterophasic
5          In phase       In phase       Homophasic
6          Out of phase   Out of phase   Homophasic

In this paper, we use the interaural coherence (IC) to describe the auditory localization information. Consider a diffuse noise field; when the IC of the binaural signals reaches its maximum, i.e., 1, there is a precise region of the sound source, which is located in the middle of the head, i.e., the in-phase case; however, when the binaural signals are completely incoherent, i.e., the IC equals 0, there are two independent sources at the two ears, which corresponds to the random-phase case. In many approaches in the literature, binaural processing generates two collinear filters, so the resulting output ICs for white (due to noise amplification) and diffuse noises are both 1. Hence, both the signal of interest and the noise in the human auditory system are perceived to be in the same region. Consequently, our brain will have difficulty separating them with a binaural presentation and intelligibility will certainly be affected. Clearly, the conventional monaural superdirective beamformer and conventional collinear binaural processing belong to the homophasic case, which has the lowest intelligibility. To improve intelligibility, we propose a binaural superdirective beamformer by constructing two orthogonal filters, with which the IC of the white noise components is equal to zero while the IC of the desired signal components is equal to one in the binaural outputs. Consequently, with the proposed method, our auditory system can more easily distinguish between the signal of interest and the (amplified) white noise by listening to the binaural outputs, leading to improved intelligibility [50].
An illustration of the proposed binaural superdirective beamformer with a uniform linear array is shown in Figure 2; it can suppress spatial noise while separating the desired signal and the white noise into different perception zones. Since the binaural superdirective beamformer developed in this paper corresponds to the heterophasic case [49] (see also Table 1), we name it the binaural heterophasic superdirective beamformer. Note that IC information has been used in traditional binaural speech enhancement methods for binaural cue preservation [51,52]. However, the proposed binaural heterophasic superdirective beamformer uses the IC magnitude information in a very different way: the two orthogonal sub-beamformers are designed to minimize the IC magnitude of the white noise components while maximizing the IC magnitude of the diffuse noise components in the binaural outputs, to achieve better perceptual separation of the signal of interest and white noise. The rest of this paper is organized as follows. In Section 2, we present the signal model and formulate the problem. In Section 3, we briefly review the derivation of the conventional superdirective beamformer. In Section 4, we discuss binaural linear filtering and the associated performance measures. In Section 5, we derive the binaural heterophasic superdirective beamformer. Then, in Section 6, we present some experiments to validate the theoretical study. Finally, conclusions are presented in Section 7.

Signal Model and Problem Formulation
We consider a source signal of interest (plane wave), in the farfield, that propagates from the azimuth angle, θ, in an anechoic acoustic environment at the speed of sound, i.e., c = 340 m/s, and impinges on a uniform linear array (ULA) consisting of 2M omnidirectional microphones. In this scenario, the corresponding steering vector (of length 2M) is [5]

d(ω, θ) = [ 1  e^{-jωτ0 cos θ}  · · ·  e^{-j(2M-1)ωτ0 cos θ} ]^T, (1)

where j is the imaginary unit with j² = -1, ω = 2πf is the angular frequency, with f > 0 being the temporal frequency, τ0 = δ/c is the delay between two successive sensors at the angle θ = 0, with δ being the interelement spacing, and the superscript T is the transpose operator. Assume that the desired signal comes from a specific direction θ = θs. From the steering vector defined in (1), we can express the frequency-domain observation signal vector of length 2M as [2]

y(ω) = X(ω) d(ω, θs) + v(ω), (2)

where X(ω) is the zero-mean source signal of interest, which is also called the desired signal, d(ω, θs) is the signal propagation vector, which is the same as the steering vector at θ = θs, and v(ω) is the zero-mean additive noise signal vector defined similarly to y(ω). We deduce that the 2M × 2M covariance matrix of y(ω) is

Φy(ω) = E[ y(ω) y^H(ω) ] = φX(ω) d(ω, θs) d^H(ω, θs) + φV(ω) Γv(ω), (3)

where E[·] denotes mathematical expectation, the superscript H is the conjugate-transpose operator, φX(ω) = E[ |X(ω)|² ] is the variance of X(ω), φV(ω) is the variance of the noise at a sensor, and Γv(ω) is the pseudo-coherence matrix of the noise. We assume that the noises at different sensors have the same variance. In order to design the superdirective beamformer, we make two basic assumptions [9,19].
(i) The sensor spacing, δ, is much smaller than the acoustic wavelength, λ = c/f, i.e., δ ≪ λ (this implies that ωτ0 ≪ 2π). This assumption is required so that the true acoustic pressure differentials can be approximated by finite differences of the microphones' outputs. (ii) The desired source signal propagates from the angle θs = 0 (endfire direction). Therefore, (2) becomes

y(ω) = X(ω) d(ω, 0) + v(ω), (4)

and, at the endfire, the value of the beamformer beampattern should always be equal to 1 (or maximal).
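To make the signal model concrete, here is a minimal numerical sketch of the steering vector in (1) under the small-spacing assumption (the helper name `steering_vector` and the parameter values are our own illustration, not from the paper):

```python
import numpy as np

def steering_vector(omega, theta, n_mics, tau0):
    """ULA steering vector d(omega, theta); tau0 = delta / c is the
    delay between two successive sensors at theta = 0."""
    m = np.arange(n_mics)
    return np.exp(-1j * omega * m * tau0 * np.cos(theta))

# Example: 2M = 8 microphones, 1 cm spacing, f = 1 kHz, endfire look direction.
c, delta, f = 340.0, 0.01, 1000.0
omega, tau0 = 2.0 * np.pi * f, delta / c
d = steering_vector(omega, 0.0, 8, tau0)

# Small-spacing assumption (i): omega * tau0 << 2 * pi.
assert omega * tau0 < 0.2
# Every entry of the steering vector has unit modulus.
assert np.allclose(np.abs(d), 1.0)
```

At 1 kHz with 1 cm spacing, ωτ0 ≈ 0.18 rad, which is indeed far below 2π, so the finite-difference approximation of assumption (i) holds.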
Our objective in this paper is to derive a binaural superdirective beamformer, which can take advantage of the human binaural auditory system to separate the desired speech signal from white noise so that the intelligibility of the beamformer's output signals will be higher than that of the output of the conventional (monaural) superdirective beamformer. To that end, we will find two distinct and useful estimates of X(ω), one for each of the binaural channels, so that, with our binaural hearing system, white noise amplification will be perceptually attenuated thanks to the binaural presentation.

Conventional Superdirective Beamformer
The conventional linear fixed beamforming technique is performed by applying a complex weight at the output of each microphone and then summing all the weighted outputs to get an estimate of the source signal [2,19], i.e.,

Z(ω) = h^H(ω) y(ω), (5)

where Z(ω) is the estimate of the desired source signal, X(ω), and h(ω) is a spatial linear filter of length 2M containing all the complex weights. We see from (5) that the distortionless constraint should be

h^H(ω) d(ω, 0) = 1. (6)

Now, we can define the directivity factor (DF) of the beamformer as [3,19,43]

D[h(ω)] = |h^H(ω) d(ω, 0)|² / [ h^H(ω) Γd(ω) h(ω) ], (7)

where

[Γd(ω)]ij = sinc[ω(j - i)τ0] = sin[ω(j - i)τ0] / [ω(j - i)τ0]. (8)

The matrix Γd(ω) can be viewed as the pseudo-coherence matrix of the spherically isotropic (diffuse) noise. By taking into account the distortionless constraint in (6), the maximization of the DF in (7) leads to the well-known superdirective beamformer [9,53]:

h_SD(ω) = Γd^-1(ω) d(ω, 0) / [ d^H(ω, 0) Γd^-1(ω) d(ω, 0) ], (9)

whose DF is

D[h_SD(ω)] = d^H(ω, 0) Γd^-1(ω) d(ω, 0), (10)

which, obviously, is maximal. Besides maximizing the DF, the other great advantage of h_SD(ω) is that the corresponding beampattern is almost frequency invariant. However, white noise amplification is a tremendous problem. Consequently, the superdirective beamformer can only be used with a very small number of microphones and/or with regularization of the matrix Γd(ω), but this regularization affects the DF as well as the shape of the beampattern, which makes the beamformer more frequency dependent. Therefore, there is still great interest in finding new ideas to improve this superdirective beamformer.
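The trade-off described above can be illustrated numerically. The sketch below is our own illustration (the helper names `diffuse_coherence` and `superdirective` are hypothetical): it builds the diffuse pseudo-coherence matrix from the sinc model, computes the superdirective filter, and shows both the supergain in DF, the severe loss in WNG, and the DF cost of diagonal regularization:

```python
import numpy as np

def diffuse_coherence(n_mics, omega, tau0):
    """[Gamma_d]_ij = sin(omega (j - i) tau0) / (omega (j - i) tau0)."""
    idx = np.arange(n_mics)
    # np.sinc(x) = sin(pi x) / (pi x), hence the division by pi
    return np.sinc(omega * (idx[None, :] - idx[:, None]) * tau0 / np.pi)

def superdirective(d, Gamma_d, eps=0.0):
    """h_SD = Gamma_d^-1 d / (d^H Gamma_d^-1 d); eps > 0 regularizes."""
    G = Gamma_d + eps * np.eye(len(d))
    g = np.linalg.solve(G, d)
    return g / (d.conj() @ g)

c, delta, f, n = 340.0, 0.01, 2000.0, 8
omega, tau0 = 2.0 * np.pi * f, delta / c
d = np.exp(-1j * omega * np.arange(n) * tau0)   # endfire steering vector
Gd = diffuse_coherence(n, omega, tau0)

def df(h):   # directivity factor
    return np.abs(h.conj() @ d) ** 2 / np.real(h.conj() @ Gd @ h)

def wng(h):  # white noise gain
    return np.abs(h.conj() @ d) ** 2 / np.real(h.conj() @ h)

h = superdirective(d, Gd)            # distortionless: h^H d = 1
h_reg = superdirective(d, Gd, 1e-2)  # regularized variant

assert np.isclose(np.abs(h.conj() @ d), 1.0)
assert df(h) > n          # supergain: DF well above the number of sensors
assert wng(h) < 1.0       # severe white noise amplification (WNG << 1)
assert df(h_reg) < df(h) and wng(h_reg) > wng(h)  # regularization trade-off
```

The last assertion is the point made in the text: regularizing Γd(ω) raises the WNG but necessarily lowers the DF, since the unregularized filter is the DF maximizer among distortionless filters.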

Binaural Linear Filtering and Performance Measures
In this section, we explain binaural linear filtering in connection with fixed beamforming and propose some important performance measures in this context. The extension of the conventional (monaural) fixed linear beamforming to the binaural case can be done by applying two complex-valued linear filters, h1(ω) and h2(ω), of length 2M, to the observed signal vector, y(ω), i.e.,

Z1(ω) = h1^H(ω) y(ω), Z2(ω) = h2^H(ω) y(ω), (11)

where Z1(ω) and Z2(ω) are the two estimates of the desired signal. It is clear that the two distortionless constraints are

h1^H(ω) d(ω, 0) = 1, (12)

h2^H(ω) d(ω, 0) = 1. (13)

A very important performance measure is the input SNR, which can be obtained from the signal model:

iSNR(ω) = φX(ω) / φV(ω). (14)

Accordingly, the binaural output SNR can be defined as

oSNR[h1(ω), h2(ω)] = φX(ω) [ |h1^H(ω) d(ω, 0)|² + |h2^H(ω) d(ω, 0)|² ] / [ h1^H(ω) Φv(ω) h1(ω) + h2^H(ω) Φv(ω) h2(ω) ], (15)

where Φv(ω) = φV(ω) Γv(ω) is the covariance matrix of v(ω). In the particular case of h1(ω) = i_i and h2(ω) = i_j, the binaural output SNR is equal to the input SNR, where i_i and i_j are, respectively, the ith and jth columns of I_2M (i.e., the 2M × 2M identity matrix). According to (14) and (15), the binaural SNR gain can be expressed as

G[h1(ω), h2(ω)] = oSNR[h1(ω), h2(ω)] / iSNR(ω) = [ |h1^H(ω) d(ω, 0)|² + |h2^H(ω) d(ω, 0)|² ] / [ h1^H(ω) Γv(ω) h1(ω) + h2^H(ω) Γv(ω) h2(ω) ]. (16)
From the above definitions, the following two measures, which are very helpful for binaural fixed beamforming, can be deduced:
• the binaural white noise gain (WNG):

W[h1(ω), h2(ω)] = [ |h1^H(ω) d(ω, 0)|² + |h2^H(ω) d(ω, 0)|² ] / [ h1^H(ω) h1(ω) + h2^H(ω) h2(ω) ], (17)

• and the binaural DF:

D[h1(ω), h2(ω)] = [ |h1^H(ω) d(ω, 0)|² + |h2^H(ω) d(ω, 0)|² ] / [ h1^H(ω) Γd(ω) h1(ω) + h2^H(ω) Γd(ω) h2(ω) ], (18)

where Γd(ω) is defined in the previous section. The beampattern is another fundamental performance measure for fixed beamformers. The binaural power beampattern can be defined as

B[h1(ω), h2(ω)](θ) = |h1^H(ω) d(ω, θ)|² + |h2^H(ω) d(ω, θ)|². (19)

In order to have two meaningful estimates of the desired signal, we are going to extensively exploit the interaural coherence (IC) of the noise. It is well known that, in a multi-source environment, the IC (or its modulus) is important for source localization since it is very strongly related to the two principal binaural cues, i.e., the interaural time difference (ITD) and the interaural level difference (ILD), that the brain uses to localize sounds. Psychoacoustically, the localization performance decreases when the IC decreases [33]. Furthermore, the IC significantly affects the perceived width of the acoustic field.
Let A(ω) and B(ω) be two zero-mean complex-valued random variables. The coherence function (CF) between A(ω) and B(ω) is defined as

γAB(ω) = E[ A(ω) B*(ω) ] / √( E[ |A(ω)|² ] E[ |B(ω)|² ] ), (20)

where the superscript * is the complex-conjugate operator. It is clear that 0 ≤ |γAB(ω)|² ≤ 1.
For any pair of sensors (i, j), the input IC of the noise is simply the CF between Vi(ω) and Vj(ω), i.e.,

γViVj(ω) = E[ Vi(ω) Vj*(ω) ] / √( E[ |Vi(ω)|² ] E[ |Vj(ω)|² ] ) = [Γv(ω)]ij. (21)

For white noise, the input IC is γw(ω) = 0 (i ≠ j), obviously, while for diffuse noise it is γd(ω) = sinc[ω(j - i)τ0]. Similarly, we can define the output IC of the noise as the CF between the filtered noises in Z1(ω) and Z2(ω), i.e.,

γ[h1(ω), h2(ω)] = h1^H(ω) Γv(ω) h2(ω) / √( [ h1^H(ω) Γv(ω) h1(ω) ] [ h2^H(ω) Γv(ω) h2(ω) ] ). (22)

In the particular case of h1(ω) = i_i and h2(ω) = i_j, the input and output ICs are equal, i.e., γViVj(ω) = γ[i_i, i_j](ω). It can be checked that the output ICs of white (with the same power) and diffuse noises can be written as, respectively,

γw[h1(ω), h2(ω)] = h1^H(ω) h2(ω) / √( [ h1^H(ω) h1(ω) ] [ h2^H(ω) h2(ω) ] ) (23)

and

γd[h1(ω), h2(ω)] = h1^H(ω) Γd(ω) h2(ω) / √( [ h1^H(ω) Γd(ω) h1(ω) ] [ h2^H(ω) Γd(ω) h2(ω) ] ). (24)

In many approaches in the literature, the two derived filters h1(ω) and h2(ω) are collinear, i.e., h2(ω) = ς(ω) h1(ω), where ς(ω) ≠ 0 is a complex-valued number. In this case, one can check that |γw[h1(ω), h2(ω)]| = |γd[h1(ω), h2(ω)]| = 1. Since the desired source signal is also fully coherent at any pair of sensors, both the desired signal and the noise are perceived in the same region. As a result, our brain will have difficulty separating them with a binaural presentation and intelligibility will certainly be affected. For better separation between the white noise and the desired source, we should find orthogonal filters since, in this scenario, the output IC for white noise will be equal to 0, just like its corresponding input IC.
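The collinear-versus-orthogonal distinction above is easy to verify numerically. The following sketch (our own illustration; `output_ic` is a hypothetical helper) checks that collinear filters give a white-noise output IC of modulus one, while orthogonal filters drive it to zero:

```python
import numpy as np

def output_ic(h1, h2, Gamma):
    """Output IC of the noise: h1^H Gamma h2 normalized by the
    square root of the two filtered noise powers."""
    num = h1.conj() @ Gamma @ h2
    den = np.sqrt(np.real(h1.conj() @ Gamma @ h1) *
                  np.real(h2.conj() @ Gamma @ h2))
    return num / den

n = 8
rng = np.random.default_rng(0)
h1 = rng.standard_normal(n) + 1j * rng.standard_normal(n)
I = np.eye(n)  # pseudo-coherence matrix of white noise

# Collinear filters, h2 = sigma * h1 with sigma != 0: |IC| = 1.
h2 = (0.3 - 0.7j) * h1
assert np.isclose(np.abs(output_ic(h1, h2, I)), 1.0)

# Orthogonal filters (Gram-Schmidt): the white-noise output IC vanishes.
g = rng.standard_normal(n) + 1j * rng.standard_normal(n)
h2 = g - (h1.conj() @ g) / (h1.conj() @ h1) * h1
assert np.isclose(np.abs(h1.conj() @ h2), 0.0)
assert np.isclose(np.abs(output_ic(h1, h2, I)), 0.0)
```

This is precisely the mechanism exploited in the next section: orthogonality of the two sub-beamformers decorrelates the (amplified) white noise at the two binaural outputs.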

Binaural Heterophasic Superdirective Beamformer
In this section, we consider orthogonal binaural filters, i.e., h1^H(ω) h2(ω) = 0, since we want the output IC for white noise to be zero. We also want to maximize γd[h1(ω), h2(ω)] so that not only are the signals of interest from a point source coherent at the two binaural outputs, but the diffuse (or any correlated) noise at the binaural outputs is also perceived as little diffuse as possible. For that, we will exploit the maximum modes of this CF.
It can be shown that the two orthogonal filters that maximize (24) are [55] In this case, we get the first maximum mode of the CF: with corresponding vectors q+,1(ω) and q-,1(ω), where Similarly, we find that all the M maximum modes of the CF are for m = 1, 2, . . . , M, where It can be verified that From (35), the two semi-orthogonal matrices (2M × M) can be written as where with I_M being the M × M identity matrix. It can be shown that where are two diagonal matrices of size M × M. Let N be a positive integer with 2 ≤ N ≤ M (a different value of N gives a different degree of tradeoff between the WNG and DF). We define the two semi-orthogonal matrices (of size 2M × N): In the rest of this section, we consider orthogonal filters of the form: where is a complex-valued filter of length N. For this class of orthogonal filters, the output IC for diffuse noise is where It can be shown that With (47), the binaural WNG, DF, and power beampattern can be expressed as, respectively, and where is a matrix of size N × 2, with the distortionless constraint being To fulfill this constraint, we must take N ≥ 2. Substituting (47) into (12) and using the distortionless constraint, the variances of Z1(ω) and Z2(ω) follow, where Q±,:N(ω) = Q+,:N(ω) for φZ1(ω) and Q±,:N(ω) = Q-,:N(ω) for φZ2(ω). In the case of diffuse-plus-white noise, i.e., Γv(ω) = Γd(ω) + αI_2M, where α is a parameter that determines the relative level between the diffuse and white noises, (58) simplifies, showing that φZ1(ω) = φZ2(ω). Again, using (57), we find the cross-correlation between Z1(ω) and Z2(ω), whose form for diffuse-plus-white noise, as expected, does not depend on the white noise. For Γv(ω) = Γd(ω) + αI_2M, the output IC of the estimated signals is given in (62). We deduce that, for large input SNRs, the localization cues of the estimated signals depend mostly on those of the desired signal, while for low SNRs, they depend mostly on those of the diffuse-plus-white noise.
One possible binaural superdirective beamformer can be obtained by minimizing the sum of the filtered diffuse noise signals subject to the distortionless constraint in (57). We easily get the optimal common filter and the corresponding binaural DF, from which the proposed binaural superdirective beamformer follows. Note that another form of the binaural superdirective beamformer can be derived by maximizing the binaural DF in (18) subject to the distortionless constraints; we leave this derivation to the interested reader to keep the paper concise.
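The orthogonality mechanism behind the construction above can be illustrated with one concrete choice of semi-orthogonal matrices. The sketch below is our own plausible construction using the M × M exchange matrix J_M, not necessarily the exact Q±(ω) of the paper; it only demonstrates why sub-beamformers of the form h1 = Q+ g and h2 = Q- g are orthogonal for any common filter g, which forces the white-noise output IC to zero:

```python
import numpy as np

M = 4
I_M = np.eye(M)
J_M = np.fliplr(I_M)   # M x M exchange (flip) matrix; J_M^T J_M = I_M

# One plausible pair of semi-orthogonal 2M x M matrices (illustrative choice)
Qp = np.vstack([I_M,  J_M]) / np.sqrt(2.0)
Qm = np.vstack([I_M, -J_M]) / np.sqrt(2.0)

assert np.allclose(Qp.T @ Qp, I_M)               # semi-orthogonality
assert np.allclose(Qp.T @ Qm, np.zeros((M, M)))  # mutual orthogonality

# Any common complex filter g then yields orthogonal sub-beamformers h1, h2,
# which makes the white-noise components of the two outputs incoherent.
rng = np.random.default_rng(1)
g = rng.standard_normal(M) + 1j * rng.standard_normal(M)
h1, h2 = Qp @ g, Qm @ g
assert abs(h1.conj() @ h2) < 1e-12
```

Because Q+^H Q- = (I_M - J_M^T J_M)/2 = 0, the orthogonality h1^H h2 = g^H Q+^H Q- g = 0 holds for every g, so the white-noise output IC in (23) is identically zero regardless of how g is optimized against the diffuse noise.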

Experiments and Analysis
In this section, we study the performance of the developed binaural heterophasic superdirective beamforming method and compare it to the monaural superdirective beamformer through experiments. For a fair comparison, the orders of the binaural heterophasic and monaural superdirective beamformers are set to the same number so that the DFs of the two beamformers are similar.

Performance Analysis
We first evaluate the beampattern (given in (55)) of the binaural superdirective beamformer. A ULA is used with an interelement spacing equal to 1 cm. The beampatterns of the derived beamformer are plotted in Figure 3, where M = 6, 8, 10, 12, and N = M/2, at f = 1 kHz. Note that, given a ULA with M microphone sensors, one can design binaural superdirective beamformers of order from 1 to M/2. When the order increases, the DF becomes larger but the WNG becomes smaller. In this work, we only show the case with N = M/2 in the simulation for maximum DF. Two-dimensional (2D) plots of the corresponding beampatterns are shown in Figure 4; one can observe that, in all cases, they are almost frequency invariant.
Next, we study the performance of the binaural heterophasic superdirective beamformer in terms of the WNG and DF, according to (53) and (54), respectively. The results are plotted in Figure 5, where the first-, second-, third-, and fourth-order binaural superdirective beamformers are designed with M = 6, 8, 10, 12, respectively (this is the basic requirement for the design of the binaural heterophasic superdirective beamformer, as shown in Section 5). One can see that the WNGs of the binaural superdirective beamformers decrease with the order while the DFs increase with the order. Besides, for each order, the DF does not change much with frequency, which is an important property for processing broadband signals like speech. Figure 6 plots the WNGs and DFs of the binaural superdirective beamformers versus the parameter N, where 2 ≤ N ≤ M/2. This parameter is introduced to gain flexibility in achieving a compromise between the WNG and DF. As seen, the DF increases while the WNG decreases as the value of N increases. In practice, one can tune the parameter N according to the application requirements. Next, we study the ICs of the binaural heterophasic and conventional superdirective beamformers under the same conditions according to (23) and (24). Figure 7 plots the output ICs of both beamformers as a function of frequency in white and diffuse noise, respectively. As seen, in the diffuse noise case, the ICs of both beamformers are equal to one within the studied frequency range; in the white noise case, the IC of the binaural superdirective beamformer is equal to zero while that of the conventional superdirective beamformer is equal to one.
This means that in the two output signals of the binaural superdirective beamformer, the speech signal is completely coherent while the white noise is completely incoherent; so, the output signals correspond to the heterophasic case discussed in Section 1, in which the speech and white noise can be regarded as two separate directional sources in space. Figure 8 plots the IC magnitude of the outputs of the binaural heterophasic superdirective beamformer, given in (62), versus frequency in different input SNR conditions. One can see that this IC increases with frequency. This is due to the fact that white noise amplification mainly happens at low frequencies: the output IC in white noise is zero, causing the low-frequency output IC of the entire signal to approach zero. This shows the impact of white noise amplification from the perspective of the output IC. For a fixed frequency, the output IC increases with the input SNR and approaches one at high input SNRs. This can be easily explained: as the input SNR increases, the desired signal component dominates the beamforming output; since the output IC of the desired signal is one, the output IC of the two output signals also approaches one. Consequently, one can conclude that, for high input SNRs, the localization cues of the estimated signals depend mostly on those of the desired signal, while at low SNRs, they depend mostly on those of the noise.

Experiments in Real Environments
In this subsection, we evaluate the performance of the proposed binaural heterophasic superdirective beamformer in real acoustic environments. The experiments were conducted in a 10.5 × 6 × 2.8 m conference room. A ULA consisting of 8 microphones is used, where the interelement spacing is δ = 1.1 cm. The SNR of the microphones is 60 dB(A). A photo of the designed array and the experimental setup is shown in Figure 9. To make the experiments repeatable, we first used the microphone array to record sound signals from a loudspeaker located in the ULA's endfire direction. Then, both the conventional and binaural superdirective beamformers were implemented to get the outputs. Figure 10 plots the time-domain waveforms and spectrograms of the output signals of the conventional and binaural superdirective beamformers. It is clearly seen that the outputs of both beamformers suffer from serious white noise amplification, where the desired signal is almost covered by the white noise. It is also seen from the spectrograms that white noise amplification mainly occurs at low frequencies.
As emphasized previously, the main advantage of the proposed binaural superdirective beamforming method is that it enables the human binaural auditory system to better separate the signal of interest from white noise after beamforming. To confirm this, we performed some subjective listening experiments. First, we obtained a series of output signals of the implemented conventional and binaural superdirective beamformers. Specifically, we extracted seven audio clips from the "Voice of America", each of length 20 s. After playing and recording through the loudspeaker and the microphone array shown in Figure 9, we used the two kinds of beamformers to perform superdirective beamforming and obtain the output signals. Then, five subjects were asked to listen to the output signals and mark the zones of the sound source and white noise on the horizontal plane within a predesigned circle. Finally, we extracted the images sketched by each subject and averaged them to get the experimental results (note that in these experiments, we selected five subjects from the CIAIC-Center of Intelligent Acoustics and Immersive Communications, who are experienced in acoustic analysis and can clearly distinguish auditory events). Only four different zones, i.e., the front, back, left, and right sides, were provided for the subjects to choose from; an illustration of the test is shown in Figure 11. Figure 12 presents the auditory maps averaged over the five subjects on the horizontal plane. As can be seen, for the conventional superdirective beamformer, all signals (desired signal plus white noise) are perceived to be in the middle of the head, which corresponds to the homophasic case. In contrast, for the binaural superdirective beamformer, the signal of interest is perceived to be in the median plane of the head while the white noise is located on either side of the head, which corresponds to the heterophasic case.
As discussed previously and summarized in Table 1, the speech intelligibility in the heterophasic case is higher than in the homophasic case (approximately 4 dB higher [49]). Consequently, the proposed binaural superdirective beamformer yields better intelligibility than the conventional one.

Figure 11. An illustration of the auditory map for the subjects (horizontal plane). During the test, the subjects were asked to mark the areas according to the sound source locations they heard through headphones.

Conclusions
In this paper, we addressed the problem of superdirective beamforming with small-spacing microphone arrays. While it can achieve the maximum spatial gain to suppress acoustic noise, the traditional superdirective beamformer suffers from white noise amplification, which is particularly serious at low frequencies. Many methods have been developed in the literature to deal with this problem, but they all pay the price of sacrificing the DF, and the resulting beamformers may no longer be superdirective. Motivated by studies and facts in psychoacoustics, we developed in this paper a binaural heterophasic superdirective beamformer, which consists of two sub-beamforming filters, one for each of the binaural channels. These two sub-beamformers are constrained to be orthogonal to each other to minimize the IC of the white noise components in the binaural outputs while maximizing the IC of the diffuse noise components. As a result, the signal of interest in the binaural superdirective beamformer's outputs is in phase while the white noise is in random phase, so that the human auditory system is able to more easily separate the acoustic signal of interest from white noise by listening to the outputs of the proposed beamformer. Simulations and experiments were carried out to evaluate the performance of the proposed binaural superdirective beamformer. The results corroborate the theoretical analysis and confirm that the binaural superdirective beamformer corresponds to the heterophasic case studied in psychoacoustics. Based on the listening tests reported in the psychoacoustic literature, the improvement in intelligibility is expected to be approximately 4 dB in low-SNR conditions.