Towards Robust Multiple Blind Source Localization Using Source Separation and Beamforming

Multiple blind sound source localization is a key technology for a myriad of applications such as robotic navigation and indoor localization. However, existing solutions can only locate a few sound sources simultaneously, owing to the limitation imposed by the number of microphones in an array. To this end, this paper proposes a novel multiple blind sound source localization algorithm using Source seParation and BeamForming (SPBF). Our algorithm overcomes the limitations of existing solutions and can locate more blind sources than the number of microphones in an array. Specifically, we propose a novel microphone layout that enables salient multiple-source separation while still preserving the arrival-time information of each source. We then perform source localization via beamforming using each demixed source. This design minimizes mutual interference between different sound sources, thereby enabling finer AoA estimation. To further enhance localization performance, we design a new spectral weighting function that improves the signal-to-noise ratio, allowing a relatively narrow beam and thus finer angle-of-arrival estimation. Simulation experiments under typical indoor conditions demonstrate a maximum localization error of only 4° even with up to 14 sources.


Introduction
Multiple blind sound source localization is an enabling technology for many practical applications such as indoor localization [1,2], radar sensing [3], and robotic navigation [4]. The typical underlying technique for source localization is Angle of Arrival (AoA) estimation (hereafter, we use AoA estimation to denote source localization as well), with algorithms such as MUSIC [5,6], ESPRIT [7,8], SAGE [9], beamforming [10], and over-determined source separation and localization [11]. Despite their success in obtaining correct AoA information even in the presence of multiple sound sources, these common solutions share an important limitation: they can only locate fewer sources than the number of microphones in the array. Moreover, these algorithms only achieve sufficient resolution when the input sources have limited signal bandwidth; otherwise, performance degrades significantly. Spectral weighting is a feasible approach to improve accuracy, which, however, is vulnerable to interference [12].
Considering the above limitations, another category of methods uses binaural localization cues combined with Time-Frequency (T-F) masking for source separation, which allows separating more input sources than the number of microphones. Motivated by the ability of the human ear to localize sound in 3D, the authors of [13] present an algorithm for binaural localization with only two microphones. This algorithm processes the mixture with Head-Related Transfer Functions followed by a clustering technique, and is able to locate multiple concurrent sound sources in both azimuth and elevation. However, it requires prior knowledge of the number of input sources. In [14], the authors propose to use Interaural Level/Phase Differences (ILD and IPD) for separating and localizing multiple sound sources, but the computational requirements of this algorithm are prohibitively high. A deep-learning-oriented, mask-based source separation algorithm is proposed in [15], whose performance heavily depends on the amount of training data, making the system relatively difficult to deploy. In [16], a single-channel source separation algorithm is presented, but it only reports satisfactory results for a limited number of input sources.
Figure 1. Microphone array structure. The upper linear array is appropriate for 2D source localization while the bottom two-layer circular array can be utilized in 3D scenarios.
To this end, in this paper, we propose a novel joint Source seParation and BeamForming algorithm (SPBF) for multiple blind source localization. Our algorithm is based on a novel microphone layout, shown in Figure 1, that contains two groups of microphone arrays. One microphone array sits on top of the other, and the distance between the two arrays is kept small. The two closest microphones across the two groups form a pair and are used for source separation. Since the arrival time is preserved after source separation, we then exploit each demixed signal individually for AoA estimation via delay-and-sum beamforming. This separation-first-then-localization strategy minimizes mutual interference between different sound sources, thereby enabling finer AoA estimation. To further improve the beamforming result, we introduce a new spectral weighting function that yields a sharper beam pattern, providing not only better AoA estimation resolution but also a higher SNR. The major contributions of this paper are summarized as follows:
• We design a new microphone array layout. With appropriate signal processing, it allows locating more sources than the number of microphones in the array.
• We propose a new weighting function that largely sharpens the correlation peak, yielding finer AoA estimation resolution and a higher SNR.
The rest of the paper is organized as follows. In Section 2, we describe our microphone array layout design considerations in detail. Section 3 presents the algorithm design. Simulation results are presented in Section 4. Finally, Section 5 concludes this paper.

Design of Microphone Layout
The structure of our proposed array layout is shown in Figure 1. It contains two groups of microphone arrays with identical layouts. The distance between adjacent microphones within an array group is denoted by d_g, and the distance between the two groups by d_s. The microphones in each array group lie in the same plane. The two distances d_g and d_s should be properly configured to achieve the best performance.
The appropriate distance d_g is determined by the bandwidth of the input signals. More specifically, to avoid spatial aliasing, the following requirement should be satisfied [17]:

d_g ≤ λ_min / 2,

where λ_min is the minimum wavelength of the signal of interest. Considering that the human hearing range is 20 Hz∼20 kHz [18][19][20], the maximum distance between sensors should approximately be

d_max = λ_min / 2 = c / (2 f_m) ≈ 340 / (2 × 20,000) m = 0.0085 m,

where c is the speed of sound and f_m is the maximum frequency present in the sources. The above equation determines the upper bound for d_g.
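This upper bound can be computed directly; a minimal sketch in Python (the function name and defaults are illustrative, not part of the paper):

```python
def max_spacing(c=340.0, f_max=20_000.0):
    """Spatial-aliasing upper bound on the microphone spacing d_g:
    d_max = lambda_min / 2 = c / (2 * f_max), in metres."""
    return c / (2.0 * f_max)

# Full audible band (f_max = 20 kHz): about 8.5 mm.
print(max_spacing())
```

Restricting attention to a lower band of interest relaxes the bound proportionally (e.g., f_max = 2 kHz gives 8.5 cm).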
The lower bound for d_g is imposed by the beamforming algorithm. The accuracy of beamforming-based AoA estimation relies on the resolution of the Time-Difference-of-Arrival (TDoA) among different microphones. This implies that d_g should be as large as possible so as to maximize the TDoA; otherwise, a high sampling rate would be needed, which requires extra computational resources and memory bandwidth, making it unsuitable for resource-constrained IoT devices. In particular, the sample sequences observed by the two microphones with the maximum distance in the array should exhibit a time delay of no less than one sample. This implies that

d_min = K d_g ≥ c t_δ = c / f_s,max,

where t_δ = 1 / f_s,max denotes the duration of one sample, f_s,max denotes the maximum sampling rate, and K relates d_g to the largest inter-microphone distance in the array. For instance, for a circular array with six microphones working at 16 kHz, the largest inter-microphone distance must be at least 340/16,000 m ≈ 2.1 cm. This constraint can also be used to obtain the maximum operational sampling rate for a given array.
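The lower-bound constraint can likewise be checked numerically; a small sketch (function names are illustrative) relating array aperture and sampling rate:

```python
def min_aperture(f_s, c=340.0):
    """Smallest maximum inter-microphone distance (metres) so that the most
    distant microphone pair sees a TDoA of at least one sample at rate f_s."""
    return c / f_s

def max_sampling_rate(aperture, c=340.0):
    """Conversely, the highest sampling rate (Hz) a given aperture supports
    while still guaranteeing at least one sample of TDoA."""
    return c / aperture

# Example from the text: at 16 kHz the aperture must be at least ~2.1 cm.
print(min_aperture(16_000.0))
```

The two functions are inverses of each other, reflecting that the same constraint can be read either as a minimum aperture or as a maximum sampling rate.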
The distance d_s affects the source separation performance. In our design, we utilize the Degenerate Unmixing Estimation Technique (DUET), which only requires two microphones whose distance should be as small as possible so as to prevent phase wrapping. To this end, d_s should be minimized to its physical limit. In this case, the source separation algorithm can achieve optimal performance and preserve as much spatial information for each source as possible.

Robust Source Localization
Our robust source localization involves two steps: blind source separation and beamforming based localization.

Blind Source Separation
To separate the mixed signals, we utilize the Degenerate Unmixing Estimation Technique (DUET) [21]. DUET is able to blindly separate an arbitrary number of sources given just two anechoic mixtures, namely two microphones, provided that the time-frequency representations of the sources are disjoint [21,22], i.e., do not overlap, which holds in most cases [21,23]. The demixing process can thereby be viewed as a partitioning of the time-frequency plane.
In DUET, the signals received by the two microphones can be written in the time-frequency domain in a simple form:

x̂_1(τ, ω) ≈ ŝ_j(τ, ω),
x̂_2(τ, ω) ≈ a_j e^{-iωδ_j} ŝ_j(τ, ω),

where x̂_1(τ, ω) denotes the time-frequency representation of the first microphone signal, ŝ_j(τ, ω) is the original source signal without delay or attenuation, and a_j and δ_j are the relative attenuation and delay of source j between the two microphones. The subscript j, known as the active index, indicates which source dominates the current time-frequency point. The core principle behind DUET is that the ratio of the two time-frequency representations fully characterizes the mixing parameters:

x̂_2(τ, ω) / x̂_1(τ, ω) ≈ a_j e^{-iωδ_j},  (τ, ω) ∈ Ω_j,

where Ω_j := {(τ, ω) : ŝ_j(τ, ω) ≠ 0}. The mixing parameters, namely the local attenuation estimator a(τ, ω) and the local delay estimator δ(τ, ω), can be calculated from the active source component:

a(τ, ω) = | x̂_2(τ, ω) / x̂_1(τ, ω) |,  δ(τ, ω) = -(1/ω) ∠( x̂_2(τ, ω) / x̂_1(τ, ω) ).

These estimates label the currently active source. Therefore, in the following steps, we can demix the mixture via binary masks constructed from these mixing parameters. To accomplish this, an indicator function is built for each active index j:

M_j(τ, ω) = 1 if (a(τ, ω), δ(τ, ω)) is closest to (a_j, δ_j), and 0 otherwise.

At this step, we can demix the sources:

ŝ_j(τ, ω) = M_j(τ, ω) x̂_1(τ, ω).

To further enhance performance, clustering techniques are applied to the estimated attenuation and delay parameters: the number of clusters is the estimated number of sources, and the cluster centres are taken as the optimal estimates of the mixing parameters for each source. In the last step, we reconstruct the sources from their time-frequency representations by converting back into the time domain.
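The masking pipeline above can be sketched in a few lines of NumPy. This is a simplified single-window illustration, not the paper's implementation: the two bin-aligned sinusoidal sources, the frequency-domain anechoic mixing, and the pick-the-strongest-bins stand-in for clustering are all assumptions made so the example is self-contained.

```python
import numpy as np

def duet_demix(x1, x2, n_src=2, rel_thresh=1e-6):
    """DUET over a single analysis window: estimate the local delay from the
    spectral ratio x2/x1, 'cluster' by taking the delays of the strongest
    bins, build binary masks, and demix from the first mixture."""
    X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)
    omega = 2 * np.pi * np.arange(len(X1)) / len(x1)        # rad/sample
    sig = np.abs(X1) > rel_thresh * np.abs(X1).max()
    sig[0] = False                                          # skip DC (omega = 0)
    delay = np.zeros(len(X1))
    delay[sig] = -np.angle(X2[sig] / X1[sig]) / omega[sig]  # local delay estimator
    # Stand-in for clustering: delays at the n_src strongest significant bins.
    centers = delay[np.argsort(-np.abs(X1) * sig)[:n_src]]
    # Binary masking: each significant bin goes to the nearest delay center.
    assign = np.argmin(np.abs(delay[:, None] - centers[None, :]), axis=1)
    return [np.fft.irfft(np.where(sig & (assign == j), X1, 0), n=len(x1))
            for j in range(n_src)], centers

# Two sources, disjoint in frequency; mic 2 sees each one attenuated/delayed.
fs, N = 16_000, 1024
t = np.arange(N) / fs
s1, s2 = np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 3000 * t)
S1, S2 = np.fft.rfft(s1), np.fft.rfft(s2)
w = 2 * np.pi * np.arange(len(S1)) / N
x1 = s1 + s2                                           # mixture at mic 1
x2 = np.fft.irfft(1.0 * np.exp(-1j * w * 1.5) * S1     # source 1: delay +1.5 samples
                  + 0.6 * np.exp(+1j * w * 1.0) * S2,  # source 2: delay -1.0 samples
                  n=N)
(est1, est2), centers = duet_demix(x1, x2)
```

With disjoint sources, each mask selects exactly the bins of one source, so the demixed signals match the originals and the recovered delay centers match the mixing delays.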

Beamforming and Localization
After demixing the multiple sources, we utilize a beamforming method to estimate the angle of arrival of each source.

Beamforming Process
The delay-and-sum (hereafter, vanilla) beamformer [17,24] finds the incident angle by searching for the maximal energy over a spherical grid when using a circular array (hereafter, we assume a circular array is adopted). Supposing there are M microphones, the beamformer output can be defined as:

b(n) = Σ_{m=1}^{M} x_m(n − τ_m),

where x_m(n) is the signal from the m-th microphone and τ_m is the respective arrival-time delay. The energy of the beamformer output over a frame window of length N is thus formulated as:

E(τ) = Σ_{n=1}^{N} b²(n),

which can be expanded as:

E(τ) = Σ_{m=1}^{M} Σ_{n=1}^{N} x_m²(n − τ_m) + Σ_{m_1 ≠ m_2} R_{m_1 m_2}(τ_{m_1} − τ_{m_2}),

where the term Σ_n x_m²(n − τ_m) can be regarded as a constant since τ is relatively small, and R_{m_1 m_2} denotes the cross-correlation between microphones m_1 and m_2. Since τ can be parameterized by the incident angle θ, the problem of AoA estimation via beamforming can thus be formulated as:

φ = argmax_θ E(τ | θ),

where φ is the estimated optimal incident angle. This optimization problem is often solved by an exhaustive search, which consumes excessive computational power. To reduce the computation, the solution space is discretized and a lookup table between θ and τ is constructed so as to efficiently obtain E(τ | θ). The relation between θ and τ is calculated under the far-field assumption:

τ_{ij} = f_s (d_i − d_j) · u / c,

where d_i and d_j are the positions of the i-th and j-th microphones, respectively, u is the unit vector indicating the direction of the point source, f_s is the sampling rate, and c is the speed of sound in air.
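The search described above can be illustrated with a small NumPy sketch. Everything here (the 6-microphone circular geometry, white-noise source, 2° grid) is an illustrative assumption, and steering is done in the frequency domain rather than via the lookup table, for brevity:

```python
import numpy as np

def circular_array(n_mics=6, radius=0.05):
    """Microphone (x, y) positions on a circle, in metres."""
    ang = 2 * np.pi * np.arange(n_mics) / n_mics
    return np.stack([radius * np.cos(ang), radius * np.sin(ang)], axis=1)

def steering_delays(pos, theta, c=340.0):
    """Far-field arrival delays tau_m (seconds) for azimuth theta (radians):
    a microphone further along the source direction hears the wave earlier."""
    u = np.array([np.cos(theta), np.sin(theta)])
    return -(pos @ u) / c

def das_scan(x, pos, fs, c=340.0, grid_deg=np.arange(0, 360, 2)):
    """Delay-and-sum scan: steer to each candidate angle, pick max energy."""
    X = np.fft.rfft(x, axis=1)                              # (M, F)
    omega = 2 * np.pi * np.fft.rfftfreq(x.shape[1], d=1 / fs)
    energy = []
    for deg in grid_deg:
        tau = steering_delays(pos, np.deg2rad(deg), c)
        b = np.sum(X * np.exp(1j * omega[None, :] * tau[:, None]), axis=0)
        energy.append(np.sum(np.abs(b) ** 2))               # beamformer energy
    return int(grid_deg[int(np.argmax(energy))])

# Simulate a far-field broadband source at 40 degrees and locate it.
rng = np.random.default_rng(0)
fs, N = 48_000, 4096
s = rng.standard_normal(N)
S, pos = np.fft.rfft(s), circular_array()
omega = 2 * np.pi * np.fft.rfftfreq(N, d=1 / fs)
tau = steering_delays(pos, np.deg2rad(40.0))
x = np.fft.irfft(S[None, :] * np.exp(-1j * omega[None, :] * tau[:, None]),
                 n=N, axis=1)
print(das_scan(x, pos, fs))
```

At the true angle all channels add coherently, so the energy peak lands on the source azimuth; off-grid angles dephase across the broadband spectrum and sum to less energy.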

Problem of Vanilla Beamforming Algorithm
The performance of the vanilla beamforming algorithm largely depends on cross-correlation. If an incident source has good compression properties [25], i.e., a sharp correlation peak, the algorithm obtains a narrow beam width and hence better estimation resolution and a higher SNR. In most cases, however, the sources do not have such properties, making the beam rather wide and thus susceptible to background interference.

Spectral Weighting Function
Spectral weighting is a common approach to sharpening the beam, among which GCC-PHAT [26] is the most popular. However, in GCC-PHAT each frequency bin of the spectrum contributes equally to the final correlation, making it sensitive to interference.
To balance the contribution of each frequency bin to the correlation result, we propose a new weighting function:

ψ(ω) = G(ω) / max{ |X(ω)|, α X_σ(ω) }^β,

where X(ω) is the spectrum of the received signal, X_σ(ω) is the mean spectral power of the noise, estimated in the absence of source signals, α ≤ 1 is a coefficient quantifying how conservative the estimated noise power is (default 0.9), and β is normally set to 0.4. G(ω) is the Wiener function of the prior SNR ξ, given by:

G(ω) = ξ_n / (1 + ξ_n),

where ξ_n = E{[X_n(ω)]²} / E{[X_n^σ(ω)]²}, and the superscript n refers to the n-th time frame. The numerator E{[X_n(ω)]²} in ξ_n can be estimated using the decision-directed approach [27]:

Ê{[X_n(ω)]²} = γ [G_{n−1}(ω) X_{n−1}(ω)]² + (1 − γ) [X_n(ω)]²,

where γ is a constant, normally set to 0.9.
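A per-frame computation of such a weighting can be sketched as follows. Since the closed form is reconstructed from the parameter descriptions above, the exact expression in `spectral_weight` (a Wiener gain over a noise-floored, partially whitened magnitude) should be read as an assumption rather than the paper's definitive formula:

```python
import numpy as np

def wiener_gain(xi):
    """Wiener function of the prior SNR xi: G = xi / (1 + xi)."""
    return xi / (1.0 + xi)

def prior_snr_dd(X_n, X_prev, G_prev, noise_power, gamma=0.9):
    """Decision-directed estimate of the prior SNR xi_n: blend the previous
    frame's filtered signal power with the current frame's power."""
    num = gamma * np.abs(G_prev * X_prev) ** 2 + (1.0 - gamma) * np.abs(X_n) ** 2
    return num / noise_power

def spectral_weight(X_n, xi, noise_power, alpha=0.9, beta=0.4):
    """Assumed form of the proposed weighting: Wiener gain divided by a
    noise-floored, partially whitened (beta < 1) magnitude spectrum."""
    floor = np.maximum(np.abs(X_n), alpha * np.sqrt(noise_power))
    return wiener_gain(xi) / floor ** beta
```

Compared with GCC-PHAT's full whitening (equal weight for every bin), β = 0.4 only partially flattens the spectrum, and the Wiener gain G(ω) down-weights bins whose prior SNR is low, which is what makes the weighting interference-resilient.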

Simulation
We conduct extensive simulations to evaluate the performance of our proposed algorithm.

Setup
We evaluate the performance in a typical indoor room of size 10 × 10 × 3 m³ and simulate multipath signals via a ray-tracing technique, as shown in Figure 2a. Figure 2a depicts a simple setup for demonstration, where fourteen types of real-recorded footstep sounds are emitted from different locations as source signals, and our microphone array is placed at the center of the room. The signal durations are between 20 and 50 ms and the signals are triggered simultaneously. The microphone array contains two cascaded circular sub-arrays, as shown in Figure 1. The microphone distance within each group is configured as 4.5 cm and within each pair as 1 cm. Therefore, our array has a total of twelve microphones.
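For reference, the geometry of this setup can be generated programmatically; a short sketch (NumPy, function name illustrative) that builds the twelve microphone coordinates from d_g = 4.5 cm and d_s = 1 cm:

```python
import numpy as np

def two_layer_circular_array(n_per_ring=6, d_g=0.045, d_s=0.01):
    """Coordinates (metres) of two stacked circular sub-arrays as in Figure 1:
    d_g is the spacing between adjacent microphones in a ring (chord length),
    d_s the vertical spacing between the rings (one separation pair per column)."""
    # Chord length between ring neighbours is 2 * r * sin(pi / n); solve for r.
    radius = d_g / (2.0 * np.sin(np.pi / n_per_ring))
    ang = 2 * np.pi * np.arange(n_per_ring) / n_per_ring
    ring = np.stack([radius * np.cos(ang), radius * np.sin(ang)], axis=1)
    lower = np.hstack([ring, np.zeros((n_per_ring, 1))])
    upper = np.hstack([ring, np.full((n_per_ring, 1), d_s)])
    return np.vstack([lower, upper])          # shape (2 * n_per_ring, 3)

mics = two_layer_circular_array()
print(mics.shape)                             # (12, 3)
```

For six microphones per ring the ring radius equals d_g, since the chord of a regular hexagon equals its circumradius.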

Results
We first demonstrate the performance of each component of SPBF. To start with, we show that DUET can decompose multiple sound sources with only two microphones; the results are shown in Figure 2b and reveal that the demixed signal is highly consistent with the original one. We then inspect the impact of our proposed weighting function on the beamforming algorithm. In this experiment, we utilize only one sound source (footstep sounds). For comparison, we also implement GCC-PHAT [28] along with several other existing methods, including ROTH [29] and the Smoothed Coherence Transform (SCOT) [30]; the results are shown in Figure 2c. It is observable that our algorithm can not only locate the ground-truth AoA but also exhibits a rather narrow beam pattern, making it noise-resilient. In comparison, GCC-PHAT, ROTH, and SCOT have many noisy peaks, making it infeasible to correctly locate even a single sound source. The vanilla beamforming algorithm can identify the correct AoA but has a rather wide beam pattern, making it vulnerable to interference. These results clearly demonstrate the effectiveness of our weighting function.
We next compare the beamforming results of common algorithms, including GCC-PHAT, ROTH, and SCOT, in Figure 3a with our SPBF in Figure 3b. It is observable that the AoA spectrum of SPBF is sharper and has fewer side lobes than those of the other algorithms, indicating the effectiveness of our proposed algorithm. We have also extensively evaluated the localization performance of vanilla beamforming, GCC-PHAT, SCOT, ROTH, and SPBF. The results presented in Figure 3c demonstrate that the 80th-percentile error of SPBF is at least a 5.5× improvement over the other algorithms, clearly demonstrating the superior performance of our proposed algorithm. In the following experiment, we explore the impact of the number of samples on the localization performance; the results are shown in Figure 4a. It is observable that the number of samples, or equivalently the duration of the signals, has negligible impact on the localization performance. Since our algorithm works effectively with as few as 500 samples (equivalently about 10 ms at a sampling rate of 48 kHz), it is robust to the Doppler effect. We then conduct experiments where the number of sources is larger than the number of microphones; the results are shown in Figure 4b and reveal that the number of sources does not significantly affect the beamforming results. SPBF achieves a median accuracy of 2.5° and an 80th percentile of 1.7°; the maximum error does not exceed 5°. We also conduct experiments using a linear array. The results in Figure 4c reveal that the circular array performs slightly better than the linear array owing to its larger number of microphones.
We explore the impact of the multipath effect by operating the source localization algorithms in rooms of different sizes; the results are shown in Figure 5a. It is observable that multipath can affect source localization performance: the localization errors increase as the room size becomes smaller, i.e., as the multipath effect becomes more severe. We next compare the results when multiple sources transmit non-overlapped (non-concurrently) and overlapped (concurrently). The results shown in Figure 5b reveal that when the sources overlap, there is no obvious performance drop, indicating the robustness of our algorithm. We finally perform localization using different types of source signals, including footsteps, speech, and whistles; the results are shown in Figure 5c. It can be observed that the type of source input has a large impact on the final performance: the localization performance for footsteps and whistles is significantly better than that for speech. This is because footsteps and whistles have much sharper auto-correlation peaks than speech, which leads to much better localization performance.

Runtime Performance
We have explored the time cost of each module of our proposed algorithm and report the results in Table 1. The results were obtained on an Intel(R) Core(TM) i7-6700HQ CPU @ 2.60 GHz with 8 GB RAM; the code is implemented in C. The time overhead of each module in Table 1 is the average over 100 trials. The source separation module, which consumes 262.59 ms, dominates the overhead of the proposed algorithm; the time costs of spectral weighting and beamforming are 11.68 ms and 95.76 ms, respectively. The total time overhead is therefore about 370 ms, which is a satisfactory runtime performance. We believe that with more advanced code optimization techniques or hardware such as GPUs, the runtime performance can be further improved.

Conclusions
In this paper, we propose to first separate mixed sources and then utilize beamforming with the proposed spectral weighting function to locate multiple sources. In particular, we design a new microphone layout that enables locating more sources than the number of microphones in an array. Simulation results demonstrate that the proposed algorithm achieves significantly better performance than existing solutions.
Author Contributions: H.P. and C.C. carried out the implementation. M.H. and T.D. completed the paper draft. R.Z. and J.L. helped revise the paper. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Institutional Review Board Statement:
The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Ethics Committee of Huazhong University of Science and Technology (protocol code 44162684-2).
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: Data available on request due to privacy restrictions.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: