Multiple Sound Sources Localization with Frame-by-Frame Component Removal of Statistically Dominant Source

Multiple sound sources localization is a hot topic in audio signal processing and is widely utilized in many application areas. This paper proposed a multiple sound sources localization method based on a statistically dominant source component removal (SDSCR) algorithm by soundfield microphone. The existence of the statistically weak source (SWS) among soundfield microphone signals is validated by statistical analysis. The SDSCR algorithm with joint an intra-frame and inter-frame statistically dominant source (SDS) discriminations is designed to remove the component of SDS while reserve the SWS component. The degradation of localization accuracy caused by the existence of the SWS is resolved using the SDSCR algorithm. The objective evaluation of the proposed method is conducted in simulated and real environments. The results show that the proposed method achieves a better performance compared with the conventional SSZ-based method both in sources localization and counting.


Introduction
Multiple sound sources localization is an important topic in audio signal processing and has received a lot of attention in over recent decades [1]. An accurate estimation for direction of arrival (DOA) of a sound source is a vital issue in many applications, such as teleconferences, human-machine interaction, hearing aids [2][3][4] and so on. More and more DOA estimation methods have been developed for exhibiting the ability to provide robustness in adverse conditions such as background noise, interfering sources, and reverberation effects [5][6][7]. Moreover, the information obtained by sound source localization could be applied in spatial sound reproduction [8,9], acoustic analysis of enclosed spaces [10] and spatial audio coding [11,12].
In the early years, the research of DOA estimation mainly focused on the Time Difference of Arrival (TDOA) measurements. TDOAs are usually estimated through peak-picking on the Generalized Cross Correlation of signals acquired at microphone pairs, or on the whole set of microphones [13] and the DOA of sound sources are obtained via the mapping from TDOAs to directions. A series of methods for improving the TDOA estimation have been proposed in References [14,15]. However, most of these TDOA-based localization methods are used at a cost of employing excessive microphones to improve the reliability of TDOA estimation in poorly acoustic conditions, whereas a limited number of microphones are available in practical scenarios [16]. Table 1. Summary of the sound source localization method.

Method
Reference Comment TDOA [13][14][15] Employing excessive microphones to improve the reliability MUSIC, ESPRIT [17][18][19] Microphones number more than sources number SRP [20] High computational complexity ICA [16,[21][22][23] Employing directional sparsity of sound sources SCA [24][25][26] Employing W-DO property CMA-based [27] Need excessive microphones for multi-sources localization SSZ-based [28] Instability for multi-sources localization The key contributions of this paper can be summarized as follows. The existence of the SWS in the different number of sources simultaneously occurring scenario is investigated and validated. For dealing with the phenomenon that the SWS cannot be detected by the conventional SSZ-based method, the multiple sound sources localization method based on the SDSCR algorithm is proposed here. Compared with the SSZ-based method, the proposed method achieves robust performance of multiple sound sources localization via the SDSCR algorithm.
The remainder of the paper is organized as follows: Section 2 states problem in the SSZ-based method and investigates the existence of the SWS among soundfield microphones signals. Section 3 introduces the proposed SDSCR algorithm for multiple sound sources localization. Experimental results are presented in Section 4, while conclusions are drawn in Section 5.

Problem in the SSZ-Based Method
In general, the SSZ-based method which associates DOAs to each source by means of a clustering consists of three steps. The first step is searching SSZ by using the signals correlation between different channels of soundfield microphone. In the second step, a histogram of DOA estimates calculated from all detected SSZs is obtained by statistical analysis. Finally, azimuths of all sound sources are estimated after a series of post-processing. The SSZ-based method generally exhibits a good performance for multiple sound sources localization, but as the sound sources number increases, the SSZ-based method exposes a problem that affect localization accuracy, especially when the number of sound sources is greater than six. For instance, we conducted an experiment to illustrate this issue. In this test, a soundfield microphone is used for recording seven sound sources respectively located in 35 • , 75 • , 125 • , 170 • , 210 • , 260 • , and 325 • in an anechoic room. The normalized amplitude DOA estimation histogram obtained by the SSZ-based method is shown in Figure 1. The green stem is the histogram of DOA estimates that employs T-F bins within all detected SSZs and the red curve is the estimated envelope of histogram proceeded by kernel density estimation (KDE). From Figure 1, it can be found that there are only five obvious peaks (around 35 • , 125 • , 170 • , 210 • and 260 • ) in the KDE curve that can be found, while the other two peaks around 75 • and 325 • cannot be detected by means of peak searching processing. A sound source with more associated SSZs (or DOA estimates) has an obviously higher normalized amplitude at the corresponding peak in a DOA histogram, meaning that the azimuth of this sound source has a higher probability of occurrence in all obtained DOA estimates. On the contrary, some sources with less associated SSZs have obviously lower normalized amplitude and become more difficult to search for in a DOA histogram. For Figure 1, the sound source located in 75 • and 325 • cannot be detected in the obtained histogram due to only a small amount of SSZs being associated with the two sound sources compared with other five sound sources. Additionally, from an informal experiment, we find that if the number of SSZs are associated with one sound source less than 4% of the total number of SSZs, the peak corresponding to these sources cannot be detected in the DOA estimation histogram. Generally, in this work, a sound source with less associated DOA estimates in the histogram is referred as to a statistically weak source (SWS), while the other sources are referred as statistically dominant sources (SDS). It can be clearly seen that the performance of the SSZ-based method in sources counting is degraded due to the missed detection of SWSs. Additionally, we consider a result of a sound source with missed detection as a large deviation between the estimated DOA and the true DOA. It means that the missed detection of SWSs also degrades the performance of the SSZ-based method in DOA estimation. Therefore, the problem SSZ-based method faced is the localization accuracy decreasing when the SWS exists, meaning that it is difficult to detect the SWSs among soundfield microphone signals using the SSZ-based method. To solve this problem, we consider retaining the SSZs that are associated with the SWS while filtering out the SSZs that are associated with other sound sources. More details about the implementation process of the presented solution are described in Section 3. Then we conduct a statistical analysis for the existence of the SWS among signals recorded by soundfield microphone and statistical analysis results are given in Section 2.2. normalized amplitude and become more difficult to search for in a DOA histogram. For Figure 1, the sound source located in 75° and 325°cannot be detected in the obtained histogram due to only a small amount of SSZs being associated with the two sound sources compared with other five sound sources. Additionally, from an informal experiment, we find that if the number of SSZs are associated with one sound source less than 4% of the total number of SSZs, the peak corresponding to these sources cannot be detected in the DOA estimation histogram. Generally, in this work, a sound source with less associated DOA estimates in the histogram is referred as to a statistically weak source (SWS), while the other sources are referred as statistically dominant sources (SDS). It can be clearly seen that the performance of the SSZ-based method in sources counting is degraded due to the missed detection of SWSs. Additionally, we consider a result of a sound source with missed detection as a large deviation between the estimated DOA and the true DOA. It means that the missed detection of SWSs also degrades the performance of the SSZ-based method in DOA estimation. Therefore, the problem SSZ-based method faced is the localization accuracy decreasing when the SWS exists, meaning that it is difficult to detect the SWSs among soundfield microphone signals using the SSZbased method. To solve this problem, we consider retaining the SSZs that are associated with the SWS while filtering out the SSZs that are associated with other sound sources. More details about the implementation process of the presented solution are described in Section 3. Then we conduct a statistical analysis for the existence of the SWS among signals recorded by soundfield microphone and statistical analysis results are given in Section 2.2.

Exploring the Swss among Microphone Signals
The problem caused by the SWSs has been pointed out above. We then aim to investigate the existence of the SWS in the different number of sources with simultaneously occurring scenarios. For examining the existence of the SWS, some statistical analysis were taken in an anechoic room and Room 1 (250 ms reverberation time room), respectively. The speech signals from the NTT database were chosen as sound sources for analysis. The recording was simulated by Roomsim [29]. The angle between sources were set as γ = {30°, 50°} and γ = 50° in the anechoic room and Room 1, respectively. Meanwhile, the sources number was ranging in {5, 6, 7} and {3, 4, 5} in the anechoic room and Room 1, respectively. More specifically, for each sources number and angle, we selected 100 segments of data and were satisfied that all sound sources simultaneously occurring in the time domain and at least half of the sound sources are detected accurately by the SSZ-based method to make the statistical analysis. Moreover, a sound source was considered as the SWS when the sound source was lost in detection or the localization error was more than 5 degrees. We collect the percentage of the SWS

Exploring the Swss among Microphone Signals
The problem caused by the SWSs has been pointed out above. We then aim to investigate the existence of the SWS in the different number of sources with simultaneously occurring scenarios. For examining the existence of the SWS, some statistical analysis were taken in an anechoic room and Room 1 (250 ms reverberation time room), respectively. The speech signals from the NTT database were chosen as sound sources for analysis. The recording was simulated by Roomsim [29]. The angle between sources were set as γ = {30 • , 50 • } and γ = 50 • in the anechoic room and Room 1, respectively. Meanwhile, the sources number was ranging in {5, 6, 7} and {3, 4, 5} in the anechoic room and Room 1, respectively. More specifically, for each sources number and angle, we selected 100 segments of data and were satisfied that all sound sources simultaneously occurring in the time domain and at least half of the sound sources are detected accurately by the SSZ-based method to make the statistical analysis.
Moreover, a sound source was considered as the SWS when the sound source was lost in detection or the localization error was more than 5 degrees. We collect the percentage of the SWS number over the total 100 valid statistics, and the results are shown in Figure 2a,b that represent the statistical results of the angle between sources are 30 • and 50 • in anechoic room, respectively.
From another perspective, as the angle between sources decreases, the number of SWS increases. For instance, from Figure 2a,b, the number of sound sources was set as seven, when the angle between sources was set as 50°, the probability of no SWS is 56% and when the angle between sources was set as 30°, the probability of no SWS drops to 27%. The probability of one SWS and two SWSs increases from 40% to 45% and from 3% to 18%, respectively, when the angle between sources drops from 50° to 30°.
In general, from the perspective of sound sources number, when the sound sources number is five, there have no SWS with a high probability. Nevertheless, the probability of occurrence of the SWS is obviously increasing when the sound sources number is six. The existence probability of one SWS is higher than that of no SWS when the sound sources number is seven, which means that the SWS almost always exist.  In order to investigate the existence of the SWS in different scene, we collect the percentage of SWSs number with different sources number in the Room 1 (250 ms reverberation time), the angle between sources is 50° in this experiment. The results are shown in Figure 3. From Figure 3, it is clear that as the number of sound sources increase from three to five, the existence probability of the SWS increases. We can find similar conclusions as the experiment results in the anechoic condition.
From the above experimental results, it can be summarized that the number of SWSs varies with the number of sound sources and the angle between sources. The SWS is ubiquitous, however, the existing SWS affects the accuracy of multiple sound sources localization method which exploit all SSZs for DOA estimation. For addressing the problem introduced by the SWS, we proposed the From Figure 2, a visualized conclusion can be drawn that the existence probability of the SWS has a positive correlation and a negative correlation with the number of sound sources and the angle between sources, respectively. More specifically, as sound sources number increases, the number of SWSs increases (i.e., more sources cannot be detected). For example, from Figure 2a, the angle between sources was set as 50 • , the probability of no SWS (i.e., the number of SWS is zero, all sources can be detected) is nearly 90% when the sound sources number was set as five and the probability of no SWS drops to 70% and 59%, when the sound sources number was six and seven, respectively. On the contrary, the probability of one SWS and two SWSs rises with the increase of sound sources number, which means that as the number of sound sources increases, the existence probability of the SWS rises (i.e., the number of SWSs increase).
From another perspective, as the angle between sources decreases, the number of SWS increases. For instance, from Figure 2a,b, the number of sound sources was set as seven, when the angle between sources was set as 50 • , the probability of no SWS is 56% and when the angle between sources was set as 30 • , the probability of no SWS drops to 27%. The probability of one SWS and two SWSs increases from 40% to 45% and from 3% to 18%, respectively, when the angle between sources drops from 50 • to 30 • .
In general, from the perspective of sound sources number, when the sound sources number is five, there have no SWS with a high probability. Nevertheless, the probability of occurrence of the SWS is obviously increasing when the sound sources number is six. The existence probability of one SWS is higher than that of no SWS when the sound sources number is seven, which means that the SWS almost always exist.
In order to investigate the existence of the SWS in different scene, we collect the percentage of SWSs number with different sources number in the Room 1 (250 ms reverberation time), the angle between sources is 50 • in this experiment. The results are shown in Figure 3. From Figure 3, it is clear that as the number of sound sources increase from three to five, the existence probability of the SWS increases. We can find similar conclusions as the experiment results in the anechoic condition. SDSCR algorithm with joint the intra-frame and inter-frame SDS discrimination. More details about the SDSCR algorithm will be described in the next section.

Proposed Method
From the conclusion of Section 2, it can be obtained that the SWS is ubiquitous in the scenario where multiple sound sources occur simultaneously. This phenomenon is particularly evident for the signals recorded by soundfield microphone. In this section, we present the SDSCR algorithm for multiple sound sources localization using a soundfield microphone, for solving the problem caused by the SWS existing in the detected SSZs. The following process is performed on a frame-by-frame basis. The system block diagram of the proposed scheme is shown in Figure 4. Input A-format signals are transformed into the T-F domain using short-time Fourier transform (STFT) for single-source zone detecting, and B-format signals are obtained from A-format signals by (1). If single-source zones exist in the current frame, the B-format T-F bins of all SSZs need to be extracted for DOA estimation. Thereafter, we present the SDSCR algorithm with joint intra-frame and inter-frame SDS discrimination to remove the partial DOA estimates that are associated with the SDS. The histogram of selected DOA estimates is obtained by statistical analysis. Finally, the number of sources and their DOA estimates are as output after the post-processing process. More details of these processes will be described below.

Review of DOA Estimation Based on the SSZ Detecting
The investigation of Reference [28] has indicated that for recordings of sound scenes where multiple sound sources occur simultaneously, there must be several T-F zones where only one source From the above experimental results, it can be summarized that the number of SWSs varies with the number of sound sources and the angle between sources. The SWS is ubiquitous, however, the existing SWS affects the accuracy of multiple sound sources localization method which exploit all SSZs for DOA estimation. For addressing the problem introduced by the SWS, we proposed the SDSCR algorithm with joint the intra-frame and inter-frame SDS discrimination. More details about the SDSCR algorithm will be described in the next section.

Proposed Method
From the conclusion of Section 2, it can be obtained that the SWS is ubiquitous in the scenario where multiple sound sources occur simultaneously. This phenomenon is particularly evident for the signals recorded by soundfield microphone. In this section, we present the SDSCR algorithm for multiple sound sources localization using a soundfield microphone, for solving the problem caused by the SWS existing in the detected SSZs. The following process is performed on a frame-by-frame basis. The system block diagram of the proposed scheme is shown in Figure 4. Input A-format signals are transformed into the T-F domain using short-time Fourier transform (STFT) for single-source zone detecting, and B-format signals are obtained from A-format signals by (1). If single-source zones exist in the current frame, the B-format T-F bins of all SSZs need to be extracted for DOA estimation. Thereafter, we present the SDSCR algorithm with joint intra-frame and inter-frame SDS discrimination to remove the partial DOA estimates that are associated with the SDS. The histogram of selected DOA estimates is obtained by statistical analysis. Finally, the number of sources and their DOA estimates are as output after the post-processing process. More details of these processes will be described below. SDSCR algorithm with joint the intra-frame and inter-frame SDS discrimination. More details about the SDSCR algorithm will be described in the next section.

Proposed Method
From the conclusion of Section 2, it can be obtained that the SWS is ubiquitous in the scenario where multiple sound sources occur simultaneously. This phenomenon is particularly evident for the signals recorded by soundfield microphone. In this section, we present the SDSCR algorithm for multiple sound sources localization using a soundfield microphone, for solving the problem caused by the SWS existing in the detected SSZs. The following process is performed on a frame-by-frame basis. The system block diagram of the proposed scheme is shown in Figure 4. Input A-format signals are transformed into the T-F domain using short-time Fourier transform (STFT) for single-source zone detecting, and B-format signals are obtained from A-format signals by (1). If single-source zones exist in the current frame, the B-format T-F bins of all SSZs need to be extracted for DOA estimation. Thereafter, we present the SDSCR algorithm with joint intra-frame and inter-frame SDS discrimination to remove the partial DOA estimates that are associated with the SDS. The histogram of selected DOA estimates is obtained by statistical analysis. Finally, the number of sources and their DOA estimates are as output after the post-processing process. More details of these processes will be described below.

Review of DOA Estimation Based on the SSZ Detecting
The investigation of Reference [28] has indicated that for recordings of sound scenes where multiple sound sources occur simultaneously, there must be several T-F zones where only one source

Review of DOA Estimation Based on the SSZ Detecting
The investigation of Reference [28] has indicated that for recordings of sound scenes where multiple sound sources occur simultaneously, there must be several T-F zones where only one source is active. These T-F zones are referred to as single-source zones. Therefore, a highly accurate DOA estimation and sources counting approach for multiple sound sources localization by conducting the DOA estimation in these SSZs has been proposed in Reference [28]. A soundfield microphone consists of four co-located microphones placed at the four non-adjacent corners of a cube [30] which are referred to as Front Left Up (FLU), Front Right Down (FRD), Back Left Down (BLD), and Back Right Up (BRU) microphones, respectively. The raw signals recorded by soundfield microphone are called A-format signal i.e., {S FLU (n, k), S FRD (n, k), S BLD (n, k), S BRU (n, k)}, where n and k represent the frame number and the frequency index, respectively. The A-format signals are transformed to B-format which consists of one omnidirectional (W) and three figure-of-eight directional (X, Y, Z) channels, i.e., {S W (n, k), S X (n, k), S Y (n, k), S Z (n, k)} [31,32] as follows: A speech source is known to be sparse in the short-term T-F domain [26]. Therefore, the detection of SSZs existing among the soundfield microphone signals is implemented in the T-F domain. Due to the directional characteristics of the soundfield microphone, when there is only one active source within some T-F zones, the raw signals recorded in each channel will have a strong correlation. In contrast, for zones where multiple active sources occur simultaneously, the correlation between the recorded signals will be weaker. Therefore, a generalized multichannel cross-correlation coefficient [33,34] (G-MCCC) is defined to measures the correlation of recorded A-format signal among the four channels.
For obtaining the G-MCCC in a T-F zone, in detail, the full frequency band consists of K STFT bins for a frame is divided into C zones L c . L c is a set of adjacent T-F points corresponding to a sub-band of STFT coefficients (c ∈ [1, C] and for simplicity L c is represented by L in this section). In addition, we take {S 1 (n, k), S 2 (n, k), S 3 (n, k), S 4 (n, k)} to represent the A-format recorded signals {S FLU (n, k), S FRD (n, k), S BLD (n, k), S BRU (n, k)} for simplicity.
More specifically, for any pair of soundfield microphone recorded signals S i (n, k) and S j (n, k), a multichannel cross-correlation coefficient is defined as: Then the G-MCCC can be obtained by A necessary and sufficient condition for a zone L to be a single-source zone is where i ∈ {1, 2, 3, 4}, j = i + 1 mod 4 and mod is a function for remainder operation. We search for single-source zones that satisfy the following inequality: where ε is a sufficiently small threshold for a user to define and the value of ε is generally less than 0.1. The DOA estimatesμ(n, k) of each T-F bin in a single-source zone L s can be calculated by processing the B-format signals (obtained from the four channel signals): where Re{ · } denotes taking the real part of the argument and * denotes conjugation. Both the SWS and SDS components are present in DOA estimates. The purpose of the SDSCR algorithm is to remove the component of some SDSs so that the SWS components can be found using histogram statistics. The SDS discrimination within the SDSCR algorithm consists of two steps, one is intra-frame SDS discrimination, the other is inter-frame SDS discrimination. More details about the SDSCR algorithm are presented in the following Sections.

SDSCR Algorithm
Based on the investigation in Section 2, the conclusion can be drawn that the SWS, which is ubiquitous in the multiple sound sources occurring simultaneously scenario, is difficult to detect by the conventional SSZ-based DOA estimation method. Therefore, for dealing with the degradation of localization accuracy caused by the existence of the SWS, we propose an SDSCR algorithm to remove the component of SDSs while preserves the SWSs component in each frame.
The SDSCR algorithm with joint the intra-frame and inter-frame SDS discrimination is proposed in this section as described by block diagram of Figure 5. For the current frame, the DOA estimates of each T-F bin in all detected SSZ (i.e.,μ(n, k)) are chosen as input information for SDSCR algorithm. Then the selected DOA estimates (defined asμ (I) (n, k)) of current frame are obtained by filtering out someμ(n, k) locating in the removal range in removal processing. Finally, the selected DOA estimates are used for obtaining the histogram statistics to estimate the azimuth of each source. In addition, a source azimuth coarse estimation of current frame (i.e., d n ) is estimated for guiding later frames to calculate the removal range by usingμ (I) (n, k). remove the component of some SDSs so that the SWS components can be found using histogram statistics. The SDS discrimination within the SDSCR algorithm consists of two steps, one is intraframe SDS discrimination, the other is inter-frame SDS discrimination. More details about the SDSCR algorithm are presented in the following Sections.

SDSCR Algorithm
Based on the investigation in Section 2, the conclusion can be drawn that the SWS, which is ubiquitous in the multiple sound sources occurring simultaneously scenario, is difficult to detect by the conventional SSZ-based DOA estimation method. Therefore, for dealing with the degradation of localization accuracy caused by the existence of the SWS, we propose an SDSCR algorithm to remove the component of SDSs while preserves the SWSs component in each frame.
The SDSCR algorithm with joint the intra-frame and inter-frame SDS discrimination is proposed in this section as described by block diagram of Figure 5. For the current frame, the DOA estimates of each T-F bin in all detected SSZ (i.e., ̂ , ) are chosen as input information for SDSCR algorithm. Then the selected DOA estimates (defined as ̂ , ) of current frame are obtained by filtering out some ̂ , locating in the removal range in removal processing. Finally, the selected DOA estimates are used for obtaining the histogram statistics to estimate the azimuth of each source. In addition, a source azimuth coarse estimation of current frame (i.e., ) is estimated for guiding later frames to calculate the removal range by using ̂ , . The calculation process of removal range shown in the red dashed box of Figure 5 is realized with the historical source azimuth coarse estimation information (i.e., the source azimuth coarse estimation of look-ahead frames). More specifically, the intra-frame SDS discriminator is used for obtaining the azimuth where the intra-frame SDS is considered as locating. Then, the intra-frame SDS is further discrimination in the inter-frame SDS discriminator to preserve the reasonable azimuth of SDS while eliminating the azimuth that is not the SDS locating. Finally, the removal range of current The calculation process of removal range shown in the red dashed box of Figure 5 is realized with the historical source azimuth coarse estimation information (i.e., the source azimuth coarse estimation of look-ahead frames). More specifically, the intra-frame SDS discriminator is used for obtaining the azimuth where the intra-frame SDS is considered as locating. Then, the intra-frame SDS is further discrimination in the inter-frame SDS discriminator to preserve the reasonable azimuth of SDS while eliminating the azimuth that is not the SDS locating. Finally, the removal range of current frame is calculated by the discriminated SDS.
For the removal range calculation, the two core operations are intra-frame SDS discrimination and intra-frame SDS discrimination, respectively. In general, the purpose of the intra-frame SDS discriminator is searching the intra-frame SDS in each frame. More specifically, azimuth information of a sound source can be calculated for each SSZ, while there are multiple SSZs in a frame with a high probability. In a frame, different sound sources are detected with different times which means that the probability of occurrence of each detected sound source is different in the frame. Therefore, the core idea of the intra-frame SDS discriminator is divided sound sources detected in the frame into the intra-frame SDS and SWS according to the probability of occurrence of each source. According to this phenomenon, the intra-frame SDS discrimination is designed (shown in algorithm1). The input of the intra-frame SDS discriminator are the historical source azimuth coarse estimation information . . , N I , is a vector including the coarse estimation of azimuth for each valid SSZ (i.e., at least half of DOA estimates are not removed in the SSZ) in the (n − i)th frame. N I is the number of look-back frames used to calculate the removal range, d n−i N n−i and N n−i are azimuth coarse estimate and the number of valid SSZs in the (n − i)th frame (i.e., the number of azimuth coarse estimates), respectively.
For obtaining a reasonable intra-frame SDS, it should be noted that there are two thresholds which are referred as to the minimum difference threshold ε a and the minimum distance threshold ε b need to be set. In details, the ε a and ε b are used to count the number of sound sources and intra-frame SDS, respectively. The output of the intra-frame SDS discriminator are where d n−i is a vector consists of intra-frame SDS azimuths in the (n-i)th frame and N n−i is the number of elements in d n−i . Moreover, d n−i are the input of the inter-frame SDS discriminator. The discriminated SDS by Algorithm 1 described above is the intra-frame SDS which is obtained without the historical information (DOA estimates in historical frames). More specifically, among all reserved DOA estimates, there may be only a few DOA estimates that are associated with an intra-frame SDS in the current frame (i.e., only a few DOA estimates are in the same direction as the SDS). These intra-frame SDSs are referred as pseudo-SDSs or local-SDSs and the azimuth of these intra-frame SDSs need to be excluded to form the removal range. Therefore, the inter-frame SDS discriminator is designed to filter out the pseudo-SDS, for avoiding that the azimuth of the pseudo-SDS is within the removal range. The basic idea of the inter-frame SDS discriminator is searching the pseudo-SDS by exploiting the number of DOA estimates that are associated with the intra-frame SDS. The inter-frame SDS discrimination proceeds are shown as Algorithm 2. Similar to the Algorithm 1, two thresholds the removal range threshold ∆µ and the minimum quantity threshold ε c need to be set in the inter-frame SDS discriminator. Specifically, if the DOA of the ith source is µ i , a range around µ i (i.e., [µ i − ∆µ , µ i + ∆µ]) needs to be determined such that the DOA estimates within this range are considered to be derived from source i and ε c is used to filter out the azimuth of pseudo-SDS.
The output of the inter-frame SDS discriminator are d n−1 , . . . , d n−N I .
does not locating in the removal range (i.e.,μ(n, k) / ∈ D), the selected DOA estimateμ (I) (n, k) is obtained as follow: On the contrary, ifμ(n, k) locates in the removal range,μ (I) (n, k) is defined as an illegal element. In addition, the source azimuth coarse estimation vector in the nth frame (d n ) is obtained bŷ µ (I) (n, k). More specifically, in the nth frame, for the searched single-source zone L s ,μ (I) (L s ) is a vector consist ofμ (I) (n, k) where (n, k) ∈ L s . The validity of eachμ (I) (L s ) is calculated and the invalidμ (I) (L s ) (i.e., more than half of theμ (I) (n, k) inμ (I) (L s ) are illegal elements) are filtered out. The azimuth coarse estimation is obtained by calculating the median of reservedμ (I) (L s ). For each valid single-source zone L s , we calculate the azimuth coarse estimation and formed a vector d n . where is the number of elements in (i.e., the number of estimated intra-frame SDS) 19. else, i.e.,  where is the number of elements in (i.e., the number of estimated intra-frame ε a is the minimum difference threshold 5. then m j = j, else Set m j to be an illegal element: m j = 0 6.      Figure 6, where the histogram of allμ(n, k) and µ (I) (n, k) are shown in Figure 6a,b, respectively. The soundfield microphone signals were recorded in an anechoic room, the width of single-source zone was 64, N I and ∆µ were 3 and 3 • respectively. The length of data was 1s. According to the results shown in Figure 6a,b, it can be clearly seen that the proposed method finds two sound sources that are missed by the SSZ-based methods, i.e., the SWSs can be detected by the proposed method. More specifically, from Figure 6a, there are only three obvious peaks which are around 35 • , 170 • and 210 • , while the normalized amplitude of peaks around 125 • and 260 • are obviously lower than other three peaks. It should be noted that the peaks which should appear around 75 • and 325 • are lost in the DOA histogram. This phenomenon means that the existing SWSs affect the accuracy of conventional SSZ-based method for multiple sound sources localization seriously. Contrary to the results shown in Figure 6a where two sources are lost, in Figure 6b, two peaks have already observed around 75 • and 325 • by exploiting the SDSCR algorithm. Meanwhile, the normalized amplitude of two obvious peaks around 35 • and 260 • are decreased comparing to the result shown in Figure 6a. It means that the multiple sound sources localization process will becomes more robust after post-processing. More details about the post-processing are described in Section 3.3.
sound sources localization seriously. Contrary to the results shown in Figure 6a where two sources are lost, in Figure 6b, two peaks have already observed around 75° and 325° by exploiting the SDSCR algorithm. Meanwhile, the normalized amplitude of two obvious peaks around 35° and 260° are decreased comparing to the result shown in Figure 6a. It means that the multiple sound sources localization process will becomes more robust after post-processing. More details about the postprocessing are described in Section 3.3.

Post-Processing
The post-processing consists of two steps which are kernel density estimation (KDE) and peak searching. For obtaining a smooth estimated envelope curve of DOA histogram, KDE is selected in this paper. KDE belongs to a non-parametric estimation method for probability density function estimation. The sooth estimated curve can be generated using a series of observation points (i.e., DOA estimates). It means that the probability density function of a series of observation points can be estimated by KDE. If we suppose , , … , represents independent and identically distributed sample which belongs to a certain distribution with a density function , the estimated shape of this density function can be obtained from its kernel density estimator, which is given by: where is kernel function and 0 is a smoothing parameter usually referred as the bandwidth. The input is all selected DOA estimates (i.e., ̂ , ) and the output is estimated a KDE curve by (8). The KDE curve of the DOA histogram shown in Figure 6b is represented by the red curve in Figure 7. It can be observed that each peak of DOA histogram is corresponding to a local maximum value in the KDE curve.
After obtaining the KDE curve, peak searching processing is exploited for the KDE curve to determine the estimated azimuth of sound sources. The core idea of peak searching is to search the local maximum value in the KDE curve of a DOA histogram. There are two important thresholds need to be set in the peak searching process, which are the peak detection threshold (PDT) and the minimum difference threshold . The is used to determine whether two adjacent local maximum values belong to the same sound source azimuth and the is used to determine whether a local maximum value can be considered as a peak. Meanwhile, it should be noted that each peak is regarded as a sound source. Therefore, the peak searching process completes the task of sound

Post-Processing
The post-processing consists of two steps which are kernel density estimation (KDE) and peak searching. For obtaining a smooth estimated envelope curve of DOA histogram, KDE is selected in this paper. KDE belongs to a non-parametric estimation method for probability density function estimation. The sooth estimated curve can be generated using a series of observation points (i.e., DOA estimates). It means that the probability density function of a series of observation points can be estimated by KDE. If we suppose y 1 , y 2 , . . . , y L K represents L K independent and identically distributed sample which belongs to a certain distribution with a density function f (·), the estimated shape of this density functionf (·) can be obtained from its kernel density estimator, which is given by: where K(·) is kernel function and h(h > 0) is a smoothing parameter usually referred as the bandwidth. The input is all selected DOA estimates (i.e.,μ (I) (n, k)) and the output is estimated a KDE curve by (8).
The KDE curve of the DOA histogram shown in Figure 6b is represented by the red curve in Figure 7. It can be observed that each peak of DOA histogram is corresponding to a local maximum value in the KDE curve.  Figure 7 where the data is the same as Figure 1. It can be clearly seen that there are seven accurate peaks around the DOAs of real sources, i.e., all the estimated DOAs are almost equal with the real ones. Moreover, compared with Figure 1, not only can the sound source azimuths which are lost in Figure 1 be found in the DOA histogram obtained through the proposed SDSCR algorithm, but also the localization accuracy of other sound source azimuths is improved. Additionally, a statistical analysis for the SWS was taken for verifying the validity of proposed method. We exploited the same data which were used for the SWS counting in Figure 2b (angle between sources was set as 30°). The results of the SWSs number recount are shown in Figure 8. It can be observed that the number of SWSs with different sound sources number has decreased significantly compared with Figure 2b. When the sources number is five, the existence probability of After obtaining the KDE curve, peak searching processing is exploited for the KDE curve to determine the estimated azimuth of sound sources. The core idea of peak searching is to search the local maximum value in the KDE curve of a DOA histogram. There are two important thresholds need to be set in the peak searching process, which are the peak detection threshold ε p (PDT) and the minimum difference threshold ε a . The ε a is used to determine whether two adjacent local maximum values belong to the same sound source azimuth and the ε p is used to determine whether a local maximum value can be considered as a peak. Meanwhile, it should be noted that each peak is regarded as a sound source. Therefore, the peak searching process completes the task of sound sources counting at the same time. The result of the peak searching is shown in Figure 7 where the data is the same as Figure 1. It can be clearly seen that there are seven accurate peaks around the DOAs of real sources, i.e., all the estimated DOAs are almost equal with the real ones. Moreover, compared with Figure 1, not only can the sound source azimuths which are lost in Figure 1 be found in the DOA histogram obtained through the proposed SDSCR algorithm, but also the localization accuracy of other sound source azimuths is improved.
Additionally, a statistical analysis for the SWS was taken for verifying the validity of proposed method. We exploited the same data which were used for the SWS counting in Figure 2b (angle between sources was set as 30 • ). The results of the SWSs number recount are shown in Figure 8. It can be observed that the number of SWSs with different sound sources number has decreased significantly compared with Figure 2b. When the sources number is five, the existence probability of one SWS drops from 38% to 17% which is more than half. It should be noted that the probability of no SWS has achieved different degrees of increasing in different sources number. More specifically, when the sources number was seven, the probability of no SWS rises from nearly 30% to nearly 60%. From the above results, the proposed localization method effectively makes the number of SWSs decreasing to reduce the impact of the presence of the SWS on localization results, and improves the localization accuracy compared with the conventional SSZ-based method. A more detailed evaluation of the proposed method is presented in next section.
DOAs of real sources, i.e., all the estimated DOAs are almost equal with the real ones. Moreover, compared with Figure 1, not only can the sound source azimuths which are lost in Figure 1 be found in the DOA histogram obtained through the proposed SDSCR algorithm, but also the localization accuracy of other sound source azimuths is improved. Additionally, a statistical analysis for the SWS was taken for verifying the validity of proposed method. We exploited the same data which were used for the SWS counting in Figure 2b (angle between sources was set as 30°). The results of the SWSs number recount are shown in Figure 8. It can be observed that the number of SWSs with different sound sources number has decreased significantly compared with Figure 2b. When the sources number is five, the existence probability of one SWS drops from 38% to 17% which is more than half. It should be noted that the probability of no SWS has achieved different degrees of increasing in different sources number. More specifically, when the sources number was seven, the probability of no SWS rises from nearly 30% to nearly 60%. From the above results, the proposed localization method effectively makes the number of SWSs decreasing to reduce the impact of the presence of the SWS on localization results, and improves the localization accuracy compared with the conventional SSZ-based method. A more detailed evaluation of the proposed method is presented in next section.

Results and Discussion
In this section, the experiments were conducted to evaluate the performance of the proposed approach in simulated and real environments. We compared the performance of the proposed method with the SSZ-based method [28] from two aspects which were accuracy of DOA estimation and sources counting respectively. The CMA-based method [27] was employed as a reference method in the evaluation of sources counting due to the model having achieved an approximately perfect performance in the sources counting.
For the DOA estimation case, both sources number and angle between sources were considered to value the accuracy of the proposed method and the SSZ-based method. We conducted the evaluation on the accuracy by mean absolute estimated error (MAEE) with different sources number and angle between sources in the anechoic room and reverberation room were tested, respectively. For the sources counting case, the percent of correct estimated number both in the anechoic and reverberation room were calculated for evaluating the performance of the proposed method, SSZ-based method and CMA-based method. For different test conditions, the experimental parameters and their corresponding values are listed in Table 2.

The Evaluation of Localization Performance in Simulated Environments
For the evaluation of localization performance, both the DOA estimation and sources counting need to be evaluated. The evaluation of DOA estimation was conducted in 4 different scenarios, i.e., Anechoic room, Room 1, Room 2, and Room 3. The parameters of these rooms are listed in Table 3. NTT is a speech database containing various speakers of different countries. The Chinese speech sub-data-base in NTT was employed as the test database in this section. We simulated the soundfield microphone system with the ROOMSIM package [29]. Both in the anechoic and reverberation room, the length of the simulated room was 6.25 m, the width was 3.75 m, and the height was 2.5 m. For the anechoic room, the sound sources number was 5 to 7 and the angle between sources were spaced by 10 • from 30 • to 50 • . For the reverberation room, the sound sources number and the angle between sources were 3 to 5 and 50 • , respectively. All sound sources were distributed on a circle with 2 m diameter, which were at the same height with the soundfield microphone and elevation angles were equal to 0 • . We selected 72 and 36 different orientations for each sound sources number and angle between sources to evaluating localization accuracy of proposed method in the anechoic and reverberation room respectively. Moreover, for testing the performance all around the microphone, it should be mentioned that the initial orientation was 0 • of the first source and then shifted by 5 • and 10 • in the anechoic and reverberation room respectively. The MAEE is utilized to measure the performance of the proposed method and SSZ-based method, which calculates the difference between the true DOA and the estimated DOA. The MAEE can be calculated by: where µ ij is the true DOA of the jth active source in the ith experiment, andμ ij is the corresponding estimated DOA. P i is the number of active sources, whileP i is the estimated number of sources in the ith experiment and N is the total number of experiments. P max = max{P i ,P i }. The following cases should be noted in (10): The MAEE is collected in a series of experiments proceeded in different scenarios.

DOA Estimation in Anechoic Room
We measured the MAEE of the estimated DOA with different sources number and angle between sources in anechoic room for the proposed method and SSZ-based method [28] respectively. The results are shown in Figure 9 where can be concluded that both of two methods exhibits a good performance for multiple sound sources localization. Meanwhile, compared with SSZ-based method, the proposed method has lower MAEE in the same condition.
where is the true DOA of the jth active source in the ith experiment, and ̂ is the corresponding estimated DOA.
is the number of active sources, while is the estimated number of sources in the ith experiment and N is the total number of experiments. max , . The following cases should be noted in (10): The MAEE is collected in a series of experiments proceeded in different scenarios. We measured the MAEE of the estimated DOA with different sources number and angle between sources in anechoic room for the proposed method and SSZ-based method [28] respectively. The results are shown in Figure 9 where can be concluded that both of two methods exhibits a good performance for multiple sound sources localization. Meanwhile, compared with SSZ-based method, the proposed method has lower MAEE in the same condition. From Figure 9, it can be found that proposed method provides a more accurate DOA estimation performance compared with the SSZ-based method. More specifically, when the angle between sources was set as 50° and the sound sources number were 7, the MAEE of the SSZ-based method and proposed method are 4.7° and 4.1°, respectively. Compared with the MAEE of the SSZ-based method, the MAEE of the proposed method decreasing 0.6° means that the localization accuracy improves by more than 10%. When the sound sources number is 7 and the angle between sources is 30°, compared with the SSZ-based method, the localization accuracy of proposed method improving From Figure 9, it can be found that proposed method provides a more accurate DOA estimation performance compared with the SSZ-based method. More specifically, when the angle between sources was set as 50 • and the sound sources number were 7, the MAEE of the SSZ-based method and proposed method are 4.7 • and 4.1 • , respectively. Compared with the MAEE of the SSZ-based method, the MAEE of the proposed method decreasing 0.6 • means that the localization accuracy improves by more than 10%. When the sound sources number is 7 and the angle between sources is 30 • , compared with the SSZ-based method, the localization accuracy of proposed method improving by nearly 20%. Moreover, two obvious trends can be observed that the difference of the MAEE between two methods become larger with the increasing of the sound sources number or decreasing of the angle between sources, which means that the proposed method has a more significant improvement compared with the SSZ-based method in the performance of localization. In addition, this phenomenon is caused by a large amount of SWSs which is consistent with the obtained conclusion (i.e., with the sound sources number increasing or the angle between sources decreasing, there are more and more the number of SWSs existing among the soundfield microphone signals.) in Section 2. In conclusion, from the results shown in Figure 9, the proposed method achieves a better localization performance compared with the SSZ-based method.

DOA Estimation in Reverberant Room
To investigate the robustness of the proposed method in reverberant environments, a set of experiments were proceeded in this section. The evaluation was conducted in four simulated rooms {Quiet room, Room 1, Room 2, Room 3}, and the parameters of them are listed in Table 3. The MAEE with different sources number in four rooms is shown in Figure 10.

DOA Estimation in Reverberant Room
To investigate the robustness of the proposed method in reverberant environments, a set of experiments were proceeded in this section. The evaluation was conducted in four simulated rooms {Quiet room, Room 1, Room 2, Room 3}, and the parameters of them are listed in Table 3. The MAEE with different sources number in four rooms is shown in Figure 10.
As the reverberation time increases, although the DOA estimation accuracy of the proposed method and SSZ-based method decrease, however, the two methods still exhibit a reliable localization accuracy. Additionally, it can be observed that compared with the SSZ-based method, the proposed method has a lower MAEE in all experimental conditions which means that the proposed method achieves a better localization performance in the reverberant condition. With the sound sources number increasing, the same conclusions as in the case of without reverberation experiments can be drawn. Moreover, with the reverberation time increasing, compared with the localization accuracy of the SSZ-based method, the improvement of the proposed method for the localization accuracy is more obvious. In conclusion, the proposed method exhibits the ability to provide robustness under reverberation conditions.

The Evaluation of Sources Counting
Aiming to evaluate the accuracy of sources counting, the evaluation was conducted in five different simulated scenarios (i.e., Anechoic room, Quiet room, Room 1, Room 2 and Room 3). In each simulated room, we tested a certain number of simultaneously active sources ranging from 3 to 5. The angle between sources was set as 50°. The results of the estimated sources counting accuracy with 95 % confidence intervals are shown in Figure 11. As the reverberation time increases, although the DOA estimation accuracy of the proposed method and SSZ-based method decrease, however, the two methods still exhibit a reliable localization accuracy. Additionally, it can be observed that compared with the SSZ-based method, the proposed method has a lower MAEE in all experimental conditions which means that the proposed method achieves a better localization performance in the reverberant condition. With the sound sources number increasing, the same conclusions as in the case of without reverberation experiments can be drawn. Moreover, with the reverberation time increasing, compared with the localization accuracy of the SSZ-based method, the improvement of the proposed method for the localization accuracy is more obvious. In conclusion, the proposed method exhibits the ability to provide robustness under reverberation conditions.

The Evaluation of Sources Counting
Aiming to evaluate the accuracy of sources counting, the evaluation was conducted in five different simulated scenarios (i.e., Anechoic room, Quiet room, Room 1, Room 2 and Room 3). In each simulated room, we tested a certain number of simultaneously active sources ranging from 3 to 5. The angle between sources was set as 50 • . The results of the estimated sources counting accuracy with 95% confidence intervals are shown in Figure 11. It could be seen from Figure 11a,b that when sources number equal 3, the proposed method and two comparison methods exhibit 100% estimated sources counting accuracy under both anechoic room and quiet room. When the sources number is equal to 4   It could be seen from Figure 11a,b that when sources number equal 3, the proposed method and two comparison methods exhibit 100% estimated sources counting accuracy under both anechoic room and quiet room. When the sources number is equal to 4 and 5, three methods still show a good performance in the estimated sources counting accuracy, but the accuracy of the proposed method is slightly higher than the SSZ-based and CMA-based method. Figure 11c-e show that as the reverberation time increases, the estimated sources counting accuracy of the three methods decreases, but the proposed method still shows better sources counting accuracy than the SSZ-based and CMA-based method. More specifically, when the sources number equal to 5 and reverberation time is 450 ms, the sources counting accuracy of SSZ-based method and CMA-based method are both under 90%, while the sources counting accuracy of proposed method can still reach to 90%. A conclusion can be drawn that the proposed method is superior to SSZ-based and CMA-based method in the accuracy of sources counting.

The Evaluation of Localization Performance in Real Environments
After showing the evaluation results in simulated environments, the proposed SCSDR algorithm is further evaluated using soundfield microphone signals recorded in a real meeting room with an reverberation time approximately equal to 300 ms. The room is of dimension 6.5 m × 4.7 m × 2.8 m. The soundfield microphone (Twirling 720 VR Audio Recorder) [35] was placed at the center of room at a height of 1.5m and all speakers (i.e., sound source) are 1.3 m from the soundfield microphone. Additionally, the signal to noise ratio in the room was estimated at 20 dB. We evaluated the performance of the proposed method for different simultaneously active speakers number and angle between speakers. More specifically, the number of speakers was set as two and three, and the angle between speakers was 60 • and 120 • . For each pair of experimental parameters (i.e., the speakers number and angle between speakers), we selected 12 different locations for evaluation.
Results of evaluation for DOA estimation in the real environment are shown in Figure 12. From Figure 12, it can be found that both the proposed method and SSZ-based method provide a reliable result of localization in the real environment. Meanwhile, the performance of the proposed method is slightly better than that of the SSZ-based method. Compared with the results in simulated environments, we can find that the DOA estimation accuracy of the proposed method in the real environment is lower than that in the simulated environment. This phenomenon is caused by disturbing certain parameters such as ambient noise in the meeting room and reflection of objects (such as the desktop etc.). performance in the estimated sources counting accuracy, but the accuracy of the proposed method is slightly higher than the SSZ-based and CMA-based method. Figure 11c-e show that as the reverberation time increases, the estimated sources counting accuracy of the three methods decreases, but the proposed method still shows better sources counting accuracy than the SSZ-based and CMAbased method. More specifically, when the sources number equal to 5 and reverberation time is 450 ms, the sources counting accuracy of SSZ-based method and CMA-based method are both under 90%, while the sources counting accuracy of proposed method can still reach to 90%. A conclusion can be drawn that the proposed method is superior to SSZ-based and CMA-based method in the accuracy of sources counting.

The Evaluation of Localization Performance in Real Environments
After showing the evaluation results in simulated environments, the proposed SCSDR algorithm is further evaluated using soundfield microphone signals recorded in a real meeting room with an reverberation time approximately equal to 300 ms. The room is of dimension 6.5 m × 4.7 m × 2.8 m. The soundfield microphone (Twirling 720 VR Audio Recorder) [35] was placed at the center of room at a height of 1.5m and all speakers (i.e., sound source) are 1.3 m from the soundfield microphone. Additionally, the signal to noise ratio in the room was estimated at 20 dB. We evaluated the performance of the proposed method for different simultaneously active speakers number and angle between speakers. More specifically, the number of speakers was set as two and three, and the angle between speakers was 60° and 120°. For each pair of experimental parameters (i.e., the speakers number and angle between speakers), we selected 12 different locations for evaluation.
Results of evaluation for DOA estimation in the real environment are shown in Figure 12. From Figure 12, it can be found that both the proposed method and SSZ-based method provide a reliable result of localization in the real environment. Meanwhile, the performance of the proposed method is slightly better than that of the SSZ-based method. Compared with the results in simulated environments, we can find that the DOA estimation accuracy of the proposed method in the real environment is lower than that in the simulated environment. This phenomenon is caused by disturbing certain parameters such as ambient noise in the meeting room and reflection of objects (such as the desktop etc.).

Conclusions
In this paper, we found that SWSs, which are difficult to be detected by the conventional SSZbased method, are ubiquitous when multiple sound sources occur simultaneously. The missed detection of SWSs degrades the performance of the SSZ-based method both in DOA estimation and

Conclusions
In this paper, we found that SWSs, which are difficult to be detected by the conventional SSZ-based method, are ubiquitous when multiple sound sources occur simultaneously. The missed detection of SWSs degrades the performance of the SSZ-based method both in DOA estimation and sources counting. In order to find out the SWSs and improve the localization accuracy, we proposed a SDSCR algorithm for multiple sound sources localization. The SDSCR algorithm exploits joint intra-frame and inter-frame SDS discrimination to classify DOA estimates of all T-F bins into two categories (i.e., SWS components and SDS components). After the operation that retains SWS components and removes SDS components, the SWSs can be detected in histogram Statistics. The results show that the proposed method has a higher accuracy of DOA estimation as well as sources counting compared with the conventional SSZ-based method, and that it is robust in a reverberant environment. In future work, we will investigate the performance of the proposed method in various scenarios involving more sources with closer DOAs.