3D Multiple Sound Source Localization by Proposed Cuboids Nested Microphone Array in Combination with Adaptive Wavelet-Based Subband GEVD

Sound source localization is an active area of speech signal processing. The main challenge arises when the aim is simultaneous localization of multiple sound sources from overlapped speech signals with an unknown number of speakers. Therefore, a method is required that can estimate the number of speakers, along with the speakers' locations, with high accuracy and in real time. Spatial aliasing is an undesirable effect of the use of microphone arrays, which decreases the accuracy of localization algorithms in noisy and reverberant conditions. In this article, a cuboids nested microphone array (CuNMA) is first proposed for eliminating the spatial aliasing. The CuNMA is designed to receive the speech signal of all speakers in different directions. In addition, the inter-microphone distance is adjusted to provide enough microphone pairs for each subarray, which supplies appropriate information for 3D sound source localization. Subsequently, a speech spectral estimation method is used to evaluate the speech spectrum components. The suitable spectrum components are selected and the undesirable components are discarded from the localization process. The speech information differs across frequency bands. Therefore, the adaptive wavelet transform is used for subband processing in the proposed algorithm. The generalized eigenvalue decomposition (GEVD) method is implemented in sub-bands on all nested microphone pairs, and the probability density function (PDF) is calculated for estimating the direction of arrival (DOA) in different sub-bands and consecutive frames. The proper PDFs are selected by thresholding on the standard deviation (SD) of the estimated DOAs and the rest are eliminated. This process is repeated over time frames to extract the best DOAs. Finally, K-means clustering and the silhouette criterion are used for DOA classification in order to estimate the number of clusters (speakers) and the related DOAs.
All DOAs in each cluster are intersected to estimate the 3D position of each speaker. The closest point to all DOA planes is selected as the speaker position. The proposed method is compared with the hierarchical grid (HiGRID), perpendicular cross-spectra fusion (PCSF), time-frequency wise spatial spectrum clustering (TF-wise SSC), and spectral source model-deep neural network (SSM-DNN) algorithms in terms of accuracy and computational complexity on real and simulated data in noisy and reverberant conditions. The results show the superiority of the proposed method over these previous works.

Electronics 2020, 9, 867; doi:10.3390/electronics9050867

introduced that affect the path loss exponent. In addition, it is shown that the energy-based methods for SSL determine the appropriate path loss exponent accurately.
Stefanakis et al. proposed a perpendicular cross-spectra fusion (PCSF) method for SSL using a planar microphone array [35]. The PCSF method is introduced as a new algorithm for DOA estimation that uses analytic formulas in the time-frequency (TF) domain, and it estimates multiple DOAs in the TF domain for simultaneous sound sources. In addition, a coherence criterion based on the divergence property of DOA estimates is introduced for evaluating the reliability of different parts of a speech signal, in order to provide robustness in undesirable conditions. Ma et al. proposed a binaural SSL method based on the combination of a spectral source model and a deep neural network (SSM-DNN) [36]. The proposed method builds on a new framework for binaural source localization that combines model-based information on the spectral features of sound sources with DNNs. Initially, a background source model and a target source model are estimated in the training phase in order to extract the spectral features of the sound signals. When the background source identity is unknown, a universal background model is used for the learning phase. In the next step, the source models are jointly used to improve the localization process by selecting weighted source azimuths and DNN-based localization algorithms. Finally, the proposed method combines model-based and data-driven approaches in a single computational framework for undesirable conditions. Yang et al. proposed a TF-wise spatial spectrum clustering (TF-wise SSC) method for multiple SSL using a microphone array [37]. The proposed TF method based on spatial spectrum clustering is divided into two steps. In the first step, the spatial correlation matrix is calculated from the microphone signals and is denoised in the TF domain. The TF spatial spectrum is estimated based on the sub-band information and is then enhanced by an exponential transform.
In the second step, the source locations are calculated by searching for the maximum of the global spatial spectrum. The spatial spectrum is reassigned after the detection of each source, which allows locating the next speaker. This process continues until all speakers are detected.
In this article, a novel method is proposed for multiple simultaneous SSL in undesirable conditions. Spatial aliasing between microphone signals in microphone arrays decreases the precision of localization algorithms. Firstly, a cuboids nested microphone array (CuNMA) is proposed with a proper distribution of microphones. The microphone array is structured to provide enough microphone pairs for each frequency band of the nested microphone array. The spatial aliasing is eliminated by the use of the CuNMA, and all microphone information is suitable for the localization process. The speech spectral components differ across frequency bands, and some frequencies carry no useful information during speech recording, which decreases the localization accuracy due to the prevailing noise. A spectral estimation step is proposed for detecting the proper speech spectrum area, such that the localization accuracy is increased by removing the undesirable frequency components and the computational complexity is decreased by processing less information. Speech is a wideband, non-stationary signal with the windowed-disjoint orthogonality (W-DO) property [38]. Therefore, each TF bin is related to one speaker with high probability in multi-speaker conditions. Since processing all TF points has high computational complexity, the adaptive wavelet transform is used for subband processing. The advantage of the wavelet transform comes from its variable frequency resolution, matched to the speech signal. Therefore, the wavelet transform is designed to provide high frequency resolution in the low-frequency components that carry the speech spectral information. The generalized eigenvalue decomposition (GEVD) algorithm is a feasible method for estimating the impulse responses between the source and the microphone signals for DOA estimation.
In this article, the proposed subband GEVD (SBGEVD) algorithm resulting from the wavelet transform is used for DOA estimation on all microphone pairs in the CuNMA. Subsequently, the PDFs of the estimated DOAs are computed in each sub-band. The mathematical expectation and standard deviation (SD) are two important parameters of these PDFs for DOA estimation. The closer the DOA estimates within a sub-band are to each other, the more likely it is that only one independent speech source is present in that sub-band.
Subsequently, by thresholding on the SDs, the PDFs with larger SD are removed, and this process is repeated for all time frames. Finally, K-means clustering is applied to all retained DOAs, and the number of speakers is estimated by the silhouette criterion. After this estimation, and considering the DOAs of each cluster, the 3D sound source locations are calculated by intersecting all DOAs in each cluster. The DOAs for each microphone pair are plotted, and the intersection point (or the closest point to all DOA planes, when no single intersection point exists) is taken as the 3D source location. The proposed method is applied to real and simulated data for two and three simultaneous speakers. Also, the proposed method is compared with the HiGRID [33], PCSF [35], TF-wise SSC [37], and SSM-DNN [36] algorithms.
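The clustering stage described above can be illustrated with a short sketch: K-means is run for several candidate cluster counts and the count with the highest mean silhouette score is kept as the estimated number of speakers. The synthetic DOA values, the candidate range `k_max`, and the plain silhouette implementation are illustrative assumptions, not the article's exact configuration.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def silhouette(X, labels):
    """Mean silhouette coefficient over all samples (plain implementation)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    s = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        a = D[i, same & (np.arange(n) != i)].mean() if same.sum() > 1 else 0.0
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return s.mean()

def estimate_speakers(doas, k_max=5):
    """Pick the cluster count with the highest mean silhouette score."""
    best_k, best_s, best_labels = 2, -1.0, None
    for k in range(2, k_max + 1):
        _, labels = kmeans2(doas, k, minit='++')
        if len(set(labels)) < k:          # a cluster collapsed; skip this k
            continue
        s = silhouette(doas, labels)
        if s > best_s:
            best_k, best_s, best_labels = k, s, labels
    return best_k, best_labels

# two synthetic, well-separated DOA clusters (azimuth, elevation) in degrees
np.random.seed(0)
rng = np.random.default_rng(1)
doas = np.vstack([rng.normal([40, 10], 2, (50, 2)),
                  rng.normal([-60, 25], 2, (50, 2))])
k, labels = estimate_speakers(doas)
```

With two tight, well-separated clusters, the silhouette score peaks at k = 2, so the number of speakers is recovered correctly.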
In Section 3, the microphone signal model is introduced jointly with the proposed cuboids nested microphone array. Section 4 presents the proposed algorithm based on spectral estimation, subband processing with the adaptive wavelet transform, and subband GEVD. The PDFs, clustering, silhouette criterion, and intersections between DOAs are also described in that section. Section 5 presents the simulations and results for the proposed method in comparison with previous works on real and simulated data. Section 6 presents some conclusions.

Microphone Signal Model and Cuboids Nested Microphone Array
Microphone signal modeling is the first step in evaluating localization and tracking algorithms, since it prepares simulated signals similar to real conditions. Firstly, the microphone signal model is introduced; then, the cuboids nested microphone array with its analysis filter bank and down samplers is proposed as a proper structure to eliminate spatial aliasing and to increase the localization accuracy. In addition, the subarrays are introduced in this section.

Microphone Signal Model in Localization Algorithms
The localization algorithms are evaluated under controlled conditions to determine their robustness and accuracy. Therefore, these algorithms are examined on real and simulated data. In the evaluations of localization algorithms, the microphone signals are modeled to be similar to real scenarios. Ideal and real models are considered for the microphone signals in the evaluations of localization, tracking, and estimating the number of speakers. In the ideal model, the received signal at the microphone is a delayed and attenuated version of the source signal, namely:

x_m[n] = (1/r_m) s[n − τ_m] + v_m[n],

where x_m[n] is the received signal at the m-th microphone, r_m is the distance between the sound source and the m-th microphone, τ_m is the delay for the signal to travel from the source to the m-th microphone, and v_m[n] is the additive noise at the m-th microphone position. This model is considered ideal because the effects of indoor conditions and reverberation are not included. Reverberation is an important and undesirable environmental factor; localization algorithms are not valid without considering this effect. Therefore, the microphone signal model is designed to be similar to real conditions. The real model for the microphone signals is expressed as [39]:

x_m[n] = h_m[n] * s[n] + v_m[n],

where h_m[n] is the impulse response between the source and the m-th microphone, which contains the room reverberation effect and the attenuation of the speech signal due to the source-microphone distance. The received signal at the m-th microphone is obtained by the convolution (*) of the source signal with the room impulse response, which is highly similar to real scenarios. The noise is likewise modeled as in real conditions. Notice that the real model is selected for the simulations to make the results comparable to real environments. The near-field and far-field assumptions are considered for signal propagation in the environment.
In the near-field assumption, the source is located close to the microphone array, and the signal arrives at the microphones as a spherical wavefront. In the far-field assumption, however, the source is located far from the microphone array and the speech signal arrives at the array as a planar wavefront. The near-field assumption is selected for the simulations, considering the indoor conditions, room dimensions, and source locations. Figure 1 shows the near-field model for speech signal propagation between the source and the microphone array.
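The ideal near-field model described above can be sketched in a few lines. This is a minimal simulation, assuming whole-sample delays for simplicity (a fractional-delay filter would be needed for sub-sample accuracy); the source and microphone coordinates are arbitrary illustrative values.

```python
import numpy as np

def ideal_mic_signal(s, src, mic, fs=16000, c=342.0, noise_std=0.0, rng=None):
    """Ideal near-field model: x_m[n] = (1/r_m) * s[n - tau_m] + v_m[n].

    The delay is rounded to a whole number of samples for simplicity.
    """
    r = np.linalg.norm(np.asarray(src) - np.asarray(mic))   # source-mic distance (m)
    tau = int(round(r / c * fs))                            # propagation delay (samples)
    x = np.zeros(len(s) + tau)
    x[tau:] = s / r                                         # attenuated, delayed copy
    if noise_std > 0:                                       # optional additive noise v_m[n]
        rng = rng or np.random.default_rng(0)
        x += rng.normal(0, noise_std, len(x))
    return x

s = np.zeros(100); s[0] = 1.0                 # unit impulse as the "source"
x = ideal_mic_signal(s, src=[1.0, 1.0, 1.0], mic=[0.0, 0.0, 0.0])
# here r = sqrt(3) ≈ 1.732 m, so the impulse appears delayed by
# round(1.732 / 342 * 16000) = 81 samples, scaled by 1/r
```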
Figure 1. The near-field model for sound signal propagation in sound source localization (SSL) algorithms.

The Proposed Cuboids Nested Microphone Array for SSL
Spatial aliasing due to the inter-microphone distances in microphone arrays destroys the spectral information of the speech signal and decreases the accuracy of localization algorithms. Nested arrays are usually used in speech enhancement applications owing to their elimination of spatial aliasing. A linear nested microphone array was proposed for a speech enhancement algorithm in combination with an adaptive noise canceller [40]. However, the linear nested array is not useful in localization applications because it does not provide enough information for DOA estimation in different directions. In this section, a cuboids nested microphone array is proposed for the first time, which is applicable to 3D multiple simultaneous SSL. Each subarray is connected to its specific analysis filters. Figure 2 shows the block diagram of the proposed 3D localization method, where the CuNMA is shown on the left side of the diagram. The proposed CuNMA is designed to have the same characteristics for all speakers in different directions. There are enough microphone pairs in the direction of each speaker in this array [41]. Therefore, the proposed array does not impose restrictions on the speakers' locations.

Most spectral components of the speech signal lie in the frequency range [50-8000] Hz, with sampling frequency F_s = 16,000 Hz. The proposed CuNMA is designed to cover the frequency range [50-7600] Hz, which maintains the speech signal information. The proposed array is structured as 4 subarrays. The first subarray is designed for the highest frequency range B1 = [3800-7600] Hz. In this case, the central frequency of the analysis filter bank is f_c1 = 5700 Hz. The inter-microphone distance (d) follows the formula d ≤ λ/2 (where λ is the wavelength associated with the maximum frequency of the speech signal in the related subband) to avoid spatial aliasing. Then, d_1 is calculated as d_1 ≤ λ/2 = c/(2f) = 342 (m/s)/(2 × 7600 Hz) ≈ 2.3 cm for the first subarray. The second subarray is structured for the frequency range B2 = [1900-3800] Hz. The central frequency and inter-microphone distance are calculated as f_c2 = 2850 Hz and d_2 = 2 × d_1 ≤ 4.6 cm, respectively. The third subarray is designed to cover the frequency range B3 = [950-1900] Hz, where the inter-microphone distance is d_3 = 4 × d_1 ≤ 9.2 cm and the central frequency is f_c3 = 1425 Hz. Finally, the fourth subarray is designed for the lowest frequency range B4 = [50-950] Hz. The central frequency and inter-microphone distance are f_c4 = 500 Hz and d_4 = 8 × d_1 ≤ 18.4 cm, respectively.
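The spacing arithmetic above can be reproduced in a few lines. Note that c/(2 × 7600 Hz) is exactly 2.25 cm, which the article rounds to 2.3 cm before doubling for each lower-frequency subarray.

```python
# Spatial-aliasing bound for each CuNMA subarray: d <= lambda/2 = c / (2 * f_max),
# with the spacing bound doubled for each lower-frequency subarray (c = 342 m/s).
c = 342.0
f_max = 7600.0                                      # highest covered frequency (Hz)
d1 = c / (2 * f_max)                                # 0.0225 m (article rounds to 2.3 cm)
spacings_cm = {i: 100 * d1 * 2 ** (i - 1) for i in range(1, 5)}
print(spacings_cm)                                  # subarray index -> bound in cm
```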
The CuNMA is designed following the above specifications. The inter-microphone distance for the closest microphone pairs (1,2), (2,3), (3,4), (4,1), (5,6), (6,7), (7,8), and (8,5) is set to d_1 = 2.3 cm. The inter-microphone distance is d_2 = 3.25 cm for the second subarray with microphone pairs (1,3), (2,4), (5,7), and (6,8). This process is repeated for the third subarray with microphone pairs (4,7), (8,3), (5,4), (1,8), (6,1), (5,2), (6,3), and (2,7), with inter-microphone distance d_3 = 9.2 cm. Finally, the fourth subarray with microphone pairs (3,5), (6,4), (8,2), and (1,7) is designed with inter-microphone distance d_4 = 9.56 cm. Therefore, spatial aliasing is eliminated by the proposed CuNMA without any effect on the speech signal information. Figure 3 shows the proposed CuNMA with the microphone pairs related to each subarray. The four subarrays in this figure are designed based on the calculated microphone distances.
Each subarray designed in Figure 3 needs an analysis filter bank to prevent spatial aliasing. The subarrays require multirate sampling with down samplers to realize the appropriate filter for each subband. The analysis filter H_i(z) and down sampler D_i are implemented as a multilevel tree structure, which is shown in Figure 4. Each level of this tree includes a high-pass filter (HPF) HP_i(z), a low-pass filter (LPF) LP_i(z), and a down sampler D_i. The relationships between the analysis filter H_i(z), high-pass filter HP_i(z), and low-pass filter LP_i(z) are expressed as [40]:

H_1(z) = HP_1(z),
H_i(z) = HP_i(z^(2^(i−1))) ∏_{k=1}^{i−1} LP_k(z^(2^(k−1))), for i = 2, 3,
H_4(z) = LP_1(z) LP_2(z^2) LP_3(z^4).

In each level of the tree, a 52-tap LPF and a 52-tap HPF are selected, designed with the Remez method as finite impulse response (FIR) filters. The filters have a stopband attenuation of −50 dB and a transition band of 0.0647 rad/s. Figure 5 shows the frequency response of the analysis filter bank implemented on the closest (subarray 1) and furthest (subarray 4) microphone pairs, respectively. Therefore, the microphone signals are prepared to enter the proposed system.
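One level of such a filter pair can be sketched with `scipy.signal.remez`. The band edges below are illustrative choices, not the article's exact design specifications; the high-pass partner is obtained from the low-pass prototype by (−1)^n modulation, which mirrors the passband about fs/4.

```python
import numpy as np
from scipy import signal

# 52-tap equiripple (Remez) low-pass prototype for one level of the analysis
# tree; the band edges (normalized frequency, fs = 1) are illustrative only.
numtaps = 52
lpf = signal.remez(numtaps, [0.0, 0.23, 0.27, 0.5], [1, 0])

# High-pass partner via (-1)^n modulation of the low-pass prototype.
n = np.arange(numtaps)
hpf = (-1) ** n * lpf

# Sanity checks on the responses at DC and at the Nyquist frequency:
h_lp_dc = abs(np.sum(lpf))               # |H_LP(0)|, should be close to 1
h_hp_nyq = abs(np.sum(hpf * (-1) ** n))  # |H_HP(pi)|, should be close to 1
```

The same prototype is reusable at every level of the tree because each down sampler halves the sampling rate, so the normalized band edges stay fixed.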

The Proposed Algorithm for Multiple SSL Based on SBGEVD
Multiple simultaneous SSL with an unknown number of speakers in noisy and reverberant conditions is a challenge in speech processing. Techniques proposed for SSL should be able to increase the accuracy of the algorithms with negligible computational complexity. In addition, the proposed method has to be robust in noisy and reverberant conditions so as to provide trustworthy results for speaker localization. Since speech is a W-DO signal, the main focus of the proposed method is subband processing based on the speech signal components. In the following, each block of the proposed algorithm in Figure 2 is explained in detail.

Spectral Estimation for Noise Reduction
Noisy speech spectral components decrease the accuracy of speech processing algorithms such as localization and tracking. The speech spectral components differ across frequency bands. For example, the high frequency bands contain little speech information, while noise and reverberation have a stronger effect on these components. The idea here is to propose a speech spectral estimation block that keeps the proper speech frequency components and eliminates the undesirable ones. Therefore, a spectral estimation block is considered in this section. Spectral estimation methods are divided into parametric and non-parametric algorithms [42]. The non-parametric methods, such as the periodogram and Welch methods, are based on the Fourier transform computed on a windowed signal, after which a smoothing step is applied. In parametric methods, the signal spectrum is modeled by a mathematical formula and the model's parameters are estimated from the speech signal; the signal spectrum is then calculated via one of the following models: autoregressive (AR), moving average (MA), or autoregressive moving average (ARMA). The experiments in [43] show that the Welch method provides a smoother spectrum in comparison with the other algorithms; hence, it is adopted in the proposed method in this article.
The Welch method was introduced to compensate for the adverse effects of the periodogram. In the Welch method, data blocking with overlapping and spectral averaging is used for spectral estimation. A sequence x_{m,k} is assumed for k = 0, 1, 2, ..., N − 1. I blocks of length L are defined as:

x^i_{m,k} = x_{m,k+iD}, k = 0, 1, ..., L − 1, i = 0, 1, ..., I − 1,

where x^i_{m,k} is the i-th block of the sequence x_{m,k}, L is the block length, and D is the forward step (overlap rate). Each of the I blocks is multiplied by the window w(k) and its periodogram is calculated. Ŝ_i(ω) is the normalized periodogram for the i-th block, defined as:

Ŝ_i(ω) = (1/(L·E_w)) |Σ_{k=0}^{L−1} w(k) x^i_{m,k} e^{−jωk}|².

The normalization factor E_w is the average window power, expressed as:

E_w = (1/L) Σ_{k=0}^{L−1} |w(k)|².

The Welch estimate of the power spectrum is defined as the average periodogram over the I blocks, namely:

Ŝ(ω) = (1/I) Σ_{i=0}^{I−1} Ŝ_i(ω).

The Welch method is similar to the periodogram in terms of bias, but is an improved version in terms of variance. If the signal is long enough, non-overlapping blocks are used for the Welch method, but up to 50% overlap is selected in the case of short data. Figure 6 shows an example of Welch spectral estimation on a time frame of a speech signal. As seen, the signal spectral amplitude is strong in some areas and weak in others. The selected threshold for the spectral amplitude in each frame is 30% of the maximum spectral amplitude in that frame. The areas with an amplitude lower than this value are excluded from the localization process.
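The Welch estimator and the 30% threshold rule above can be sketched directly in numpy. The 1 kHz test tone, block length, and hop are illustrative assumptions, not values taken from the article.

```python
import numpy as np

def welch_psd(x, L=256, D=128, window=None):
    """Welch estimate following the formulas above: blocks of length L,
    hop D, and windowed periodograms averaged over the I blocks."""
    w = window if window is not None else np.hanning(L)
    Ew = np.mean(w ** 2)                        # average window power E_w
    I = (len(x) - L) // D + 1                   # number of blocks
    P = np.zeros(L // 2 + 1)
    for i in range(I):
        seg = x[i * D:i * D + L] * w            # windowed i-th block
        P += np.abs(np.fft.rfft(seg)) ** 2 / (L * Ew)   # normalized periodogram
    return P / I                                # average over the I blocks

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t) + 0.1 * np.random.default_rng(0).normal(size=fs)
psd = welch_psd(x)
freqs = np.fft.rfftfreq(256, 1 / fs)

# keep only bins whose estimated amplitude exceeds 30% of the frame maximum
mask = psd >= 0.3 * psd.max()
```

The surviving bins in `mask` mark the spectral regions that would be passed on to the localization stage; everything below the threshold is discarded.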


Adaptive Wavelet Transform
As mentioned, the speech spectral components depend on the frequency bands. Therefore, paying closer attention to each frequency band increases the accuracy of the localization algorithms. Since the speech signal is W-DO, with high probability there is just one speaker in each narrow frequency band. This property is exploited by applying sub-band processing to the speech signals. The sub-band divisions could be uniform, but speech is a non-stationary signal and its frequency content varies over time. Therefore, the adaptive wavelet packet decomposition (AWPD) is proposed. The output of the Welch method enters the AWPD block. This adaptive wavelet is selected in the proposed system because of its high and variable resolution in the low-frequency components that carry the speech spectral information. The Welch spectral estimation block output is denoted x_{m,i}[n], where m is the microphone index (m = 1, ..., 8) and i is the analysis filter index in the CuNMA (i = 1, ..., 4). Figure 7 shows the structure of the AWPD block in the applied method. This wavelet transform is adaptive because the number of levels and channels (p) varies based on the estimated speech spectral components.

The continuous wavelet transform (CWT) extracts translation and scale coefficients from a continuous signal, and its output is well suited to TF analysis. The CWT of a continuous signal x(t) using the wavelet ψ(t) is defined as [44]:

Wψ(s, τ) = ∫ x(t) ψ̄_{s,τ}(t) dt,

where ψ_{s,τ}(t) is expressed as:

ψ_{s,τ}(t) = (1/√s) ψ((t − τ)/s),

where s and τ are the scale and translation parameters, respectively, and the overline denotes the complex conjugate operator. Wψ(s, τ) are the wavelet coefficients and ψ(t) denotes the mother wavelet, which can be chosen in different forms. The discrete wavelet transform (DWT) is a powerful instrument in speech signal processing. The DWT can be expressed based on the CWT formulas; the only difference in the DWT is that the scale and translation parameters are written as powers of 2.
The parameters s and τ are taken as s = 2^a and τ = b × 2^a, where (a, b) ∈ Z²; the DWT is then given by:

Wψ(a, b) = (1/√(2^a)) ∫ x(t) ψ̄((t − b·2^a)/2^a) dt.

The main part of the DWT is the signal decomposition. The idea behind the decomposition in the DWT is the use of LPFs and HPFs in combination with down samplers. The number of decomposition levels a is selected based on the desired cut-off frequency. Figure 7 shows the structure of the wavelet packet decomposition, where f_0[n], f_1[n], and ↓2 are the HPF, LPF, and down sampler with factor 2, respectively. The DWT input and output signals are denoted x̃_{m,i}[n] and ỹ_{m,i,j}[n], respectively, where j is the index of the sub-band created by the DWT on the input signal. The relation between the LPF and HPF is written as:

f_1[n] = (−1)^n f_0[G − 1 − n],

where G is the filter length and n = 0, 1, ..., G − 1. There is no definitive method to select the mother wavelet; the selection depends on the input signal and the application. Daubechies, Haar, Coiflet, Biorthogonal, Symlets, Morlet, and Mexican hat are common mother wavelets [45]; the Daubechies (Db4) wavelet is selected for the DWT in the proposed system because it performs well on speech signals. Therefore, the AWPD is used for subband processing, and the speech signals are divided into high frequency resolution sub-bands for the GEVD algorithm.
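The quadrature-mirror relation and one level of the decomposition tree can be sketched as follows. For brevity this uses the 4-tap Daubechies filter (db2), whose coefficients have a simple closed form; the article uses the 8-tap Db4, for which the construction is identical. Variable names `lo`/`hi` denote the low-pass and high-pass branches.

```python
import numpy as np

# 4-tap Daubechies low-pass filter (db2) from its closed-form coefficients
# (the article's Db4 has 8 taps; the QMF construction below is the same).
s3 = np.sqrt(3.0)
lo = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))
G = len(lo)
n = np.arange(G)
hi = (-1) ** n * lo[::-1]          # QMF relation: hi[n] = (-1)^n * lo[G-1-n]

def dwt_level(x):
    """One level of the decomposition tree: filter each branch, downsample by 2."""
    a = np.convolve(x, lo)[::2]    # approximation (low-pass branch)
    d = np.convolve(x, hi)[::2]    # detail (high-pass branch)
    return a, d

x = np.sin(2 * np.pi * np.arange(64) / 16.0)
approx, detail = dwt_level(x)
```

The orthonormality checks in the tests (unit energy, sum √2 for the low-pass, zero sum for the high-pass) are the standard properties a Daubechies filter pair must satisfy, so they also hold for Db4.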

The Subband GEVD Algorithm for DOA Estimation
The SSL algorithm in this article is based on TDOA estimations from microphone pairs. The GEVD is a novel method for source localization, which is implemented in sub-bands on microphone pairs. The eigenvector corresponding to the smallest eigenvalue of the covariance matrix contains the impulse responses between the source and the microphone signals, which is all the information required for TDOA estimation [46]. If the room is considered a linear and time-invariant (LTI) system, the signals for each microphone pair are written as:

x_i[n] = (g_i ∗ s)[n], i = 1, 2,     (12)

where g_1 and g_2 are the impulse responses from the source to microphones 1 and 2, respectively. Then, signal x_i[n] is expressed in vector form as:

x_i = [x_i[n], x_i[n − 1], . . . , x_i[n − M + 1]]^T.     (13)

Electronics 2020, 9, 867 12 of 28

As seen in Equation (13), the microphone signal x_i is represented by M samples of this signal, and T denotes vector transposition. The impulse response vector with length M is introduced as:

g_i = [g_i[0], g_i[1], . . . , g_i[M − 1]]^T.     (14)

The linear property in Equation (12) comes from the fact that x_i = g_i ∗ s (i = 1, 2), so that g_2 ∗ x_1 = g_2 ∗ g_1 ∗ s = g_1 ∗ x_2. Therefore, it can be written as:

x_1^T g_2 = x_2^T g_1.     (15)

The GEVD algorithm estimates the TDOA between each microphone pair by the use of the covariance matrix, which is expressed as:

R = E{[x_1^T, x_2^T]^T [x_1^T, x_2^T]},     (16)

where the covariance matrix components are R_{x_i x_j} = E{x_i x_j^T} and E is the expected value. Next, vector u with length 2M for estimating the impulse responses is proposed as:

u = [g_2^T, −g_1^T]^T.     (17)

From Equations (12) and (16), it is clear that Ru = 0, and vector u is the eigenvector of covariance matrix R corresponding to eigenvalue 0. The accurate estimation of vector u is not possible in real conditions because of the non-stationarity of the speech signal and the background noise. We assume that the noise is stationary in short frames. Then, the noise covariance matrix computed in the silent parts of the signal is used to update the formulas in the noisy speech frames. The GEVD method extracts the generalized eigenvector corresponding to the smallest generalized eigenvalue of the noise covariance matrix (R_b^M) and the signal covariance matrix (R_x^M) by the use of stochastic gradient algorithms.
The noise covariance matrix (R_b^M) is estimated from the silent parts of the signal; it cannot be updated in frames where noise and speech are present simultaneously. The noise covariance matrix estimated in silence is therefore used to update the formulas in the noisy speech parts of the signal.
The generalized eigenvector is calculated by minimizing the cost function u^T R_x^M u in an iterative process, instead of updating the full matrices R_b^M and R_x^M, using the generalized eigenvector corresponding to the smallest generalized eigenvalue. The stacked observation vector is defined as:

x[n] = [x_1^T[n], x_2^T[n]]^T.     (18)

Therefore, the error signal e[n] is processed as:

e[n] = u^T[n] x[n],     (19)

where u[n] is obtained in an iterative process by the least mean square error (LMS) algorithm as:

u[n + 1] = u[n] − μ ∇e²[n],     (20)

where μ is the adaptation step in this algorithm. The gradient of the squared error signal is calculated as follows:

∇e²[n] = 2 e[n] x[n].     (21)

Vector u[n + 1] is calculated by replacing Equations (19) and (21) in Equation (20):

u[n + 1] = u[n] − 2μ u^T[n] x[n] x[n].     (22)

Taking the expected value of Equation (22) after convergence shows that u[∞] is the generalized eigenvector corresponding to the smallest generalized eigenvalue of the matrices R_b^M and R_x^M. An extra normalization step is implemented at each iteration to prevent error propagation, normalizing u[n + 1] by its norm, and vector u is finally taken as the converged, normalized solution u[∞]. Impulse responses g_1 and g_2 are obtained from the estimated vector u. The signal at the source location can be obtained by deconvolution between these impulse responses and the microphone signals. Also, the TDOA between each microphone pair is calculated from the estimated vector u. The results section will show the convergence of the SBGEVD algorithm to the related TDOA for each sound source. The GEVD algorithm is implemented on each sub-band (SBGEVD), and the TDOA (or DOA) values are calculated for all microphone pairs in the sub-bands. The cumulative distribution function (CDF) is computed for all calculated DOAs in each sub-band. The sub-bands with information from just one dominant speaker are selected by thresholding on these CDFs (or PDFs), and the other sub-bands, with inappropriate information, are rejected. This thresholding is based on the SD calculated on the DOAs in each sub-band.
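The underlying eigenvector idea of Equations (12)-(17) can be sketched in batch form. This is not the paper's adaptive LMS/GEVD update: it builds the covariance matrix R directly from noise-free synthetic delays and reads the TDOA from the eigenvector of the smallest eigenvalue; the delays d1, d2 and the modeled length M are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 16                      # modeled impulse-response length
d1, d2 = 3, 7               # propagation delays (samples) to mics 1 and 2
N = 4000

s = rng.standard_normal(N)                   # broadband source (white noise)
x1 = np.concatenate([np.zeros(d1), s])[:N]   # mic 1: source delayed by d1
x2 = np.concatenate([np.zeros(d2), s])[:N]   # mic 2: source delayed by d2

# Stack frames v = [x1[n..n-M+1]; x2[n..n-M+1]] and average R = E{v v^T}
R = np.zeros((2 * M, 2 * M))
for n in range(M - 1, N):
    f1 = x1[n - M + 1 : n + 1][::-1]
    f2 = x2[n - M + 1 : n + 1][::-1]
    v = np.concatenate([f1, f2])
    R += np.outer(v, v)
R /= N - M + 1

# Eigenvector of the smallest eigenvalue has the form u = [g2; -g1]
w, V = np.linalg.eigh(R)    # eigh returns eigenvalues in ascending order
u = V[:, 0]
g2_hat, g1_hat = u[:M], -u[M:]

# TDOA: offset between the dominant taps of the two recovered responses
tdoa = np.argmax(np.abs(g1_hat)) - np.argmax(np.abs(g2_hat))
print(tdoa)  # d1 - d2 = -4
```

Even though overmodeling (M larger than the true delays) leaves the null space multi-dimensional, every null vector carries the two responses with the same relative shift, so the peak offset still recovers the TDOA.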
The SBGEVD algorithm is iterated for every three continuous frames and the SD is calculated for the sub-band CDFs; this process is repeated over successive time frames to cover one second of the overlapped speech signal of multiple speakers. Therefore, the update time of the proposed SSL algorithm is one second. When the process is completed, all DOAs passed by the SD thresholding decision step are entered into the clustering stage for estimating the number of speakers and the 3D SSL. K-means clustering with the silhouette criterion [47] is selected for this final step, and the K-means algorithm is implemented on all DOAs passed by the SD decision step.
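The SD-based acceptance test can be sketched as follows; the sub-band labels, DOA values, and dictionary interface are hypothetical, and only the SD thresholding rule comes from the text (the paper's ±10° value is used here):

```python
import numpy as np

SD_THRESH = 10.0  # degrees; the paper's acceptance threshold on the SD

def select_subbands(doa_by_subband, thresh=SD_THRESH):
    # Keep only sub-bands whose DOA estimates cluster tightly:
    # a small SD (near-Gaussian PDF) indicates a single dominant speaker
    kept = {}
    for band, doas in doa_by_subband.items():
        if np.std(doas) < thresh:
            kept[band] = float(np.mean(doas))
    return kept

# Hypothetical DOA estimates (degrees) from three sub-bands over 5 frames
doa_by_subband = {
    "0.8-1 kHz": [58, 60, 59, 61, 59],    # one dominant speaker -> kept
    "1.8-2 kHz": [59, 58, 60, 59, 57],    # one dominant speaker -> kept
    "2-2.5 kHz": [12, 59, 104, 33, 88],   # mixed speakers -> rejected
}
print(select_subbands(doa_by_subband))  # both single-speaker bands kept
```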

Clustering and 3D Sound Source Localization
K-means is an unsupervised clustering algorithm for data classification. The idea is to define K centroids, placed as far from each other as possible for the best results. In the next step, each datum (an estimated DOA produced by the SBGEVD method and passed by the SD decision step) is associated with the closest centroid. Then, the centroids are recalculated from the data associated with each cluster: the new centroid is the average of the existing data in the cluster. The first data grouping is then discarded, and the grouping step (associating data with the closest centroid) is iterated based on the new centroids. These steps are repeated until the centroids show no tangible changes. In other words, the aim is to minimize the following cost function:

J = Σ_{m=1}^{K} Σ_{n=1}^{N_k} ‖DOA_n^{(m)} − C_m‖²,

where ‖DOA_n^{(m)} − C_m‖ is the Euclidean distance between datum DOA_n^{(m)} and centroid C_m, and N_k is the number of DOAs in each cluster. The main issue in K-means clustering is the estimation of the value of K. Here, the number of speakers is determined by estimating the K value and, finally, the 3D position of each speaker is estimated from the intersections between the DOAs in each cluster. The silhouette criterion is used for estimating the K value in the proposed method.
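The assign/average loop above can be sketched in a minimal 1-D form (the actual system clusters the DOA data the same way; the sample DOAs below are hypothetical):

```python
import numpy as np

def kmeans(doas, k, iters=50, seed=0):
    # Minimal K-means on DOA values (1-D here for brevity)
    rng = np.random.default_rng(seed)
    doas = np.asarray(doas, dtype=float)
    centroids = rng.choice(doas, size=k, replace=False)  # init from the data
    labels = np.zeros(len(doas), dtype=int)
    for _ in range(iters):
        # 1) associate each DOA with the nearest centroid (Euclidean distance)
        labels = np.argmin(np.abs(doas[:, None] - centroids[None, :]), axis=1)
        # 2) recompute each centroid as the average of its assigned DOAs
        new = np.array([doas[labels == j].mean() if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # no tangible change -> converged
            break
        centroids = new
    return centroids, labels

doas = [58, 60, 59, 61, 120, 118, 122, 119]   # two well-separated speakers
centroids, labels = kmeans(doas, k=2)
print(np.sort(centroids))  # two centroids near 59.5 and 119.75 degrees
```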
The silhouette criterion is a method for validating the assignment of data to clusters. It shows graphically whether a datum fits its cluster or should be associated with another cluster. Assume the data have been clustered with a specific K value. For each datum i, v(i) is defined as the average dissimilarity between datum i and all other data in the same cluster, measured with the Euclidean distance. A smaller v(i) value indicates a better fit of datum i in its cluster. We define c(i) as the lowest average dissimilarity of datum i to the clusters of which it is not a member. The cluster with the smallest c(i) value is the neighbor cluster of datum i, because it is the best cluster for datum i if it does not fit well in its current one. The silhouette value Z(i) for datum i is defined as:

Z(i) = (c(i) − v(i)) / max{v(i), c(i)}.

Z(i) can be simplified mathematically as:

Z(i) = 1 − v(i)/c(i) if v(i) < c(i);  0 if v(i) = c(i);  c(i)/v(i) − 1 if v(i) > c(i).

If v(i) ≪ c(i), the Z(i) value approaches 1. Since v(i) is the average dissimilarity of datum i to its own cluster, this shows that datum i is assigned properly, while a large c(i) shows that it is far from the neighboring clusters. If Z(i) is close to −1, it is better to transfer datum i to the neighbor cluster. This means that the silhouette value Z(i) is a criterion for validating an unsupervised clustering algorithm on a series of data. In the results and discussions section, the mean silhouette value (MSV) is shown for various K values and signal frames. The peak position in the MSV plot indicates the number of speakers; K is the index of the peak value. If the MSV curve has no maximum, the number of speakers is taken as 1. Each cluster represents one speaker. Next, the DOAs in each cluster are represented as planes and their intersections are calculated.
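The silhouette value Z(i) and the MSV can be computed directly from the definitions above; the two-cluster DOA data here are hypothetical, and at least two clusters are assumed (K ≥ 2):

```python
import numpy as np

def silhouette_values(data, labels):
    # Z(i) = (c(i) - v(i)) / max(v(i), c(i)), where
    # v(i): mean distance of datum i to the other members of its own cluster,
    # c(i): lowest mean distance of datum i to any other cluster
    data = np.asarray(data, dtype=float)
    zs = []
    for i, (x, li) in enumerate(zip(data, labels)):
        own = [abs(x - y) for j, (y, lj) in enumerate(zip(data, labels))
               if lj == li and j != i]
        v = np.mean(own) if own else 0.0
        others = [np.mean([abs(x - y) for y, lj in zip(data, labels) if lj == l])
                  for l in set(labels) if l != li]
        c = min(others)
        zs.append((c - v) / max(v, c))
    return np.array(zs)

doas = np.array([58.0, 60.0, 59.0, 120.0, 118.0, 122.0])
labels = [0, 0, 0, 1, 1, 1]
msv = silhouette_values(doas, labels).mean()  # mean silhouette value (MSV)
print(msv > 0.9)  # tight, well-separated clusters score near 1
```

In the proposed method, this MSV would be computed for each candidate K and the peak of the MSV curve selects the number of speakers.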
This process is repeated for all clusters to obtain the 3D positions of all K speakers: for each cluster k = 1, . . . , K, the 3D location of speaker k, (x̂_k, ŷ_k, ẑ_k), is obtained from the intersections of the N_k DOAs in that cluster, where K is the number of speakers and N_k is the number of DOAs in cluster k. Therefore, the accuracy of the localization algorithm is increased by the cuboids nested microphone array, the sub-band processing of the GEVD method, the SD thresholding decision on the DOA values, and the intersections between the DOAs in each cluster for estimating the 3D positions.
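One common way to realize the "intersection between DOAs" is a least-squares intersection of bearing lines; this sketch assumes each DOA is expressed as a 3D bearing from a sub-array center (the centers and bearings below are hypothetical, aimed at the paper's S1 position):

```python
import numpy as np

def intersect_bearings(origins, directions):
    # Least-squares point nearest to a set of 3-D bearing lines:
    # p minimizes sum_i || (I - d_i d_i^T)(p - a_i) ||^2
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for a, d in zip(origins, directions):
        d = np.asarray(d, dtype=float)
        d /= np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)   # projector orthogonal to the bearing
        A += P
        b += P @ a
    return np.linalg.solve(A, b)

# Hypothetical sub-array centers (cm) and bearings toward a speaker
speaker = np.array([60.0, 220.0, 170.0])   # true S1 location from the paper
origins = [np.array([165.0, 140.0, 110.0]),
           np.array([185.0, 160.0, 110.0]),
           np.array([175.0, 150.0, 130.0])]
directions = [speaker - o for o in origins]  # ideal, noise-free bearings
p_hat = intersect_bearings(origins, directions)
print(np.round(p_hat, 1))  # recovers the true location (60, 220, 170)
```

With noisy DOAs the lines no longer meet exactly, and the same least-squares solution returns the point closest to all bearing lines, which is the natural estimate for (x̂_k, ŷ_k, ẑ_k).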

Results and Discussions
The experiments are implemented on real and simulated data for evaluating the proposed multiple simultaneous SSL algorithm. The Texas Instruments and Massachusetts Institute of Technology (TIMIT) dataset is used for the simulated data [48]. The proposed algorithm is implemented on overlapped speech signals. In real conditions, about 90% of overlapped speech comes from two simultaneous speakers, almost 8% from three simultaneous speakers, and the rest from four or more simultaneous speakers [49]. Hence, around 98% of overlapped speech involves just two or three simultaneous speakers, and the simulations are therefore implemented for scenarios with two and three speakers. One male and one female speaker (S1 and S2) are considered for two simultaneous speakers, and two male and one female speaker for three overlapped speakers. The evaluations are also implemented on real data, recorded at a speech processing laboratory at the Universidad Tecnológica Metropolitana, to compare the results between simulated and real conditions. The microphone signals are recorded in the acoustic room by a cuboids nested 8-microphone array located in the middle of the room. The microphones are connected to a speech recording system, which captures the synchronized signals simultaneously. Loudspeakers connected to separate computers are used instead of human speakers in the recording room, for better control of the transmitted signal power. The speech signals are played by the loudspeakers in front of the CuNMA and recorded by the microphones. In the simulations, a 45 s speech signal is considered for each speaker, of which 27.2 s overlap between two speakers and 20.6 s overlap among three simultaneous speakers. Figure 8 shows the speech signal of each speaker jointly with the two- and three-speaker overlapped signals.
The simulation conditions are adjusted to be similar to the real scenarios. The room dimensions are set to (350, 300, 400) cm. The three speakers are located at (60, 220, 170) cm (S1), (310, 245, 175) cm (S2), and (95, 75, 180) cm (S3), respectively. In addition, the CuNMA is placed in the middle of the room, with the center of the array at (175, 150, 120) cm.
Figure 9 shows a view of the room with the locations of the speakers and the microphone array. The microphone positions, speakers' locations, and room dimensions are summarized in Table 1. Noise and reverberation are two important factors that decrease the accuracy of localization algorithms, and both are clearly present in real conditions; therefore, they must be considered in the simulated scenarios. White Gaussian noise with variable power is added in the simulations to create noisy signals with different signal-to-noise ratios (SNR); this Gaussian noise models the real noisy environment in the simulated data. The Image model is selected for the simulations to reproduce the reverberation effect of real conditions [50]. The Image model recreates indoor reverberation with high accuracy, similar to real environments. It estimates the room impulse response between the source and a microphone from the room dimensions, microphone location, source position, impulse response length, sampling frequency, surface reflection coefficients, and reverberation time (RT_60). The received signal at the microphone position is generated by the convolution between the generated room impulse response and the source signal. The room reverberation time is easily changed in simulations, but changing RT_60 in real conditions is hard.
Absorbent panels are used on the walls and floor to change the room reverberation time in the real scenarios. The RT_60 value is changed by moving the panels and by increasing or decreasing their number.
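The noisy-signal generation can be sketched as follows: white Gaussian noise is scaled so that the mixture reaches a target SNR. This covers only the SNR step of the setup above; the image-model reverberation is omitted:

```python
import numpy as np

def add_noise(signal, snr_db, rng=None):
    # Scale white Gaussian noise so that 10*log10(Ps/Pn) equals snr_db
    rng = rng or np.random.default_rng(0)
    p_signal = np.mean(signal ** 2)
    noise = rng.standard_normal(len(signal))
    p_noise_target = p_signal / (10 ** (snr_db / 10))
    noise *= np.sqrt(p_noise_target / np.mean(noise ** 2))
    return signal + noise

# Toy "speech" carrier; a real test would use a TIMIT utterance
s = np.sin(2 * np.pi * 0.01 * np.arange(8000))
x = add_noise(s, snr_db=5.0)

# Verify the realized SNR of the mixture
snr = 10 * np.log10(np.mean(s ** 2) / np.mean((x - s) ** 2))
print(round(snr, 6))  # 5.0 by construction
```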
The experiments are implemented on environmental scenarios; three main scenarios are designed for the evaluations. The first scenario is a reverberant environment with RT_60 = 650 ms and SNR = 20 dB. The noisy scenario has dominant noise against reverberation, with SNR = 5 dB and RT_60 = 250 ms. Finally, the most challenging scenario is noisy-reverberant, with SNR = 5 dB and RT_60 = 650 ms. In addition, experiments are performed with fixed SNR and variable RT_60, and vice versa, to evaluate the robustness of the proposed method against SNR and RT_60 changes. The simulations are performed in MATLAB version 2019b (MathWorks, Natick, MA, USA) on a PC with an Intel Core i7-7700 CPU (Intel, Santa Clara, CA, USA) at 4.2 GHz and 32 GB RAM. A Hamming window with 60 ms length and 50% overlap is used to prepare frames of the speech signal over which the speaker positions can be assumed fixed, so that sufficient and trustworthy information is provided to the proposed algorithm. The proposed CuNMA-SBGEVD method is compared with the HiGRID [33], PCSF [35], TF-wise SSC [37], and SSM-DNN [36] methods to show the precision and robustness of the estimated locations in comparison with previous works. Also, the mean absolute estimation error (MAEE) criterion between the true (x_k, y_k, z_k) and estimated (x̂_k, ŷ_k, ẑ_k) 3D locations, in cm, is calculated over N_f continuous frames for speaker k, to compare the results in the noisy and reverberant scenarios.
One of the main parts of the proposed system is the SBGEVD block; the final localization results depend directly on the convergence of the DOA values in this part. Figure 10 shows the convergence curve of the SBGEVD algorithm in the 1.8-2 kHz sub-band for the noisy, reverberant, and noisy-reverberant scenarios for the first speaker. As seen in this figure, the SBGEVD function converges to the correct DOA of 59° for the first speaker in all three scenarios. These correct convergences are due to the sub-band processing with spectral estimation in the pre-processing step. Convergence is fastest in the noisy scenario, slower in the reverberant scenario, and slowest in the noisy-reverberant scenario, where both undesirable factors are present at the same time. Also, in the sub-bands containing simultaneous speakers, the DOA does not converge to the correct value, which introduces errors into the estimated location; therefore, such sub-bands are rejected in the next step.
The CDF and PDF calculations form another block in the proposed CuNMA-SBGEVD algorithm, implemented on the estimated DOAs of each sub-band. The CDFs (and finally the PDFs) are calculated to show a robust distribution of the DOAs in the sub-bands. Figure 11 shows the CDFs and PDFs for the sub-bands 0.8-1 kHz and 2-2.5 kHz over three continuous time frames. As shown in Figure 11a, the PDF is close to a Gaussian distribution, meaning that the estimated DOAs in this sub-band are centralized around a specific point (the DOA of the first speaker, 59°); the calculated SD in this condition is small. Figure 11b shows the CDF and PDF of a sub-band containing a mixture of speaker information, where the estimated DOA does not present a correct direction. Consequently, the PDF curve is closer to a uniform distribution and the SD is larger. This means that the estimated DOAs in this sub-band are not trustworthy and will be rejected.
Based on the experiments, the threshold value on the SD for accepting or rejecting the DOAs in a sub-band is ±10°. Therefore, the estimated DOAs of the sub-bands whose PDF has an SD under ±10° are passed to the clustering step, and the rest are rejected. The properly estimated DOAs are thus used in the clustering process to decrease the localization error, and the intersection between the passed DOAs in each cluster represents the 3D position of each speaker. Finally, the last step of the proposed method is K-means clustering jointly with the silhouette criterion for estimating the number of speakers and the 3D SSL. Simulations are implemented on three regions of the speech signal, with single, two, and three simultaneous speakers, to show the superiority of the proposed method. Figure 12 shows the results of the clustering and silhouette criterion for SNR = 20 dB and RT_60 = 250 ms (low levels of noise and reverberation).
Figure 12a represents the approved DOAs from the thresholding on the sub-band PDFs. The K-means algorithm is implemented on all of these data, and the silhouette criterion is used to select the best K for each region of the data. Figure 12b-d shows the results of the silhouette criterion in the time domain. Figure 12b represents the MSV curve for the first region (left side) of Figure 12a, which indicates the existence of just one speaker because there is no outstanding maximum at any K value. The 3D source location estimated by this clustering and the intersection between the DOAs is (51, 229, 168) cm, with a 12.88 cm error with respect to the correct location (60, 220, 170) cm.
This process is iterated for the second region (center) of Figure 12a, whose MSV curve is shown in Figure 12c. The silhouette criterion is then applied to the last region of Figure 12a; as shown in Figure 12d, the MSV curve has its maximum at K = 2, indicating the existence of two speakers (speakers 1 and 2) in this region. The speakers' locations are estimated by the intersections between the DOAs in each cluster, giving (58, 227, 161) cm and (302, 253, 181) cm for the first and second speakers, respectively. The correct locations of these two speakers are (60, 220, 170) cm (S1) and (310, 245, 175) cm (S2), so the estimated locations have errors of 11.57 cm and 12.8 cm, respectively.
Table 2 shows the results of the proposed CuNMA-SBGEVD algorithm in comparison with the HiGRID, PCSF, TF-wise SSC, and SSM-DNN methods in the reverberant, noisy, and noisy-reverberant scenarios, on real and simulated data, for two simultaneous speakers. The MAEE criterion, in cm, is used to measure the error in estimating the 3D speaker locations over 20 continuous frames. Based on the MAEE values in this table, the noisy environment gives better results than the two other scenarios. Also, the simulated data have a lower MAEE than the real data because of the better control of the environmental parameters (noise and reverberation). In addition, the proposed CuNMA-SBGEVD has better accuracy than the previous works in most of the scenarios. In scenario 3, on real data, the SSM-DNN method has a slightly smaller MAEE, but this does not extend to the other scenarios.

Table 2. The mean absolute estimation error (MAEE) comparison between the proposed cuboids nested microphone array (CuNMA)-subband generalized eigenvalue decomposition (SBGEVD) and the hierarchical grid (HiGRID), perpendicular cross-spectra fusion (PCSF), time-frequency wise spatial spectrum clustering (TF-wise SSC), and spectral source model-deep neural network (SSM-DNN) methods for the reverberant, noisy, and noisy-reverberant scenarios on real and simulated data for 2 simultaneous speakers.

Simulated Data
Speaker      S1   S2    S1   S2    S1   S2    S1   S2    S1   S2
Scenario 1   51   53    46   50    45   49    43   44    36   38
Scenario 2   40   44    37   39    34   38    32   36    25   29
Scenario 3   59   65    55   61    54   55    48   51    41   43

Real Data

Speaker      S1   S2    S1   S2    S1   S2    S1   S2    S1   S2

Figure 13 shows the averaged MAEE results of the proposed CuNMA-SBGEVD in comparison with the HiGRID, PCSF, TF-wise SSC, and SSM-DNN methods over different ranges of SNR and RT_60, on real and simulated data, for 2 simultaneous speakers. Figure 13a represents the averaged MAEE for SNR = 5 dB and 0 < RT_60 < 700 ms. As seen, the proposed method has a lower MAEE than the other previous works on both real and simulated data. For example, at RT_60 = 700 ms, the averaged MAEE of the proposed CuNMA-SBGEVD method on the simulated data is 44 cm, in comparison with 65 cm for HiGRID, 60 cm for PCSF, 56 cm for TF-wise SSC, and 51 cm for SSM-DNN, which shows the higher accuracy of the proposed method. Also, Figure 13b shows these results for RT_60 = 650 ms and −10 < SNR < 20 dB. As shown, the proposed method has better accuracy in all conditions in comparison with the other works. For example, at SNR = 5 dB, the averaged MAEE is 42 cm for the proposed CuNMA-SBGEVD, in comparison with 62 cm for HiGRID, 58 cm for PCSF, 54 cm for TF-wise SSC, and 49 cm for SSM-DNN on simulated data, thus showing the superiority of the proposed method. Table 3 represents the MAEE results of the proposed CuNMA-SBGEVD method in comparison with HiGRID, PCSF, TF-wise SSC, and SSM-DNN for 3 simultaneous speakers in the reverberant, noisy, and noisy-reverberant scenarios on real and simulated data. The results are obtained from 20 continuous frames of overlapped speech signals and show the accuracy of the proposed method in 3D localization in comparison with the previous works. Also, the results in the noisy scenario are better than in the reverberant and noisy-reverberant conditions for all methods.
In addition, localization accuracy is higher for speakers closer to the CuNMA because of the higher signal power and lower reverberation. The results on simulated data are better than on real data because the undesirable conditions can be controlled more precisely in simulation.

Table 3. MAEE results (cm) for three simultaneous speakers; column groups follow the order used in the accompanying discussion (HiGRID, PCSF, TF-wise SSC, SSM-DNN, proposed CuNMA-SBGEVD), with S1/S2/S3 denoting the three speakers.

Simulated Data
             HiGRID     PCSF       TF-wise SSC   SSM-DNN    CuNMA-SBGEVD
Scenario 1   55/54/59   53/56/58   48/50/54      43/48/47   41/39/44
Scenario 2   43/46/47   39/43/44   39/42/40      34/37/39   28/29/32
Scenario 3   65/67/64   58/63/61   57/56/59      54/56/54   39/42/49

Real Data
Scenario 1   65/63/69   59/55/61   53/51/56      45/53/50   38/45/43
Scenario 2   45/47/51   43/44/46   38/42/44      35/42/43   29/34/35
Scenario 3   68/71/72   59/65/68   59/65/61      53/57/59   42/51/50

Figure 14 presents the averaged MAEE results for the proposed CuNMA-SBGEVD method and the HiGRID, PCSF, TF-wise SSC, and SSM-DNN algorithms for three simultaneous speakers over various ranges of SNR and RT60. Figure 14a shows the averaged MAEE for SNR = 5 dB and 0 < RT60 < 700 ms. As seen, the proposed method achieves high accuracy on real and simulated data in comparison with the previous works. For example, at RT60 = 700 ms, the averaged MAEE of the proposed CuNMA-SBGEVD method on the simulated data is 47 cm, compared with 69 cm for HiGRID, 64 cm for PCSF, 59 cm for TF-wise SSC, and 56 cm for SSM-DNN. In addition, Figure 14b presents the results for RT60 = 650 ms and −10 < SNR < 20 dB; this experiment evaluates the effect of noise on the proposed SSL method. As shown, the proposed method has the lowest averaged MAEE and the highest SSL precision among the compared works. For example, at SNR = 5 dB, the averaged MAEE of the proposed CuNMA-SBGEVD on simulated data is 45 cm, compared with 65 cm for HiGRID, 62 cm for PCSF, 55 cm for TF-wise SSC, and 53 cm for SSM-DNN.
Table 4 compares the computational complexity of the proposed CuNMA-SBGEVD with that of the previous works, measured as MATLAB run-time in seconds. The experiments are performed on real data for two and three simultaneous speakers. As seen, the HiGRID method has the highest computational complexity because it searches candidate locations throughout the indoor environment. The PCSF and SSM-DNN algorithms have lower complexities than HiGRID, but SSM-DNN still has a relatively high complexity because of the training and testing steps in the DNN structure. The complexity of the proposed method is similar to that of the TF-wise SSC algorithm, and in some conditions it is lower. This is explained by the spectral estimation block, which removes the undesirable spectral content of the speech signal, and by the elimination of improper DOAs through the SD decision on the PDFs. Overall, the complexity of the proposed method is lower than that of the other previous works.
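The run-times in Table 4 are wall-clock measurements. As a hedged sketch of such a measurement procedure (our own helper, not the paper's MATLAB code), an equivalent timing harness would be:

```python
import time

def mean_runtime_seconds(fn, *args, repeats=5):
    """Average wall-clock run-time of fn(*args) over several repeats."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats
```

Averaging over several repeats smooths out scheduler jitter, in the spirit of MATLAB's timeit, so that run-time comparisons between localization methods are less sensitive to a single noisy measurement.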
Electronics 2020, 9, 867

In summary, the evaluation results on real and simulated data for two and three simultaneous speakers show the superiority of the proposed method over the previous works.
The high accuracy, low computational complexity, and robustness of the CuNMA-SBGEVD algorithm make the proposed method well suited to 3D localization and to estimating the number of speakers in real conditions.

Conclusions
Multiple simultaneous SSL of overlapped speech signals is one of the main challenges in speech signal processing, and noise and reverberation, as undesirable environmental factors, reduce the accuracy of localization algorithms. Some methods localize the speakers based on energy and others based on TDOAs; the localization method is selected according to its accuracy and computational complexity. In addition, a microphone array increases the accuracy of SSL algorithms through information redundancy, but spatial aliasing caused by improper inter-microphone distances decreases it. In this article, a cuboids nested microphone array was first proposed that eliminates spatial aliasing by providing proper inter-microphone distances in all microphone pairs and supplies high-quality signals to the SSL algorithm. In most conditions, the speech spectrum components are concentrated in specific frequency bands, while the other bands carry little useful information. Therefore, the Welch spectral estimation method was used to keep the components with sufficient spectral content and to eliminate the rest, so that improper information is removed from the localization procedure. Speech signals carry more information in the low-frequency components than in the high frequencies, so the wavelet transform was adopted for sub-band processing in the proposed SSL algorithm; the low-frequency components of the speech signal are thereby analyzed more finely. The GEVD algorithm is then applied in each sub-band to estimate the DOAs for every nested microphone pair. Sub-bands containing only one speaker yield DOA values concentrated around a specific point, whereas sub-bands containing multiple speakers do not show any specific distribution of estimated DOAs.
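The per-sub-band GEVD step can be sketched as follows. This is a simplified narrowband, two-microphone version with assumed parameters (spacing d, speed of sound c, sub-band centre frequency f0), not the paper's full multi-pair implementation: the generalized eigenvalue problem Rxx v = λ Rnn v is solved for the sub-band covariance against a noise covariance, and the phase of the principal generalized eigenvector gives the TDOA and hence the DOA.

```python
import numpy as np

def gevd_doa_deg(x, f0, d=0.04, c=343.0, Rnn=None):
    """DOA of the dominant source for one microphone pair in one sub-band.

    x   : (2, N) complex analytic sub-band signals of the pair
    f0  : sub-band centre frequency in Hz
    d   : microphone spacing in metres (assumed value)
    Rnn : (2, 2) noise covariance; identity if no noise estimate is given
    """
    Rxx = x @ x.conj().T / x.shape[1]                # sub-band covariance
    if Rnn is None:
        Rnn = np.eye(2)
    # generalized eigenvalue problem Rxx v = lambda Rnn v
    w, V = np.linalg.eig(np.linalg.solve(Rnn, Rxx))
    v = V[:, np.argmax(w.real)]                      # principal gen. eigenvector
    tau = -np.angle(v[1] / v[0]) / (2 * np.pi * f0)  # phase difference -> TDOA
    return float(np.degrees(np.arcsin(np.clip(c * tau / d, -1.0, 1.0))))
```

For a pair spaced d apart, a source at angle θ produces a TDOA of d·sin(θ)/c, so the arcsin maps the estimated TDOA back to θ. In the proposed method this estimation runs on every nested microphone pair and every sub-band, and the resulting DOAs feed the PDF and SD selection stage described next.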
Therefore, the PDF of the DOAs is calculated in each sub-band; the DOAs of sub-bands whose SD is smaller than the threshold are retained, and the others are discarded. Finally, K-means clustering with the silhouette criterion is used to classify the DOAs (estimating the number of speakers), and the 3D speaker locations are estimated from the intersections of the DOAs within each cluster. The accuracy and computational complexity of the proposed CuNMA-SBGEVD method were compared with those of the HiGRID, PCSF, TF-wise SSC, and SSM-DNN algorithms on real and simulated data for two and three simultaneous speakers, and the results show the superiority of the proposed method over the previous works. As reported in the article, MATLAB was used for the experiments on the real and simulated data, and Table 4 shows that the implemented methods are still far from real-time operation. MATLAB was chosen for the implementations because it is user friendly, offers many existing functions, and eases the preparation of the figures and tables in the results section; alternative languages such as Python and C, or implementation on digital signal processor hardware, are more suitable options for real-time operation. The idea for future work is to implement the method on such hardware so that it becomes usable in real-time applications.
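The final clustering stage can be sketched as follows. This is a numpy-only toy with 1D DOA values, quantile initialization, and a hand-rolled silhouette score (our simplifications, not the paper's implementation): K-means is run for each candidate cluster count, and the silhouette criterion selects the count with the best separation, taken as the number of speakers.

```python
import numpy as np

def kmeans_1d(x, k, iters=100):
    """Simple 1-D K-means with quantile initialization."""
    centers = np.quantile(x, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        centers = np.array([x[labels == j].mean() if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

def silhouette(x, labels):
    """Mean silhouette coefficient for a 1-D clustering."""
    scores = []
    for i, xi in enumerate(x):
        own = x[labels == labels[i]]
        a = np.abs(own - xi).sum() / max(len(own) - 1, 1)   # intra-cluster
        b = min(np.abs(x[labels == j] - xi).mean()          # nearest other
                for j in set(labels.tolist()) if j != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def estimate_speakers(doas, k_max=5):
    """Pick the cluster count (number of speakers) with the best silhouette."""
    doas = np.asarray(doas, dtype=float)
    best_k = max(range(2, k_max + 1),
                 key=lambda k: silhouette(doas, kmeans_1d(doas, k)[0]))
    labels, centers = kmeans_1d(doas, best_k)
    return best_k, np.sort(centers)
```

Well-separated DOA groups give a silhouette near 1 at the true speaker count and noticeably lower values when clusters are merged or split, which is what makes the criterion a usable speaker-count estimator; the cluster centers then serve as the per-speaker DOAs whose intersections yield the 3D locations.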