Sound Source Localization Based on Multi-Channel Cross-Correlation Weighted Beamforming

Beamforming and its applications in steered-response power (SRP) technology, such as steered-response power delay and sum (SRP-DAS) and steered-response power phase transform (SRP-PHAT), are widely used in sound source localization. However, their resolution and accuracy still need improvement. A novel beamforming method combining SRP and the multi-channel cross-correlation coefficient (MCCC), SRP-MCCC, is proposed in this paper to improve the accuracy of direction-of-arrival (DOA) estimation. A direction weight (DW) is obtained by calculating the MCCC. Based on the DW, the non-incoming wave directions are suppressed and the incoming wave direction is enhanced, improving the beamforming capability. Then, sound source localization based on a dual linear array was simulated under different conditions. Compared with SRP-PHAT, SRP-MCCC offers higher positioning accuracy, stronger spatial directivity and better robustness across different signal-to-noise ratios (SNRs). When the SNR is −10 dB, the average positioning error of a single-frequency sound source at different coordinates decreases by 5.69%, and that of mixed-frequency sound sources at the same coordinate decreases by 5.77%. Finally, experimental verification was carried out. The results show that the average error of SRP-MCCC is reduced by 8.14% and the positioning accuracy is significantly improved, which is consistent with the simulation results. This research provides a new idea for further engineering applications of sound source localization based on beamforming.


Introduction
Sound source localization techniques have a wide range of application prospects in civil and military systems, such as intelligent medical systems, security monitoring and sonar detection [1][2][3][4][5]. Existing sound source localization techniques can be divided into subspace, time delay estimation and beamforming methods. The subspace approach uses the orthogonality between the signal and noise subspaces to determine the waveform direction, and includes multiple signal classification (MUSIC) and estimation of signal parameters via rotational invariance techniques (ESPRIT) [6,7]. Its performance depends heavily on the covariance matrix estimation, which is influenced by the signal-to-noise ratio (SNR). Hu et al. proposed an improved MUSIC algorithm to calculate the spatial spectrum and achieve azimuth estimation [8]. Herzog et al. developed a novel EB-ESPRIT method to estimate the incoming wave direction of a sound source [9]. The direction estimation accuracy of studies [8] and [9] is higher than that of MUSIC and ESPRIT, respectively; however, the accuracy is still significantly reduced at lower SNRs. The time delay estimation method locates the source based on the arrival time difference [10]. In [11], sound source localization in an indoor environment is achieved by using two dual-element arrays to estimate the time delay. However, the time delay estimation method is susceptible to noise.
When the number of array elements is increased, there is redundant information in the signals. The multi-channel cross-correlation coefficient (MCCC) method in [12] can improve the robustness of delay estimation through multi-element array signals. Beamforming obtains the direction of the sound source by summing weighted array signals and is classified into frequency domain beamforming and time domain beamforming [13,14]. In frequency domain beamforming, many methods have been proposed to improve performance [15,16]. Reference [15] improved the robustness by adding norm constraints and spatial smoothing techniques. Reference [16] proposed nested arrays to improve beamforming performance. Time domain beamforming is compared with frequency domain beamforming in [17], and the two methods have similar performance. Time domain beamforming is a naturally broadband method suitable for both single-frequency and multi-frequency signals, and it does not require repeated processing at multiple frequencies. In time domain beamforming, the steered-response power (SRP) is commonly used. Steered-response power delay and sum (SRP-DAS) is used for direction estimation based on microphone arrays. The steered-response power phase transform (SRP-PHAT) algorithm is an optimization of SRP-DAS; it is easy to implement and, by whitening the signals, is more robust than SRP-DAS [18,19]. As a time domain beamforming method, SRP-PHAT has been widely used in target tracking and distributed localization [20][21][22]. Salvati et al. reported a modified SRP-PHAT algorithm that speeds up the operation [23]. However, the directivity of SRP-PHAT is not outstanding in azimuth estimation, and the localization accuracy still needs further improvement.
Therefore, combining the advantages of MCCC and SRP, a new beamforming method (SRP-MCCC) is proposed in this paper. In this method, the direction weight (DW) is calculated from the MCCC and used to adjust the SRP value, which enhances the directivity of the microphone arrays and improves the spatial resolution. Section 2 describes the positioning principle of the proposed method. Then, the sound source localization simulation is presented in Section 3. Section 4 verifies the performance of the proposed method under experimental conditions. Finally, conclusions are given in Section 5. Figure 1 shows the calculation flow of the position. Suppose the number of arrays is M and the number of elements in each array is N. For each array, the MCCC is evaluated from the N elements' signals, and the DW is constructed by using the redundant information of the MCCC. After obtaining the DW, the weighted beamforming is performed, and finally, the relative directions (θ_I, θ_II, …, θ_M) are found. When M ≥ 2, the source position can be calculated from (θ_I, θ_II, …, θ_M).

Positioning Principle
Figure 1. Positioning flowchart.

Signal Model
The far-field signal propagation and the N-element linear microphone array model are shown in Figure 2. θ_0 represents the direction from the far-field source to the array, and d_i (i = 1, 2, …, N − 1) denotes the spacing between the i-th and (i + 1)-th array elements. Assume that the source signal is s(n), a time series. The signals received by the N-element linear microphone array can be expressed as X(n) = [x_1(n), x_2(n), …, x_N(n)]^T, with

x_i(n) = α_i s(n − t_i) + v_i(n), i = 1, 2, …, N, (1)

where x_i(n) represents the signal received by the i-th microphone, α_i is the attenuation coefficient of the signal received by the i-th microphone, t_i is the propagation time from the sound source to the i-th microphone and v_i(n) is the noise signal received by the i-th microphone.

Taking the first microphone as the reference, the aligned signal y(n, τ) can be written as

y(n, τ) = [s'_1(n − τ_1), s'_2(n − τ_2), …, s'_N(n − τ_N)]^T, (2)

where s'_i(n − τ_i) = α'_i x_i(n) represents the i-th signal after alignment. τ_i (i = 1, 2, …, N) is the time delay from the i-th microphone to the first microphone; its value is related to the position of the sound source and the microphone array structure, with τ_1 = 0. α'_i is the relative attenuation coefficient, and it is calculated by

α'_i = α_1 / α_i. (3)

For a linear array, the following relationship exists between τ_i and d_n (the distance between the n-th and (n + 1)-th microphones):

τ_i = (sin θ_0 / c) Σ_{n=1}^{i−1} d_n, (4)

where c is the speed of sound.
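As a concrete illustration, the signal model of Equation (1) can be simulated numerically. The sketch below is not the authors' code: the function name, sampling rate, seedable noise generator and the far-field simplification α_i = 1 are our own assumptions, and fractional delays are applied as frequency-domain phase shifts.

```python
import numpy as np

def simulate_array(s, theta_deg, d=0.043, N=16, c=343.0, fs=48000, snr_db=10, rng=None):
    """Simulate far-field signals on an N-element uniform linear array.

    Each channel follows x_i(n) = alpha_i * s(n - t_i) + v_i(n), with
    alpha_i = 1 (far-field simplification). The relative delay of element
    i is tau_i = i * d * sin(theta) / c, applied as a linear phase shift.
    """
    rng = np.random.default_rng(rng)
    taus = np.arange(N) * d * np.sin(np.deg2rad(theta_deg)) / c
    n = len(s)
    S = np.fft.rfft(s)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    X = np.empty((N, n))
    for i, tau in enumerate(taus):
        # Delay by tau: multiply the spectrum by exp(-j*2*pi*f*tau).
        X[i] = np.fft.irfft(S * np.exp(-2j * np.pi * freqs * tau), n)
    # Add white Gaussian noise v_i(n) at the requested SNR.
    sig_pow = np.mean(X ** 2)
    noise_pow = sig_pow / (10 ** (snr_db / 10))
    X += rng.normal(scale=np.sqrt(noise_pow), size=X.shape)
    return X, taus
```

The returned delays `taus` are exactly the τ_i of Equation (4) for equal spacing d_n = d.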

Direction Weight
The Pearson coefficient of the normalized signals is used to replace the value of the correlation function [24,25]. The correlation coefficient matrix for the time delays τ is

ρ(τ) = [ρ_ij]_{N×N}, (5)

where ρ_ij is the correlation coefficient between the i-th and the j-th aligned signals, which can be expressed as

ρ_ij = ⟨s'_i(n − τ_i), s'_j(n − τ_j)⟩ / √(σ_i σ_j), (6)

where σ_i = ∑ x_i² is the energy of the i-th signal and ⟨·, ·⟩ indicates the inner product. Substituting the signal model, ρ_ij decomposes into

ρ_ij ∝ r_ij + r_{i v_j} + r_{j v_i} + r_{v_i v_j}. (7)

Observing Equation (7), it can be deduced that r_ij is the correlation coefficient between the i-th signal and the j-th signal, r_{i v_j} represents the correlation coefficient between the i-th signal and the j-th noise, r_{j v_i} is the correlation coefficient between the j-th signal and the i-th noise, and r_{v_i v_j} is the correlation coefficient between the i-th noise and the j-th noise. It can be seen that when s_i(n − τ_i) = s_j(n − τ_j), r_ij has its maximum value. The direction weight h_DW(τ) is constructed by using the redundant information of the MCCC. h_DW(τ) is calculated from the correlation coefficient matrix, which is directly related to the signal itself, and it can be represented as

h_DW(τ) = 1 − det ρ(τ). (8)
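The direction weight can be sketched numerically as follows. We assume here the common MCCC form h_DW = 1 − det ρ; the function and variable names are ours, not from the paper.

```python
import numpy as np

def direction_weight(X_aligned):
    """Direction weight from the multi-channel correlation matrix (sketch).

    X_aligned: (N, L) array of delay-compensated channel signals for one
    candidate steering delay. Builds the Pearson correlation matrix rho
    and returns h_DW = 1 - det(rho). When all channels are well-aligned
    copies, rho approaches an all-ones matrix, det(rho) -> 0 and
    h_DW -> 1; for mutually uncorrelated channels det(rho) -> 1 and
    h_DW -> 0.
    """
    rho = np.corrcoef(X_aligned)  # N x N Pearson correlation coefficients
    return 1.0 - np.linalg.det(rho)
```

This behavior is exactly what the weighting needs: steering delays that align the channels (the incoming wave direction) produce a weight near 1, while mismatched delays produce a weight near 0.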

Weighted Beamforming
The weighted beamforming process of a single array after obtaining h_DW(τ) is shown in Figure 3. First, the compensation time delay τ_i (i = 1, 2, …, N) is applied to the signals x_i(t) received by array i_A (i_A = 1, 2, …, M) to obtain the compensated signals x_i(t − τ_i). Then, the weighted signals h_DW(τ)x_i(t − τ_i) are calculated by multiplying x_i(t − τ_i) and h_DW(τ). Subsequently, the spatial summation of h_DW(τ)x_i(t − τ_i) at the same instant is performed, and the time integration of all signals at different times is carried out. Finally, D_{i_A}(θ) (the beam output of array i_A) is obtained. Since h_DW(τ) contains spatial information, it can enhance the incoming wave direction and suppress the non-incoming wave directions, which improves the directional ability of D_{i_A}(θ).

Figure 3. Weighted beamforming process of a single array i_A.

D_{i_A}(θ) can be expressed as

D_{i_A}(θ) = ∫ [ Σ_{i=1}^{N} h_DW(τ) x_i(t − τ_i) ]² dt, (9)

where τ_i is determined by the steering direction θ through Equation (4) and c is the speed of sound in the environment. When θ = θ_0, D_{i_A}(θ) has a maximum value; at this time, θ_0 is the direction of the sound source.
The frequency domain expression can be written as

D_{i_A}(θ) = (1/2π) ∫ h²_DW(τ) Σ_{i=1}^{N} Σ_{j=1}^{N} X_i(e^{jω}) X_j*(e^{jω}) e^{jω(τ_j − τ_i)} dω, (10)

where X_i(e^{jω}) and X_j(e^{jω}) represent the Fourier transforms of the signals x_i(t) and x_j(t), respectively. As the number of arrays is M, the coordinates (x, y) of the sound source can be obtained by the following equation:

(x_0, y_0) = argmax_{(x, y)} D_all(x, y), D_all(x, y) = Σ_{i_A=1}^{M} D_{i_A}(f_{i_A}(x, y)), (11)

where f_{i_A}(x, y) = θ_{i_A} represents the direction of the source coordinate (x, y) relative to the normal direction of array i_A; that is, f_{i_A}(x, y) is a mapping from the coordinate space to the direction space. When D_all is maximized, the corresponding (x_0, y_0) is the most probable position of the sound source.
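The weighted beam scan for one array, i.e. the combination of Equations (8) and (9), can be sketched as below. Delays are applied as frequency-domain phase shifts; the function name and defaults (48 kHz sampling, c = 343 m/s, 1° angular grid) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def srp_mccc_spectrum(X, d=0.043, c=343.0, fs=48000, thetas=None):
    """Weighted steered-response scan of one uniform linear array (sketch).

    For each candidate angle theta, the channels are delay-compensated via
    frequency-domain phase shifts, the direction weight
    h_DW = 1 - det(rho) is computed from the aligned signals, and the
    delay-and-sum output power is scaled by h_DW.
    """
    if thetas is None:
        thetas = np.linspace(-90.0, 90.0, 181)
    N, L = X.shape
    freqs = np.fft.rfftfreq(L, 1.0 / fs)
    Xf = np.fft.rfft(X, axis=1)
    power = np.empty(len(thetas))
    for k, th in enumerate(thetas):
        taus = np.arange(N) * d * np.sin(np.deg2rad(th)) / c
        # Advance channel i by tau_i to undo the assumed propagation delay.
        aligned = np.fft.irfft(Xf * np.exp(2j * np.pi * freqs * taus[:, None]), L, axis=1)
        h_dw = 1.0 - np.linalg.det(np.corrcoef(aligned))  # direction weight
        power[k] = h_dw * np.mean(aligned.sum(axis=0) ** 2)  # weighted SRP
    return thetas, power
```

The estimated direction is the angle at which the weighted steered-response power peaks; running one such scan per array yields the θ_{i_A} combined in Equation (11).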

Simulation Analysis
We used two sets of equally spaced linear arrays to simulate the sound source localization. The coordinates of the i-th microphone of array I are ((i − 1) × d, 0) (i = 1, 2, 3, …, 16), and the coordinates of the i-th microphone of array II are ((i − 1) × d, 10). d is the distance between two adjacent microphones, d = 0.043 m. D is the distance between array I and array II, D = 10 m. The sound source and array locations are shown in Figure 4. θ_1 and θ_2 are the sound source directions relative to array I and array II, respectively. Taking the array normal as the initial side and the incoming wave direction as the terminal side, θ_1 and θ_2 are positive counterclockwise and negative clockwise, with θ_1, θ_2 ∈ [−90°, 90°). The center of each array is used as the reference point, and the sound source coordinates (x, y) can be calculated by the following:

y / (x − 7.5d) = tan(θ_1 + 90°)
(y − D) / (x − 7.5d) = tan(θ_2 − 90°) (12)

Figure 5 shows the direction estimation results of the SRP-PHAT and the SRP-MCCC when the source signal is a 600 Hz sine signal with source coordinates (14, 4) and the SNR is −5 dB. The brightness of a point represents the probability that the point is the location of the sound source. Figure 6 shows the location estimation results of the SRP-PHAT and the SRP-MCCC when the SNR is −5 dB.
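Equation (12) can be solved in closed form for (x, y). A minimal sketch, assuming the dual-array geometry described above (the function name and argument conventions are ours):

```python
import numpy as np

def triangulate(theta1_deg, theta2_deg, d=0.043, D=10.0):
    """Source position from the two array directions via Equation (12).

    Arrays I and II are parallel, separated by D, with common center
    abscissa 7.5*d. Solves
        y / (x - 7.5 d)       = tan(theta1 + 90 deg)
        (y - D) / (x - 7.5 d) = tan(theta2 - 90 deg)
    for (x, y): subtracting the two lines gives
    x = 7.5*d + D / (t1 - t2) and then y = t1 * (x - 7.5*d).
    """
    t1 = np.tan(np.deg2rad(theta1_deg + 90.0))
    t2 = np.tan(np.deg2rad(theta2_deg - 90.0))
    x = 7.5 * d + D / (t1 - t2)
    y = t1 * (x - 7.5 * d)
    return x, y
```

Given the two directions found by the weighted beam scans, this intersection of the two bearing lines yields the source coordinates.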
From Figures 5 and 6, we can observe that the SRP-MCCC has a superior localization convergence capability and can improve the localization accuracy by aggregating the localization results into smaller areas and improving the spatial pointing. To further explore the advantages of this method, simulation analyses of the SRP-PHAT and the SRP-MCCC were conducted under different SNR conditions, and the results are shown in Table 1. As shown in Table 1, when the SNR is high, the localization errors of both methods are comparatively insignificant. As the SNR decreases, the localization error of the SRP-PHAT increases significantly, while that of the SRP-MCCC remains at a low level. When the SNR is reduced from 10 dB to −20 dB, the position error of the SRP-PHAT increases from 1.77% to 14.71%, while the error of the SRP-MCCC only increases from 0.27% to 4.30%. The simulation experiments show that the SRP-MCCC is feasible: the localization effect is excellent, and the robustness is outstanding.

To further explore the method proposed in this paper, the single-frequency signal at different source positions and the mixed-frequency signals at the same coordinate were simulated at SNR = −10 dB. The results are shown in Tables 2 and 3, respectively. We can infer that the average error of the single-frequency signal at different coordinates decreased by 5.69%, and that of the mixed-frequency signals at the same coordinate decreased by 5.77%.

Table 3. Simulation results for different frequencies (SNR = −10 dB, (15, 5)).

Experiment
To verify the feasibility of the SRP-MCCC, a field localization experiment was conducted, as shown in Figure 7. The array structure and the sound source frequency in the experiment are consistent with the simulation, as shown in Figure 4. This paper presents an experimental comparison between the SRP-PHAT and the SRP-MCCC.

The direction estimation and position estimation results of the SRP-PHAT and the SRP-MCCC when the sound source coordinate is (10, 5) are shown in Figures 8 and 9, respectively. From Figures 8 and 9, it can be seen that the spot size of the SRP-MCCC, in both angle and position, is smaller than that of the SRP-PHAT. Therefore, the SRP-MCCC has a stronger angular resolution and can confine the beam width to a smaller range, which leads to more accurate direction and position estimation. Subsequently, localization experiments were conducted for the sound source coordinates (11, 2) and (12, 7.5), and the experimental results are shown in Table 4.
Compared with the SRP-PHAT, the SRP-MCCC has higher accuracy in both direction and position estimation, and the average error is reduced by 8.14%. Therefore, SRP-MCCC has a better localization effect, which is in line with the expected results and verifies the feasibility and superiority of the SRP-MCCC.

Conclusions
A new high-precision beamforming algorithm (SRP-MCCC) is proposed to improve the positioning accuracy by combining SRP and MCCC. A detailed theoretical analysis of the method is presented, and the simulations and experiments verify its feasibility. The results show that the method offers strong robustness and high localization accuracy, and that the SRP-MCCC has better spatial resolution and localization capability than the SRP-PHAT. These results provide a new idea for weighted beamforming algorithms, which is essential for research on high-precision sound source localization in complex environments.

Data Availability Statement:
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest:
The authors declare no conflict of interest.