Passive Acoustic Source Localization at a Low Sampling Rate Based on a Five-Element Cross Microphone Array

Accurate acoustic source localization at a low sampling rate (less than 10 kHz) is still a challenging problem for small portable systems, especially for a multitasking micro-embedded system. A modification of the generalized cross-correlation (GCC) method based on the up-sampling (US) theory is proposed and defined as the US-GCC method, which improves the accuracy of the time delay of arrival (TDOA) and of the source location at a low sampling rate. In this work, the US operation converts an input signal with a given sampling rate into a signal with a higher sampling rate. Furthermore, the optimal interpolation factor for the US operation is derived according to the localization computation time and the standard deviation (SD) of the target location estimates. On the one hand, simulation results show that the absolute errors of the source locations based on the US-GCC method with an interpolation factor of 15 are approximately 1/15 to 1/12 of those based on the GCC method when both methods start from the same sampling rate of 8 kHz. On the other hand, a simple and small portable passive acoustic source localization platform composed of a five-element cross microphone array has been designed and set up in this paper. The experiments on the established platform, which accurately locates a three-dimensional (3D) near-field target at a low sampling rate, demonstrate that the proposed method is workable.


Introduction
Passive acoustic source localization has been extensively investigated in the last two decades. Time delay of arrival (TDOA)-based methods are widely used in this area for their simple implementation and low computational complexity [1]. Firstly, TDOA-based methods estimate the time delay between two spatially-distributed microphones. Secondly, the acoustic source location is derived from the corresponding non-linear localization equations according to the TDOA estimates and the geometry of the microphone array. An accurate time delay estimation (TDE) is essential for the good performance of TDOA-based source localization, since any error in the TDE leads to a large error in the target location estimate [2]. In general, the generalized cross-correlation (GCC) method is applied to estimate the TDOA. In practice, the performance of time delay estimation based on the GCC method depends on the sampling rate [3]; namely, a high sampling rate [4-6] contributes to a high localization accuracy. However, with the development of integrated circuit, electronic and computer technology, some localization systems (e.g., wearable localization systems, hearing aids and human-computer interaction devices) tend to be small and portable. For these portable systems and micro-embedded systems [7,8], it is challenging to improve the localization accuracy by increasing the sampling rate because of the limitations of system size, hardware and power consumption. Hence, it is important that sampling rate conversion (SRC) be exploited to improve localization accuracy at a low sampling rate. Within the SRC field, the up-sampling (US) method increases the original sampling rate of the input signal. In this sense, to achieve accurate localization at a low sampling rate, a modification of the GCC method based on the US theory is proposed and defined as the US-GCC method.
The US theory is used to interpolate the signals sampled at the low sampling rate. The TDOA between the interpolated output signals is then estimated by the GCC method. Choosing a reasonable interpolation factor is the crucial problem for the US theory. Increasing the interpolation factor raises the sampling rate and thereby improves the TDOA estimation and the localization accuracy. At the same time, the localization computation time and the required storage space grow with the interpolation factor. Therefore, for the near-field acoustic source considered in this paper, the optimal interpolation factor is selected according to the localization computation time and the standard deviation (SD) of the target location estimates.
The remainder of this paper is organized as follows. In Section 2, a new localization algorithm for a low sampling rate is described based on the GCC method and the US theory, and the optimal interpolation factor for the US operation is discussed and selected. In Section 3, for evaluation purposes, the localization errors of the GCC method and the proposed method are compared by simulation. Furthermore, it is shown that the established simple portable passive acoustic source localization platform can perform accurate acoustic source localization at a low sampling rate. In Section 4, the paper is completed with some concluding remarks.

Proposed Localization Algorithm Based on the Generalized Cross-Correlation Method and Up-Sampling Theory
Generally, the TDE for one pair of microphones is obtained using the GCC method [9]. Let $x_1(n)$ and $x_2(n)$ be the signals received at Microphone 1 and Microphone 2, and let $\tau_i$ ($i = 1, 2$) be the propagation time from the acoustic source to Microphone $i$, so that $\tau_1 - \tau_2$ is the time delay between the signals arriving at Microphone 1 and Microphone 2. The cross-correlation function between $x_1(n)$ and $x_2(n)$ is written as:
$$R_{x_1x_2}(\tau) = E\left[x_1(n)\,x_2(n-\tau)\right] \tag{1}$$
According to the characteristics of the cross-correlation function, $R_{x_1x_2}(\tau)$ ideally exhibits a prominent peak when $\tau = \tau_1 - \tau_2$. That is, the time delay estimate $\tau_m$ is obtained by maximizing the cross-correlation function defined by Equation (1):
$$\tau_m = \arg\max_{\tau} R_{x_1x_2}(\tau) \tag{2}$$
In practice, to sharpen the cross-correlation peak and limit the impact of noise and reverberation, the cross-correlation function is transformed into the cross-correlation spectrum through the Fourier transform. A weighting function is then applied to the cross-correlation spectrum. Finally, through the inverse Fourier transform of the weighted cross-correlation spectrum, the generalized cross-correlation function is defined as:
$$R^{g}_{x_1x_2}(\tau) = \int_{-\infty}^{\infty} \Psi_{x_1x_2}(f)\,G_{x_1x_2}(f)\,e^{j2\pi f\tau}\,df \tag{3}$$
where $G_{x_1x_2}(f)$ is the cross-correlation spectrum density function, $\Psi_{x_1x_2}(f)$ is a weighting function and $f$ is the frequency variable. Among the many available weighting functions, one commonly used in acoustic event localization is the phase transform (PHAT), which is usually considered useful in reverberant conditions [10] and has low computational complexity and a higher recognition rate [3,4].
It can be described with the following equation:
$$\Psi_{x_1x_2}(f) = \frac{1}{\left|G_{x_1x_2}(f)\right|} \tag{4}$$
Inserting Equation (4) into Equation (3), the TDOA estimate for each microphone pair $(i, j)$ is computed as follows:
$$\tau_{ij} = \arg\max_{\tau} \int_{-\infty}^{\infty} \frac{G_{x_ix_j}(f)}{\left|G_{x_ix_j}(f)\right|}\,e^{j2\pi f\tau}\,df \tag{5}$$
However, due to the discretization of the input signal, the TDE obtained from Equation (5) must be converted into Equation (6):
$$\tau_{ij} = \Delta n_{ij}\,T = \frac{\Delta n_{ij}}{F} \tag{6}$$
where $\tau_{ij}$, $\Delta n_{ij}$, $T$ and $F$ are the TDOA estimate, the lag in sampling points, the sampling period and the sampling frequency, respectively. It is clear that increasing the sampling frequency $F$ reduces the error of the TDOA estimate $\tau_{ij}$. Yet, for a small portable system, especially a multitasking micro-embedded system, improving the localization accuracy by means of a high sampling rate is quite difficult because of the limitations of system size, hardware and power consumption. In the SRC field [11,12], the US theory is usually used to increase the sampling rate of the input signal. Therefore, in order to achieve accurate localization at a low sampling rate, a modification of the GCC method based on the US theory is proposed and defined as the US-GCC method.
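To illustrate the PHAT-weighted GCC and the sample-to-time conversion described above, the following is a minimal Python sketch. It is not the authors' implementation: the function names `dft` and `gcc_phat`, the naive O(N^2) transform and the synthetic noise burst are assumptions chosen only to keep the example self-contained. The estimate is an integer lag, so its resolution is one sampling period.

```python
import cmath
import math
import random


def dft(x, inverse=False):
    """Naive O(N^2) DFT; adequate for the short demo signals used here."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(x1, x2, fs):
    """Return (lag in samples, delay in seconds) by which x1 lags x2."""
    n = len(x1) + len(x2)  # zero-pad to avoid circular wrap-around
    X1 = dft(list(x1) + [0.0] * (n - len(x1)))
    X2 = dft(list(x2) + [0.0] * (n - len(x2)))
    cross = [a * b.conjugate() for a, b in zip(X1, X2)]  # cross-spectrum G(f)
    # PHAT weighting 1/|G(f)|: keep the phase, discard the magnitude.
    weighted = [g / abs(g) if abs(g) > 1e-12 else 0j for g in cross]
    r = [v.real for v in dft(weighted, inverse=True)]  # generalized cross-correlation
    peak = max(range(n), key=lambda i: r[i])
    lag = peak if peak < n // 2 else peak - n  # map the index to a signed lag
    return lag, lag / fs


# Hypothetical test data: a noise burst and the same burst delayed by 3 samples.
rng = random.Random(0)
burst = [rng.uniform(-1.0, 1.0) for _ in range(40)]
x2 = [0.0] * 10 + burst + [0.0] * 14
x1 = [0.0] * 13 + burst + [0.0] * 11
lag, delay = gcc_phat(x1, x2, fs=8000)
```

At a sampling frequency of 8 kHz the delay can only be reported in steps of 125 µs, which is the limitation that the US-GCC method targets.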
The US operation with a positive integer interpolation factor $L$ is implemented by equidistantly inserting $L - 1$ zero-valued sample points between two consecutive samples of the input signal, as shown in Equation (7):
$$y(n) = \begin{cases} x(n/L), & n = 0, \pm L, \pm 2L, \ldots \\ 0, & \text{otherwise} \end{cases} \tag{7}$$
where $x(n)$ is the input signal and $y(n)$ is the output signal of the US operation. The US operation yields $y(n)$ with a sampling frequency that is $L$ times that of $x(n)$, namely:
$$F_y = L\,F_x \tag{8}$$
where $F_x$ and $F_y$ are the sampling frequencies of $x(n)$ and $y(n)$, respectively. In terms of the z-transform, the input-output relation is given by Equation (9):
$$Y(z) = X(z^{L}) \tag{9}$$
Substituting $z = e^{j\omega}$ into Equation (9) gives $Y(e^{j\omega}) = X(e^{j\omega L})$, which shows that after the US operation the frequency spectrum of $y(n)$ is an $L$-fold repetition of the frequency spectrum of $x(n)$. In other words, because of the $L$-fold sampling rate expansion, there are $L - 1$ additional images of the frequency spectrum of the input signal. A low-pass filter is therefore employed to remove the $L - 1$ additional images.
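The zero insertion and image-removing low-pass filter described above can be sketched as follows. This is an illustration under stated assumptions, not the filter used by the authors: the Hamming-windowed sinc design, the `half_width` parameter and the function names are hypothetical choices. Because the filter taps vanish at non-zero multiples of $L$, the original samples pass through unchanged and only the inserted zeros are replaced by interpolated values.

```python
import math


def zero_stuff(x, L):
    """Insert L - 1 zeros between consecutive samples (the raw US operation)."""
    y = [0.0] * (len(x) * L)
    y[::L] = x
    return y


def lowpass_taps(L, half_width=8):
    """Hamming-windowed sinc with cutoff pi/L and passband gain L (assumed design)."""
    m = half_width * L
    taps = []
    for k in range(-m, m + 1):
        sinc = 1.0 if k == 0 else math.sin(math.pi * k / L) / (math.pi * k / L)
        window = 0.54 + 0.46 * math.cos(math.pi * k / m)
        taps.append(sinc * window)
    return taps


def lowpass(signal, taps):
    """Zero-phase FIR filtering (the taps are centred on the current sample)."""
    m = len(taps) // 2
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k in range(-m, m + 1):
            if 0 <= n - k < len(signal):
                acc += taps[k + m] * signal[n - k]
        out.append(acc)
    return out


def upsample(x, L, half_width=8):
    """Zero insertion followed by low-pass filtering to remove the L - 1 images."""
    return lowpass(zero_stuff(x, L), lowpass_taps(L, half_width))


# Hypothetical input: a slow sinusoid, up-sampled by a factor of 4.
x = [math.sin(2 * math.pi * 0.05 * n) for n in range(32)]
y = upsample(x, 4)
```

The output has four times as many samples as the input, and the samples between the original points follow the underlying sinusoid to within the accuracy of the windowed filter.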
Based on the US theory and the GCC method, the proposed localization algorithm for a low sampling rate is shown in Figure 1, and the process comprises the following steps.
Step 1: The US operation (interpolation factor $L$) is applied to the discrete signals collected from the microphone array at a sampling rate of less than 10 kHz, which yields output signals with a higher sampling rate.
Step 2: The TDOAs of the output signals from Step 1 are estimated by the GCC method.
Step 3: The acoustic source location is estimated according to the TDE from Step 2 and the geometric model of the microphone array.
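Steps 1 and 2 can be sketched end to end as below. This is a simplified illustration rather than the authors' code: an unweighted time-domain cross-correlation stands in for the GCC weighting, and the test burst, the true delay of 290 µs, the search range and all function names are hypothetical. The sketch compares the delay estimated at 8 kHz with the delay estimated after up-sampling by the factor 15.

```python
import math

FS = 8000            # low sampling rate from the text (Hz)
L = 15               # interpolation factor selected in the paper
TRUE_DELAY = 290e-6  # hypothetical true TDOA (s), not a multiple of 1/FS


def burst(t):
    """Hypothetical band-limited test signal: Gaussian-windowed 1 kHz tone."""
    return math.exp(-((t - 4e-3) / 1e-3) ** 2) * math.cos(2 * math.pi * 1000 * (t - 4e-3))


def upsample(x, factor, half_width=8):
    """Zero-stuff by `factor`, then low-pass with a Hamming-windowed sinc."""
    m = half_width * factor
    taps = [(1.0 if k == 0 else math.sin(math.pi * k / factor) / (math.pi * k / factor))
            * (0.54 + 0.46 * math.cos(math.pi * k / m)) for k in range(-m, m + 1)]
    out = [0.0] * (len(x) * factor)
    for i, sample in enumerate(x):  # each input sample spreads over nearby outputs
        centre = i * factor
        for k in range(-m, m + 1):
            j = centre + k
            if 0 <= j < len(out):
                out[j] += sample * taps[k + m]
    return out


def tdoa(x1, x2, fs, max_delay=500e-6):
    """Delay (s) maximizing sum_n x1[n] * x2[n - lag] within +/- max_delay."""
    max_lag = round(max_delay * fs)

    def corr(lag):
        start, stop = max(0, lag), min(len(x1), len(x2) + lag)
        return sum(x1[n] * x2[n - lag] for n in range(start, stop))

    best = max(range(-max_lag, max_lag + 1), key=corr)
    return best / fs


# Microphone 2 hears the burst directly; Microphone 1 hears it TRUE_DELAY later.
x2 = [burst(n / FS) for n in range(64)]
x1 = [burst(n / FS - TRUE_DELAY) for n in range(64)]

raw = tdoa(x1, x2, FS)                                   # resolution 1/FS = 125 us
refined = tdoa(upsample(x1, L), upsample(x2, L), FS * L)  # resolution 1/(L*FS)
```

In this toy case the raw estimate snaps to the nearest multiple of 125 µs, while the up-sampled estimate lands within one fine sampling period (about 8.3 µs) of the true delay.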

Parameter Analysis
With the US-GCC method, the selection of the interpolation factor is crucial for effectively improving the localization accuracy. If the interpolation factor is too small, the acoustic source location error is not effectively reduced; if it is too large, the computational complexity and the computation time increase. The main results of the interpolation factor analysis are given in the following theorem and inferences.
Theorem for the GCC method: The localization accuracy depends on the sampling rate; namely, a high localization accuracy requires a high sampling rate.
Proof of the theorem: Firstly, the error of $\tau_{ij}$ is given by Equation (10), which is obtained by differentiating Equation (6) in Section 2.1:
$$\delta\tau_{ij} = \frac{\delta\Delta n_{ij}}{F} - \frac{\Delta n_{ij}}{F^{2}}\,\delta F \tag{10}$$
where $\delta\tau_{ij}$, $\delta\Delta n_{ij}$ and $\delta F$ represent the errors of $\tau_{ij}$, $\Delta n_{ij}$ and $F$, respectively. Then, a single speech signal is placed at (0.6 m, 0.7 m, 0. . The adopted sampling frequency ranges from 8 kHz to 320 kHz (step size of 8 kHz), and the noise is 30 dB Gaussian noise. The localization errors based on the GCC method at the different sampling rates are given in Figure 2. Increasing the sampling frequency $F$ reduces the error of the TDOA estimate $\tau_{ij}$ and of the localization results. Thus, the above discussion demonstrates that the theorem holds.
Inference 1: Localization errors decrease rapidly as the sampling rate increases and start to level off when the sampling rate exceeds 100 kHz (as shown in Figure 2). Hence, for a speech signal sampled at 8 kHz according to the G.711 standard, the interpolation factor should be at least 13 so that the up-sampled rate exceeds 100 kHz (13 × 8 kHz = 104 kHz).
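The arithmetic behind Inference 1 can be checked with a short sketch. The helper name `min_interpolation_factor` is hypothetical; it simply returns the smallest integer factor whose up-sampled rate exceeds a target rate.

```python
def min_interpolation_factor(fs_in_hz, fs_target_hz):
    """Smallest integer L such that L * fs_in_hz exceeds fs_target_hz."""
    return fs_target_hz // fs_in_hz + 1


L_min = min_interpolation_factor(8_000, 100_000)  # 8 kHz input, 100 kHz target
fs_up = L_min * 8_000                             # resulting up-sampled rate (Hz)
```

For the 8 kHz input and the 100 kHz level-off point read from Figure 2, this gives a factor of 13 and an up-sampled rate of 104 kHz.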
Inference 2: For the near-field 3D localization based on the proposed method, 15 is the optimal interpolation factor.
Firstly, for a near-field speech signal (sampling rate of 8 kHz) at coordinates ranging from (0.5 m, 0.6 m, 0.7 m) to (3.5 m, 3.6 m, 3.7 m) (step size of 0.1 m), the localization error curves of the US-GCC method with different interpolation factors are shown in Figure 3. The error curve for an interpolation factor of 13 (green curve) varies more strongly and gradually increases, whereas the localization error for an interpolation factor of 15 (red curve) is the smallest among the curves.
Further, the standard deviation (SD) of the acoustic source location based on the proposed method is used to select the optimal interpolation factor. In terms of statistics, the SD is an uncertainty parameter that represents the impact of errors on the estimated results. A lower uncertainty indicates a smaller range of error values, which means a lower error impact on the estimated results and a higher estimation accuracy. When there are more than 10 estimation points, the SD is given by Equation (11) according to the Bessel formula:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} v_i^{2}}{n-1}} \tag{11}$$
where $n$ represents the number of estimates and $v_i$ represents the difference between the true value $x_i$ and the estimated value. When the interpolation factor is set to 13, 14, 15, 16 and 20 in turn, the SD estimates in Equation (12) are obtained by substituting 25 ($n = 25$) estimation points from (0.5 m, 0.6 m, 0.7 m) to (3 m, 3.1 m, 3.2 m) (step size of (0.1 m, 0.1 m, 0.1 m)) into Equation (11). The SD of the estimation result with an interpolation factor of 15 is the minimum among them. Hence, in this paper, 15 is selected as the optimal interpolation factor for near-field 3D localization based on the US-GCC method.
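As a small worked example of the Bessel formula used above, the sketch below computes the SD from the differences between true and estimated values with the $n - 1$ denominator. The function name `bessel_sd` and the three sample values are hypothetical and are not data from the paper.

```python
import math


def bessel_sd(true_values, estimates):
    """SD from residuals v_i = estimate_i - true_i, using the n - 1 denominator."""
    residuals = [e - t for t, e in zip(true_values, estimates)]
    n = len(residuals)
    return math.sqrt(sum(v * v for v in residuals) / (n - 1))


# Hypothetical values (metres), purely for illustration.
true_x = [1.0, 2.0, 3.0]
est_x = [1.1, 1.9, 3.2]
sd = bessel_sd(true_x, est_x)
```

Repeating this computation for each candidate interpolation factor and picking the factor with the smallest SD mirrors the selection procedure described in the text.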

Simulation and Experiment
To verify the feasibility and the superiority of the proposed localization algorithm in Section 2, firstly, localization results and the computation time based on the GCC method and the US-GCC method at a low sampling rate are computed via numerical simulations. Then, localization experiments have been conducted indoors based on the established simple and small portable passive acoustic source localization platform with a five-element cross microphone array (hardware size of the control part: 15.3 cm × 22.5 cm).

Comparison of Localization Result and Computation Time Based on the GCC Method and the US-GCC Method
In this subsection, the simulation parameters are as follows:
(1) Source location (as shown in Figure 4): a single speech signal, recorded by a computer in a quiet environment, is played back through a speaker. The signal is sampled at 8 kHz and is assumed to be collected by a five-element cross microphone array.
(2) Noise model: mutually-independent white Gaussian noise is added to each microphone signal. The signal-to-noise ratio (SNR) is set to 10 dB, 20 dB and 30 dB.
(3) Interpolation factor of the US: 15.
The comparison of the simulation results based on the GCC method and the US-GCC method at a low sampling rate is given in Table 1.
Defining $e_{\mathrm{tradition}}$ and $e_{\mathrm{interp}}$ as the absolute errors of the localization results of the GCC method and the US-GCC method, respectively, the ratio of the absolute errors of the two methods is given by Equation (13). Equation (13) shows that the absolute error of the localization results based on the US-GCC method is 1/15 to 1/12 of that based on the GCC method at the same initial sampling rate. Therefore, the proposed method significantly improves the accuracy of the TDE and, consequently, of the acoustic source location estimated at a low sampling rate.
Here, $\Delta r$ is the absolute distance error, and $(x, y, z)$ and $(x', y', z')$ are the real source location and the source location estimated by the GCC method or the US-GCC method, respectively.
Next, the localization computation times (as shown in Figure 5) of the GCC method and the US-GCC method (with different interpolation factors) are measured on an ARM7 (advanced reduced instruction set computing machine) processor, the LPC2148. The clock frequency is 60 MHz, and the number of sampling points is 3500. The localization computation time of the US-GCC method with an interpolation factor of 15 is 10.825 ms, which is only 8.415 ms more than the 2.410 ms required by the GCC method.

Passive Acoustic Source Localization Platform
The hardware of the established localization platform mainly includes a five-element cross microphone array, a signal preprocessing circuit and an MCU. The five-element cross microphone array receives the acoustic signal. After passing through the amplifier circuit and the filter circuit, the signal is sent through the MCU to the host PC, where the software processes it and displays the localization results.

Five-Element Cross Microphone Array
Localization model of the five-element cross microphone array
The minimum number of microphones required for 3D localization is four. Although more microphones increase the complexity of the localization algorithm, the five-element cross microphone array [13] is employed in this paper because of its higher reliability and accuracy compared with the four-element cross array. The localization model of the five-element cross array is shown in Figure 6. Considering the acoustic source as a point source and the microphone $M_0$ as the reference point, and using distance = time × speed together with the geometrical model of the five-element cross microphone array, the localization equations are given by Equation (14), where $r$ is the distance between the acoustic source and the coordinate origin, $\tau_i$ ($i = 1, 2, 3, 4$) is the time delay between microphones $M_0$ and $M_i$, and $c$ is the sound velocity (in this paper, $c = 340$ m/s), under the assumptions of a constant sound speed for an indoor experiment and near-field source localization [4,14,15]. In addition, the Cartesian coordinates and the spherical coordinates are related by Equation (15). Therefore, for near-field localization, the source location estimates are calculated by substituting Equation (15) into Equation (14), which yields Equation (16). On the one hand, Equation (16) shows that the source location can be estimated once the TDOAs are estimated, and that a larger TDOA error significantly decreases the localization accuracy. On the other hand, based on the above equations, the impact of the array element spacing and of the angle on the accuracy of the source location parameters is analyzed.
Taking the partial derivative of Equation (16) with respect to the TDOA gives Equation (17), from which the distance variance is obtained as Equation (18). Similarly, taking the partial derivative of Equation (16) with respect to the TDOA gives the azimuth angle variance in Equation (19) and the pitch angle variance in Equation (20). Besides the TDOA, the array element spacing $D$ and the signal pitch angle $\theta$ also affect the accuracy of the location parameters. Therefore, assuming a constant TDOA variance ($\sigma_\tau = 0.0001$) in Equations (18)-(20), the relationship between the location parameters and their variances is shown in Figure 7. Figure 7a demonstrates that the target distance variance increases with the pitch angle and that increasing the array element spacing reduces the distance variance. In Figure 7b, increasing the array element spacing and the pitch angle decreases the azimuth angle variance. This further illustrates that the five-element cross microphone array is more advantageous for locating the azimuth angle of a low-altitude target. From Figure 7c, the pitch angle variance is reduced by increasing the array element spacing or decreasing the pitch angle.
Hardware design of the five-element cross microphone array
Five electret microphones are fixed on the four endpoints and the center of a 2 m × 2 m cross-shaped wooden support (as shown in Figure 8). To reduce electromagnetic interference, shielded wire is used to connect the five microphones to the PCB of the preprocessing circuit.
Electret microphones are used in this paper because they are inexpensive, simple in structure, small and lightweight, and because they have a wide frequency response ranging from 20 Hz to 20 kHz and a small transient distortion [16].

Signal Preprocessing Circuit
The signal preprocessing circuit (as shown in Figure 9) is designed to amplify and filter the weak output signals of the five microphones. Because a speech signal generally occupies the frequency range from 300 Hz to 3400 Hz, which is a relatively wide pass band, a low-pass filter and a high-pass filter with cutoff frequencies of 3400 Hz and 300 Hz, respectively, are used to remove noise. Moreover, the second of the two amplifying circuits (total amplification factor: 20) is placed after the two-level filtering circuit.
(a) The electret microphone receives the acoustic signal, converts it into an electric signal and amplifies the converted signal through the field effect transistor (FET); (b) the FET amplifies the signal only when it works in the saturation region, which requires a matching circuit. In general, the resistance of $R_1$ is three to five times the output resistance of the microphone. The measured output resistance of the microphone is about 2 kΩ, so the resistance of $R_1$ is set to 9.1 kΩ.

MSP430F149 MCU
The minimal development board of the TI MSP430F149 can easily be programmed because it carries an RS232 communication module, a reset module and a power module. Hence, it is widely applied as the control core for signal processing. However, the MSP430F149 MCU can only sample and convert a single signal at a time; namely, it cannot sample multiple signals synchronously. Therefore, the system collects the acoustic signals using the alternating sampling mode of the MSP430F149 MCU. This introduces a sampling time delay between adjacent channels, which must be calculated to compensate the TDE. The sampling time delay $T_S$ is defined by Equation (21), where $T_h$ is the hold time, $T_t$ is the conversion time, ADCclk is the ADC12 clock source (8 MHz) and $F_{adc}$ is the frequency of the ADC12 equivalent circuit. The final TDE is then given by Equation (22), where $\tau_{ij}$ is the TDE based on the GCC method.
Figure 9. The signal preprocessing circuit includes two amplifying circuits and a two-level filtering circuit.

Localization Experiment and Discussion
To verify the acoustic source localization capabilities of the constructed localization platform at a low sampling rate, localization experiments are carried out in a room with low reverberation (as shown in Figure 10). The room dimensions are 9 m × 8 m × 3 m (x × y × z). Additionally, considering the different environmental noise sources (fans, PCs, lights, some babble noise from outside, etc.), the noise field can be approximated as a diffuse one.

Figure 10. Passive acoustic source localization experimental platform with a five-element cross microphone array (labeled parts: five-element cross microphone array, speaker box, control core).
The experimental conditions are explained as follows: (1) Array structure and location: a five-element cross microphone array with a spacing of 1 m.
(2) Acoustic source: a single speech signal from a speaker box.
Firstly, the five-element cross microphone array receives the speech signal emitted by the speaker box, which is placed at several different coordinates. Then, after amplification and filtering, the received signals are sent through the MCU to the host PC, where they are processed with the proposed US-GCC method and the localization results are displayed. In addition, endpoint detection of the speech signal and reverberation suppression are applied on the acoustic source localization platform (see Appendices A and B). Finally, the experimental results at a low sampling rate (as shown in Table 2) show that the relative errors of the distance $r$, the azimuth angle $\phi$ and the pitch angle $\theta$ are about 20%, 10% and 20%, respectively, within a certain distance. Meanwhile, the real and estimated locations of the acoustic source (as shown in Figure 11) show that the localization accuracy of the US-GCC method is significantly improved compared with that of the GCC method. In addition, the localization performance of the established platform at a low sampling rate is depicted in Figure 12. In Figure 12b, the azimuth angle $\phi$ has a higher accuracy than both the distance $r$ and the pitch angle $\theta$. The relative errors of the azimuth angle $\phi$ are less than 20%, and the relative errors of both the distance $r$ and the pitch angle $\theta$ range from 20% to 30%. Figure 12c demonstrates that the relative errors of the pitch angle $\theta$ increase with the pitch angle $\theta$. When the pitch angle $\theta$ is less than 70°, its relative errors are about 20%. Figure 12d shows that increasing the array element spacing reduces the localization error.
At the same time, the above experimental results and discussions also validate the mathematical analysis for the five-element cross microphone array model.
Finally, the experimental results of our system are compared with the research results of [17,18] in Tables 3 and 4. From Table 3, the absolute errors of the arrival angle of the two constructed systems are comparable. Meanwhile, reducing the number of microphones in turn reduces the localization accuracy [3]. As shown in Table 4, the absolute errors of the pitch angle of our system increase by only ±0.2° within 2 m compared with the system of [18], which uses an eight-element microphone array. In this sense, the accuracy of the system in this paper basically meets the requirements of 3D near-field localization.
Table 3. Comparison of the experimental results between our system and the system of [17].
Table 4. Comparison of the experimental results between our system and the system of [18].

Conclusions
For a small portable system, especially a multitasking micro-embedded system, a modification of the GCC method based on the US theory is proposed to improve the TDOA accuracy and, consequently, the localization accuracy at a low sampling rate. In addition, for near-field localization, the localization error curves and the computation times of the US-GCC method with different interpolation factors are given in this paper. According to the SD of the location estimates and the localization computation time, the optimal interpolation factor is set to 15. The simulation results show that the absolute error of the localization results based on the US-GCC method with an interpolation factor of 15 is approximately 1/15 to 1/12 of that based on the GCC method at the same initial sampling rate. Finally, the designed and established portable acoustic source localization platform based on the proposed method can perform accurate 3D near-field localization at a low sampling rate, which shows that the US-GCC method with an interpolation factor of 15 can be applied to a small portable system, especially a multitasking micro-embedded system.

Author Contributions
Yue Kan, Fusheng Zha and Pengfei Wang wrote the manuscript and formulated the idea. Mantian Li, Wa Gao and Baoyu Song participated in structuring and editing the manuscript. Hongyu Zhu and Yingcui Liu participated in formulating the idea, the experimental platform and data collection and analysis.

Conflicts of Interest
The authors declare no conflict of interest.

Appendix A
During a silent period of the speech signal, only complicated environmental noise is collected. Thus, endpoint detection is necessary to distinguish silence from speech. In general, the speech signal is non-stationary, but it can be assumed to be stationary over short time scales (from 10 ms to 30 ms). Therefore, the speech signal is divided into overlapping frames, and the frames are windowed using an analysis window function. Relying on this characteristic, the short-time energy and the short-time zero crossing rate [19,20] can be used for endpoint detection.
The short-time energy [19] is defined as:
$$E_n = \sum_{m=n-N+1}^{n} \left[x(m)\,w(n-m)\right]^{2}$$
where $E_n$ represents the energy of the signal $x(m)$, $w(n-m)$ is the window function and $N$ is the window length. In this paper, a Hamming window with a length of 20 ms is employed as the analysis window function. A threshold is set to classify each frame as voice or silence: if the calculated signal energy is lower than the threshold, the frame is classified as silence, whereas if the energy is higher than the threshold, the frame is classified as voice.
In addition, the zero crossing rate (ZCR) counts the number of zero crossings in the speech signal. Voiced segments have a low ZCR compared with unvoiced segments. The short-time zero crossing rate is computed frame by frame [19,20]. The ZCR is very useful for discriminating speech from noise and for determining the start and end of a speech segment. A low ZCR is classified as voice and a high ZCR as silence.
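The frame-based endpoint detection described in this appendix can be sketched as follows. This is an illustrative toy, not the platform's code: the 50% frame overlap, the energy threshold, the synthetic signal (weak noise with a 200 Hz tone standing in for voiced speech) and all names are assumptions.

```python
import math
import random

FS = 8000                 # sampling rate (Hz)
FRAME = FS * 20 // 1000   # 20 ms Hamming window, as in the text (160 samples)
HOP = FRAME // 2          # hypothetical 50% frame overlap
ENERGY_THRESHOLD = 1.0    # hypothetical threshold, tuned to this toy signal


def hamming(n):
    """Hamming analysis window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def short_time_energy(frame, window):
    """Sum of squared windowed samples in one frame."""
    return sum((s * w) ** 2 for s, w in zip(frame, window))


def zero_crossings(frame):
    """Number of sign changes between consecutive samples in one frame."""
    signs = [1 if s >= 0 else -1 for s in frame]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


# Toy signal: weak noise everywhere, with a 200 Hz tone between 0.1 s and 0.2 s.
rng = random.Random(0)
signal = [0.01 * rng.uniform(-1, 1) for _ in range(2400)]
for n in range(800, 1600):
    signal[n] = math.sin(2 * math.pi * 200 * n / FS)

window = hamming(FRAME)
frames = [signal[i:i + FRAME] for i in range(0, len(signal) - FRAME + 1, HOP)]
energy = [short_time_energy(f, window) for f in frames]
zcr = [zero_crossings(f) for f in frames]
labels = ["voice" if e > ENERGY_THRESHOLD else "silence" for e in energy]
```

Frames that lie inside the tone are labeled as voice and show far fewer zero crossings than the noise-only frames, which matches the qualitative behavior described above.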

Appendix B
If the received signals are free of reverberation and are properly filtered, the GCC method reduces to the maximum likelihood time delay estimator [21]. However, in a typical room, there are both direct and reflected speech signals, namely reverberation. The presence of reverberation in the received signals has disastrous effects on the performance of the GCC method. Considering $S(t)$ as a single speech signal, the signals $s_1(t)$ and $s_2(t)$ collected at Microphone 1 and Microphone 2 are given by Equation (B.1), where $\tau_d$ is the time delay of the direct signal between the microphones, $\tau_1$ is the time difference of arrival between the reflected signal and the direct signal arriving at Microphone 1, $\tau_2$ is the time difference of arrival between the reflected signal arriving at Microphone 2 and the direct signal arriving at Microphone 1, and $\eta$ is the amplitude ratio of the reflected speech signal to the direct speech signal. The cross-correlation function of $s_1(t)$ and $s_2(t)$ can be defined as:
$$R_{s_1s_2}(u) = \eta\,\delta(u-\tau_2) + \eta^{2}\,\delta(u-(\tau_1-\tau_2)) + \delta(u-\tau_d) + \eta\,\delta(u-(\tau_1-\tau_d)) \tag{B.2}$$
where $\delta(u)$ is the Dirac delta function. From Equation (B.2), it is clear that the cross-correlation function has four peaks, at $u = \tau_2$, $u = \tau_1 - \tau_2$, $u = \tau_d$ and $u = \tau_1 - \tau_d$, so the correct time delay cannot be determined. Thus, it is necessary to remove the reverberation. In this paper, the cepstral prefiltering technique [21] is applied to the received signals before the TDOA estimation in a typical reverberant environment. The cepstrum is defined as the inverse Fourier transform of the log-spectrum of a stationary random process [22], where the cepstrum of a discrete-time signal $x[n]$ is given by:
$$c[k] = F^{-1}\{\log X(\omega)\}$$
where $X(\omega)$ is the Fourier transform of $x[n]$, $F^{-1}\{\cdot\}$ represents the inverse Fourier transform, the log operator is the complex logarithm and the integer variable $k$ is called the quefrency [21].
In the complex cepstrum domain, the complex cepstrum of the speech signal is concentrated near the origin, whereas the complex cepstrum of the reverberation lies far away from the origin. Based on this characteristic, cepstral prefiltering can be adopted to suppress the reverberation.