Study of Generalized Phase Spectrum Time Delay Estimation Method for Source Positioning in Small Room Acoustic Environment

This paper considers the application of signal processing methods to passive indoor positioning with acoustics microphones. The key aspect of this problem is time-delay estimation (TDE) that is used to get the time difference of arrival of the source’s signal between the pair of distributed microphones. This paper studies the approach based on generalized phase spectrum (GPS) TDE methods. These methods use frequency-domain information about the received signals that make them different from widely applied generalized cross-correlation (GCC) methods. Despite the more challenging implementation, GPS TDE methods can be less demanding on computational resources and memory than conventional GCC ones. We propose an algorithmic implementation of a GPS estimator and study the various frequency weighting options in applications to TDE in a small room acoustic environment. The study shows that the GPS method is a reliable option for small acoustically dead rooms and could be effectively applied in presence of moderate in-band noises. However, GPS estimators are far less efficient in less acoustically dead environments, where other TDE options should be considered. The distinguishing feature of the proposed solution is the ability to get the time delay using a limited number of the adjusted bins. The solution could be useful for passively locating moving emitters of narrow-band continual noises using computationally simple frequency detection algorithms.


Introduction
The problem of time-delay estimation (TDE) is to measure the difference in the time of arrival of signals recorded by space-separated sensors. This task is relevant for many applications, including those which are related to signal source localization [1]. The position of the object can be determined on the straight line [2,3], on the plane [4,5], and in space [6][7][8] depending on the location and the number of sensors.
The use of TDE methods is typical for those areas of technology where there is a need for the passive location of objects emitting signals. The physical nature of the signal, however, is not essential. Among practical applications, we can highlight the pipeline leaks position determination [2,3], local mobile objects positioning [9], passive radio positioning [1], etc. In recent years, the problem of TDE has become more relevant in connection with the spread, on the Internet, of concepts and services providing contactless control of household appliances [10], automatic tracking of objects [7], as well as in the sensor systems of robotic devices [11]. A common problem in the implementation of each of the listed

Materials and Methods
The most studied and widespread TDE technique is based on cross-correlation functions computation (CCF) [2]. CCFs are calculated for different time series pairs of sampled microphone signals, based on the position of the maximum in a correlogram. An alternative to the TDE correlation methods are phase-frequency methods, suggested firstly in [17]. Unlike correlation methods which analyze signals in the time domain, phase methods operate with signals frequency-domain representations. This section is devoted to the phase methods of TDE.
This paper considers the simplest case with two sensors, shown in Figure 1. Obviously, two sensors are not enough for unambiguous signal source localization on a plane or in space [11]. Depending on the relative sensor's position and the position of the signal source, a pair of microphones may be sufficient to determine the direction towards the object. In general cases, at least three sensors are required to determine the position of the source in a room [16]. In this case, the signals of the sensors array can be processed both simultaneously and in pairs [8]. The latter means that the algorithm considered in the paper can be used to localize the signal source in a room using three or more microphones.

Ideal Propagation Model
The TDE task for sound source detecting in a room can be formalized in several ways [8]. Each method is a compromise between the signal propagation model accuracy and the complexity of the mathematical description of the problem. The main acoustic signal

Ideal Propagation Model
The TDE task for sound source detecting in a room can be formalized in several ways [8]. Each method is a compromise between the signal propagation model accuracy and the complexity of the mathematical description of the problem. The main acoustic signal propagation models are [8]: ideal propagation model, multipath propagation model, and reverberation model. In this work, we consider that the simulated microphones are equally capable of efficiently registering signals coming from any direction. The ideal propagation model assumes that there is only one path from the signal source to each of the microphones. Let s 0 (t) be the signal emitted by the source. Then the signals of the receivers will be where τ a , τ b are lag values; α a , α b are signal attenuation coefficients; n A (t), n B (t) are random uncorrelated additive microphone noises. The values of τ a , τ b are determined by the geometric distances r a , r b from the signal source to the corresponding receiver where c is the sound speed. Attenuation of signals α a , α b can be caused by various factors, however, in the simplest ideal case, exclusively source beam pattern and the scattering of the sound wave are considered and, so where k is a constant coefficient. In this case, the TDE is performed to get the value τ ab = τ b − τ a which is used further to determine the position of the sound source. Using the notations above and having redefined t = t − τ b , we can rewrite (1) Expression (4) does not consider the influence of several physical factors, such as reflection and absorption of sound in a room.
Later, in the course of computational experiments with the ideal scenario, we will take that k = 1, since the target signal-to-noise ratio (SNR) can be achieved exclusively by changing the noise intensity.

Reverberation Model
The problem of the ideal propagation model is that the assumptions made do not correspond to the acoustic conditions of the real-world enclosed room. Firstly, there are always several paths for sound propagation between the source and the receiver due to the presence of reflected waves. Secondly, the absorption of sound energy by room surfaces has a significant effect on the recorded signal.
In accordance with the reverberation model, the received signals are described as follows where h a (t), h b (t) are room impulse response (RIR) functions. The complexity of application of (5) is in the practical difficulty of RIR determination. Acoustic measurements [18] or mathematical methods can be used to solve this problem. The image model method, first proposed in [19], is the most widespread among the latter. Alternatively, statistical methods [20] or methods based on geometric acoustics and ray tracing [21] can be used. To create realistic sound signals in this work, the image model method was used in the implementation of Lehman, Johansson and Nordholm [22,23].

Basic Phase Shift TDE
The phase TDE algorithm is based on obtaining information about the delay value from the cross-phase spectrum Φ ab of two signals. The algorithm for constructing the cross-phase spectrum is known from spectral analysis [14]. At the initial stage, the Fourier transforms S a (f k ) and S b (f k ) of the signals of each of the channels are determined where s a (t i ) and s b (t i ) are series of N real samples of s a (t) and s b (t) signals sampled with an interval ∆; F D is the operator of short-time discrete Fourier transform (DFT); S a (f k ) and S b (f k ) are spectrums of the signals.
Further instantaneous cross-spectrum of signals S where superscript (q) indicates the time instant t q = ∆·N·q of the beginning of the q-th time window; * is the element-wise complex conjugation; × is the element-wise product. The final measurement of the cross-spectrum S ab (f k ) is obtained by averaging the Q instantaneous spectrums It should be noted that the application of (8) requires that the signal source remains stationary relatively to the receivers during the entire time of signal recording. If it is not, the spectral estimation S ab (f k ) would not be correct. However, this assumption is normally relevant for the cross-spectrum. If we consider that neither source nor sensors are moving, the phase shift for each particular harmonic component will remain the same for all Q instantaneous spectrums. Therefore, coherent accumulation is applied this way to reduce the impact of the additive random noise.
To retrieve the set of phases, the phase cross-spectrum Φ ab (f k ) is finally calculated where U is an operator of phase unwrapping [24]; arg is the operator for defining the argument of a complex number. All harmonic components presented in s 0 (t) will also be present in s a (t) and s b (t). In this case, the phase difference between the k-th harmonic components of s a (t) and s b (t) is determined by τ ab ·f k . Therefore, the estimation τ ab can be obtained as the coefficient of proportionality in the line equation of the approximating Φ ab (f k ).
The valueτ ab can be determined, for example, based on the criterion for minimizing the squared error function [14]. Let the error e be determined as where b ab is a constant term. Then Equating the derivatives to zero in (11) results in where values A, C, B, D can be computed with the proposed scheme An advantage of the algorithm based on the use of (12) and (13) is that non-adjacent spectral bins can be used for TDE. It is optimal to choose k ∈ S, where S is a set of the most essential harmonic components of the signal s 0 (t).

Generalized Phase Spectrum TDE
A modification of the method described in the previous subsection can be used to localize stationary signal sources. The modified method was initially proposed in [15] and was named GPS TDE.
A distinctive feature of the generalized method is the use of real-valued frequency weight function W(f k ) which is used to determineτ ab . Similarly to (10), the weighted error in this case are introduced Obtaining a calculation formula forτ ab could be carried out in the same way as in the previous subsection It is clear from (14) that the functions W(f k ) should be chosen in the way that its value is high if the useful signal prevails over noises at the f k frequency and differs little from zero in other cases. A set of five frequency weighting functions was investigated in [14]. Table 1 below shows the calculation formulas for these functions.

Method
Nomenclature Formula The coherence function γ 2 ab (f k ) widely used for this purpose is calculated as It should be noted that the computational scheme proposed in this section differs from the one in [14]. Equation (15) allows the unwrapped phase spectrum to not pass through the origin, as far as we used coefficient b ab in linear regression. This feature is practically important and will be addressed later. As far as W(f k ) is based on spectral estimations, the generalized method should be applied carefully for signals that are non-stationary.

Results and Discussion
A series of computational experiments were carried out for a comparative evaluation of the algorithms. The human voice is commonly used for evaluation purposes in related studies [7,8]. Prior to the proposed study, we have tested algorithm performance for several speakers but did not find a significant difference in the results. Therefore, we have used the recording of one speaker and focused the study mainly on evaluating the impact of additive noise and multipath propagation in a reverberant environment.
A recording of a male speaker's voice with additive random noise was used to produce a set of test signals. The noise-free sound was synthesized based on the recorded voice by each of two means: in accordance with (4) and in accordance with (5).
Additive noises were generated by software, then scaled and summed with the preprocessed recording. The spectral noise density was equal in the range from 0 to 1000 Hz. Signals and noises outside of this frequency range were not considered in the experiments. A similar approach to preparing the set of test signals was used in [25].
Noises of the same intensity were applied to both channels. At the same time, the intensity of the noise was set in such a way as to provide the target SNR relative to the root-mean-square value of the signals recorded by the sensors for the entire time of each instance of the experiment. When applying (1), the delay was introduced by shifting one copy of the record relative to another by an integer number of sampling intervals (f d = 44,100 Hz).

Experimental Setting
A set of stereo test records with a duration of about 20 s each were prepared for the study. The recording was analyzed in fragments of about 1 s during each instance of the experiment. At the same time, the analysis of each of the fragments was considered an independent experiment. The final estimations used to calculate the absolute error were obtained by averaging obtained values of the lag time.
The number of samples in each of the analyzed fragments was L = 40,960 (about 928.8 msec). The number of samples in the segment was taken to be N = 4096 (about 92.9 msec). Consequently, each piece of recording sound was fragmented into Q = 10 segments. When processing the results, the outputs corresponding to the segments of the recording, where pauses in speech predominated, were discarded.
Two different sets of frequency bins were used when applying (16). The first set contained frequency bins corresponding to the condition f k [100 Hz, 850 Hz]. The second set contained four non-overlapping frequency bands shown below. The choice of such frequency intervals was carried out in accordance with the form of power density spectrum of the raw signal shown in Figure 2. The presented characteristic was obtained by averaging all instantaneous power density spectrums with a window of N = 4096 samples. The position of the cut-off level was chosen empirically to optimize the TDE operation in the absence of reverberations. It should be noted that the power density spectrum for different speakers or even for different speech fragments by this speaker would not remain the same. However, the proposed procedure will remain applicable regardless.

Simulation of the Small Room Environment
As noted above, creating a realistic sound signal in accordance with (5) requires obtaining RIR functions h a (t), h b (t). The MATLAB program prepared by Eric Lehman [22] was used to obtain these characteristics. When calculating the RIR, the room parameters and the configuration of the sensors were specified as shown in Figure 3. The dimensions of the room were 5 × 3.5 × 2.25 m. The source has coordinates (1.5, 2.75, 1.8), and the microphones (4.5, 1.25, 1.8) and (4.5, 2.25, 1.8).
Two different sets of frequency bins were used when applying (16). The first set contained frequency bins corresponding to the condition fk ϵ [100 Hz, 850 Hz]. The second set contained five non-overlapping frequency bands shown below. The choice of such frequency intervals was carried out in accordance with the form of power density spectrum of the raw signal shown in Figure 2. The presented characteristic was obtained by averaging all instantaneous power density spectrums with a window of N = 4096 samples. The position of the cut-off level was chosen empirically to optimize the TDE operation in the absence of reverberations. It should be noted that the power density spectrum for different speakers or even for different speech fragments by this speaker would not remain the same. However, the proposed procedure will remain applicable regardless.

Simulation of the Small Room Environment
As noted above, creating a realistic sound signal in accordance with (5) requires obtaining RIR functions ha (t), hb (t). The MATLAB program prepared by Eric Lehman [22] was used to obtain these characteristics. When calculating the RIR, the room parameters and the configuration of the sensors were specified as shown in position of the cut-off level was chosen empirically to optimize the TDE operation in the absence of reverberations. It should be noted that the power density spectrum for different speakers or even for different speech fragments by this speaker would not remain the same. However, the proposed procedure will remain applicable regardless.

Simulation of the Small Room Environment
As noted above, creating a realistic sound signal in accordance with (5) requires obtaining RIR functions ha (t), hb (t). The MATLAB program prepared by Eric Lehman [22] was used to obtain these characteristics. When calculating the RIR, the room parameters and the configuration of the sensors were specified as shown in Figure 3 The reverberation time (T 60 ) was assumed to be 50 msec and 120 msec. The first value is compliant with the standards of a room intentionally designed for voice broadcasting. The second value is compliant with the requirements for verbal communication in an office space [26]. The synthesized RIRs are shown in Figure 4.  The reverberation time (T60) was assumed to be 50 msec and 120 msec. The first value is compliant with the standards of a room intentionally designed for voice broadcasting. The second value is compliant with the requirements for verbal communication in an office space [26]. The synthesized RIRs are shown in Figure 4.   Table 2 shows the absolute TDE errors for various weight functions and the ideal signal propagation model. Figure 5 shows the dependence of TDE error on SNR.   Table 2 shows the absolute TDE errors for various weight functions and the ideal signal propagation model. Figure 5 shows the dependence of TDE error on SNR.    (15) and (16) provides greater accuracy while increasing the intensity of in-band noises. At the same time, the use of the second reduced frequency set allows you to reduce the threshold SNR to 4 dB over which sharp drop in the accuracy manifests. Figure 6 shows the absolute TDE error for SNR 8 dB for WPHAT and WML. When the noise intensity is not sufficient to go over the threshold, the estimators demonstrate the best possible performance in terms of accuracy regardless the noise level. When the SNR drops below the threshold level, the accuracy degrades gradually with the intensification of the noise. However, using a reduced set of frequency bins makes the contaminating effect of in-band noise less harsh. Notably, this is more obvious for WPHAT than for WML.    Figure 6 shows the absolute TDE error for SNR ≥ 8 dB for W PHAT and W ML . When the noise intensity is not sufficient to go over the threshold, the estimators demonstrate the best possible performance in terms of accuracy regardless the noise level. When the SNR drops below the threshold level, the accuracy degrades gradually with the intensification of the noise. However, using a reduced set of frequency bins makes the contaminating effect of in-band noise less harsh. Notably, this is more obvious for W PHAT than for W ML . That can be explained by the fact that frequency weighting applied with ML estimator compensates for frequency bins where noise prevails over the signal. Despite the fact, that threshold SNR level appears in Figure 6 to be better for PHAT than for ML, the latter estimator surpasses the former in terms of accuracy in the single path scenario regardless of noise intensity. The frequency weighting function for the ML estimator is in Figure 7.  Figure 7 shows the form of Фab (fk) and all W(fk) in the absence of noise (SNR = 32 dB) and their presence (SNR = 4 dB). A part of the curve that is close to linear shape is clearly distinguished at Фab, in both cases, however, in the presence of noise, the corresponding frequency range is significantly narrower. It should be noted that Фab in the absence of noise passes through the origin and behaves as described in [14]. However, when the signal is contaminated with the noise, Фab is offset relative to the abscissa axis. This can be explained by the fact that there is no voice signal on frequencies up to 100 Hz, so the prevalence of the noise in this band results in an unpredictable offset of the unwrapped phase spectrum. That makes the estimation technique proposed in [14] not relevant for this task.

Comparison of GPS TDE Methods in Anechoic Environment
The shape of WSCOT and WCOH is close to a line parallel to the time axis in the absence of noise. In the presence of noise, a high level of WSCOT and WCOH is observed in the intervals where the cross-power spectrum |Sab| has high values. WBCC form follows the shape of |Sab| and does not differ significantly in the presence of noise and their absence. Four areas of high values are visible at the WML corresponding to the Фab regions that are best approximated by the line.  Figure 7 shows the form of Φ ab (f k ) and all W(f k ) in the absence of noise (SNR = 32 dB) and their presence (SNR = 4 dB). A part of the curve that is close to linear shape is clearly distinguished at Φ ab , in both cases, however, in the presence of noise, the corresponding frequency range is significantly narrower. It should be noted that Φ ab in the absence of noise passes through the origin and behaves as described in [14]. However, when the signal is contaminated with the noise, Φ ab is offset relative to the abscissa axis. This can be explained by the fact that there is no voice signal on frequencies up to 100 Hz, so the prevalence of the noise in this band results in an unpredictable offset of the unwrapped phase spectrum. That makes the estimation technique proposed in [14] not relevant for this task.
The shape of W SCOT and W COH is close to a line parallel to the time axis in the absence of noise. In the presence of noise, a high level of W SCOT and W COH is observed in the intervals where the cross-power spectrum |S ab | has high values. W BCC form follows the shape of |S ab | and does not differ significantly in the presence of noise and their absence. Four areas of high values are visible at the W ML corresponding to the Φ ab regions that are best approximated by the line.

Comparison of GPS TDE Methods in Reverberant Environment
Tables 3 and 4 summarize the average TDE absolute errors for different weighting functions, reverberation model and different reverberation times.  Table 4. Absolute error of GPS TDE with reverberation model (T 60 = 120 msec).  Figure 8 shows that in the presence of reflected signals, the ML estimator is inferior in accuracy to the SCOT and COH estimators, especially in the absence of additive noises. At the same time, the accuracy turns out to be significantly lower than in the previous case. This can be explained by the correlation of the signals with their reflected copies. In the presence of reverberations and intense noises, none of the functions show any accuracy advantage. The latter makes it useful to apply the BPS TDE method (PHAT) as the simplest one.

SNR Mean Absolute Error (msec)
The use of the second set of frequency bins provides an advantage in accuracy only in conditions of noise dominance (SNR ≤ 0 dB). The use of the complete set of frequency bins provides significantly better accuracy in other cases. Figure 8 shows the dependence of TDE error on SNR graphically.    drastic increase in the error both in the presence and absence of noise. However, with the dominance of noise over the signal, the presence of reflected copies has a positive effect on accuracy. However, even if this is the case, the TDE error remains unacceptably high for a significant part of practical applications.
The use of the second set of frequency bins provides an advantage in accuracy only in conditions of noise dominance (SNR 0 dB). The use of the complete set of frequency bins provides significantly better accuracy in other cases. Figure 8 shows the dependence of TDE error on SNR graphically. Figure 9 shows the results of using GPS TDE for various acoustic conditions of the environment. It is clear from the figure that the reverberation time increase leads to a drastic increase in the error both in the presence and absence of noise. However, with the dominance of noise over the signal, the presence of reflected copies has a positive effect on accuracy. However, even if this is the case, the TDE error remains unacceptably high for a significant part of practical applications.  It can be seen from the form of Фab that an increase in the reverberation time leads to a distortion of the frequency response form and a decrease in the estimate accuracy. At the same time, the distortions observed for WSCOT and WCOH are not as significant as they were in the absence of reverberations and the presence of noises. This can be explained by the fact that the reflected signals are mutually correlated, and their presence does not contribute to a significant decrease in the level of signal coherence. The correlation of the reflected signals also affects at the shape |Sab| and, therefore, at the WBCC form. The WML form also changes significantly with an increase in the reverberation time, while the regions of high values also correspond to the linear sections Фab. At T60 = 120 msec, the number of such sections becomes smaller which negatively affects the accuracy.  Figure 10 shows the form of Φ ab (f k ) and all W(f k ) for different values of reverberation time (T 60 ). All graphics in Figures 7 and 10 are obtained for one and the same fragment of the original signal. It can be seen from the form of Φ ab that an increase in the reverberation time leads to a distortion of the frequency response form and a decrease in the estimate accuracy. At the same time, the distortions observed for W SCOT and W COH are not as significant as they were in the absence of reverberations and the presence of noises. This can be explained by the fact that the reflected signals are mutually correlated, and their presence does not contribute to a significant decrease in the level of signal coherence. The correlation of the reflected signals also affects at the shape |S ab | and, therefore, at the W BCC form. The W ML form also changes significantly with an increase in the reverberation time, while the regions of high values also correspond to the linear sections Φ ab . At T 60 = 120 msec, the number of such sections becomes smaller which negatively affects the accuracy.  Figures (a,c,e,g,i) are obtained for T60 = 50 msec. Figures (b,d,f,

Conclusions
This study investigated GPS TDE in relation to the problem of localizing a sound source in a small room. The suggested TDE algorithm is based on the analysis of the phase response form which makes it possible to estimate the time by analyzing an arbitrary set of spectral bins.
To assess the algorithm's applicability and efficiency, a series of computational experiments were performed to simulate the speaker positioning within a small room. To simulate room acoustics, the image model implemented by Lehman and Johanson [23] was used. During the course of the experiment, the SNR at the signal receivers was varied, as well as the room reverberation time.
The fundamental applicability of the suggested algorithm was shown due to the performed experiment. In the absence of noises and echo, GPS TDE demonstrates an accuracy comparable to the sampling error at f d = 44,100 Hz (about 0.01 s). A decrease in accuracy is expected in the absence of echo but at an increase in the intensity of additive noise. However, narrowing of the frequency range over which TDE is performed helps to maintain accuracy under moderate noises (SNR > 4 dB). The best accuracy characteristics are provided by the ML GPS estimator.
When an echo occurs, TDE accuracy downgrades significantly. The reflected signals are correlated, and, therefore, introduce extra noise to the correlogram. In this case, the use of a reduced set of spectral bins affects the accuracy negatively. Even with insignificant reverberations, corresponding to an acoustical very dead room and the absence of noises, the ML GPS estimator demonstrates a relatively low accuracy. The SCOT and COH GPS estimators show the best results. In conditions of higher reverberations, the TDE error increases significantly in comparison to the ideal case and makes the use of the GPS method ineffective. In practice, however, the influence of echo can be lower, as real-world microphones are not omnidirectional.
Even though the suggested method is inferior to analogs in a few aspects, its advantage remains high computational efficiency. The suggested computational scheme, when using a relatively small number of adjacent frequency samples for TDE, allows the use of Goertzel's algorithm instead of FFT [27]. This is essential for embedded computers with memory constraints. Additionally, the use of well-known implementations of the Goertzel algorithm designed for phase detection [28] will make it possible to re-evaluate the spectral characteristics of the signal with new data arrival. The latter is useful for solving the problem of positioning a mobile acoustic source. Further studies will be devoted to the testing of this hypothesis.