Comparison of Direct Intersection and Sonogram Methods for Acoustic Indoor Localization of Persons

We discuss two methods to detect the presence and location of a person in an acoustically small-scale room and compare the performances for a simulated person in distances between 1 and 2 m. The first method is Direct Intersection, which determines a coordinate point based on the intersection of spheroids defined by observed distances of high-intensity reverberations. The second method, Sonogram analysis, overlays all channels’ room impulse responses to generate an intensity map for the observed environment. We demonstrate that the former method has lower computational complexity that almost halves the execution time in the best observed case, but about 7 times slower in the worst case compared to the Sonogram method while using 2.4 times less memory. Both approaches yield similar mean absolute localization errors between 0.3 and 0.9 m. The Direct Intersection method performs more precise in the best case, while the Sonogram method performs more robustly.


Introduction
Acoustic localization systems can provide, partly due to the comparably slower wave propagation, a high accuracy indoors similar to radio-based solutions, which are not covered by ubiquitous satellite signals of Global Navigation Satellite Systems (GNSS) [1][2][3]. For some applications, it may not be desirable to equip persons or objects with additional hardware as trackers due to inconvenience and privacy reasons. Previously, we reported coarsely about indoor localization by Direct Intersection in [4]. In this work, we report in detail on two algorithms for this application and their performances. The proposed system is categorized as a passive localization system [5] and is implemented solely with commercial off-the-shelf (COTS) hardware components.
Echolocation, such as the method used by bats to locate their prey, is a phenomenon where the reflected sound waves are used to determine the location of objects or surfaces that reflect the sound waves due to a change in acoustic impedance. This concept has been extensively used for various investigations in the physics and engineering fields, such as sound navigation and ranging (Sonar) [6,7] and even using only a single transducer for transmission and reception [8].
We draw the approach from bats, which can perceive the incoming reflected wave's direction due to its precise awareness of head angle, body motion, and timing. While the exhaustive echolocation method of bats is not completely understood, one of the more obvious aspects is the back-scattered signals' difference of arrival in time between left and right ears, which can be used to calculate the incoming sound wave's direction [9]. This approach differs from approaches that more generally detect changes in the systems response of a medium, where the responses act like fingerprints. However, in application, insignificant changes in a room may lead to distortions in the response. This makes a better knowledge of the specific room necessary. In contrast, determining times-of-arrival of back-scattered waves is less dependent on the complete impulse response; we therefore chose this approach. We investigate two different algorithms based on the time difference of arrival of the first-order reflection to interpret the returned signals in a small office room of approximately 3 m × 4 m × 3 m similar to [10], which are characteristic for the strong multipath fading effects that partially overlap and interfere with the line-of-sight reverberations [11]. The signal frequency employed in our experiment is significantly higher than the Schroeder frequency; therefore, we can assume the sound wave behaves much like rays of light [12]. The physiological structure and the shape of the binaural hearing conformation of bats, together with the natural and instinctive ability to perform head movements to eliminate ambiguities, enhances the echolocation and therefore guarantees excellent objects spatial localization [13]. Our system setup is a fixed structure, and we compensate the adaptive bats head movements by adding two additional microphones to the system. Furthermore, we raise the question of the performance of two approaches and compare the memory consumption and execution time.
The detection of more than one person or object is not investigated in this work.

Related Work
Indoor presence detection may be achieved through a variety of different technologies and techniques. For one, radio-frequency (RF)-based approaches have been implemented. In general, these may be classified into two different employed techniques: received signal strength indicator (RSSI)-and radio detection and ranging (Radar)-based approaches. The former offers low-complexity systems with cheap hardware [14,15], whereas with the latter one, higher accuracy may be achieved [16]. The other main concept employed in indoor presence detection is using ultrasonic waves, which are applied in active trackers indoors [17,18] and even underwater [19,20]. An entirely passive approach, as in [21], generally analyzes audible frequencies, which can include speech and potentially violate privacy regulations, similar to vision-based approaches. Acoustic solutions, which operate close to or in the audible range, can be perceived by persons and animals alike, which may cause irritation and in the worst case harm [22]. Therefore, special care has to be invested in designing acoustic location systems. While radio-based solutions are less critical in this concern, due to the fact that most organisms lack sensitivity to radio frequency signals, the frequency allocation is much more restrictive due to licensing and regulations. While LIDAR systems are highly accurate, but comparably costly, other light-based systems have gathered interest again, due to their high accuracy potential, with low systems costs and power consumption [23].

RF-RSSI
Mrazovac et al. [24] track the RSSI between stationary ZigBee communication nodes, detecting changes to infer a presence from it. In the context of home automation, this work is used to switch on and off home appliances. Seshadri et al. [15], Kosba et al. [14], Gunasagaran et al. [25], and Retscher and Leb [26] analyze different signal strength features for usability of detection and identification using standard Wi-Fi hardware. Kaltiokallio and Bocca [27] reduce the power consumption of the detection system by distributed RSSI processing.
This technique was then improved by Yigitler et al. [28], who built a radio tomographic map of the indoor area. The difference from the previously sampled map of RSSI values is the notification of a presence or occupancy. This general concept is known in the field of indoor localization as fingerprinting. Hillyard et al. [29] utilize these concepts to detect border crossings.

RF-Radar
Suijker et al. [30] present a 24 GHz FMCW (Frequency-Modulated Continuous-Wave) Radar system to detect indoor presence and to be used for intelligent LED lighting systems. An interferometry approach is implemented by Wang et al. [16] for precise human tracking in an indoor environment. Another promising approach in the RF domain is, instead of using a time-reversal approach (as Radar does), deriving properties of the medium (and contained, noncooperative objects) by means of wave front shaping as proposed by del Hougne et al. [31,32]. This approach would also in principle be conceivable in the acoustic wave domain.

Ultrasonic Presence Detection and Localization
A direct approach to provide room-level tracking is presented by Hnat et al. [33]. Ultrasonic range finders are mounted above doorways to track people passing beneath. More precise localization can be achieved by using ultrasonic arrays as proposed by Caicedo and Pandharipande [9,34]. The arrays' signals can be used to obtain the range and direction-ofarrival (DoA) estimates. The system is used for energy-efficient lighting systems. Pandharipande and Caicedo [7] enhanced this approach to track users by probing and calculating the position via the time difference of arrival (TDoA). Prior to that, Nishida et al. [35] proposed a system consisting of 18 ultrasonic transmitters and 32 receivers, embedded in the ceiling of a room with the aim to track elderly people and prevent them from experiencing accidents. A time-of-flight (ToF) approach was proposed by Bordoy et al. [36], who used a static co-located speaker-microphone pair to estimate human body and wall reflections. Ultrasonic range sensing my be combined with infrared technology, as has been done by Mokhtari et al. [37], to increase the energy efficiency. In lower frequency regimes, the resonance modes of a room start to dominate the measured signals. This fact may be used to deduce source locations as proposed by Nowakowski et al. [38] (cf. [39,40]).

Ultrasonic Indoor Mapping
Indoor mapping and indoor presence detection are two views of the same problem. In both instances, one tries to estimate the range and direction for a geometrical interpretation. Ribeiro et al. [41] employ a microphone array co-located to a loudspeaker to record the room impulse response (RIR). The multiple reflections can be estimated from this RIR with the use of l 1 -regularization and least-squares (LS) minimization, and a room geometry can be inferred, achieving a range resolution of about 1 m. A random and sparse array of receivers is proposed by Steckel et al. [42] for an indoor Sonar system. In addition to that, the authors use wideband emission techniques to derive accurate three-dimensional (3D) location estimates. This system is then enhanced with an emitter array to improve the signal-to-noise-ratio (SNR) [43]. Another approach, implementing a binaural Sonar sensor, is proposed by Rajai et al. [44]. A sensor was used to detect the wall within a working distance of one meter. In a recent work by Zhou et al. [45], it is shown that a single smartphone with the help of a gyroscope and an accelerometer can be used to derive indoor maps by acoustic probing. Bordoy et al. [46] use an implicit mapping to enhance the performance of acoustic indoor localization by estimating walls and defining virtual receivers as a result of the signals' reflections.

Algorithms
The first set of methods, which are broadly applied are triangulation algorithms as described by Kundu [47]. In this work we focus on two Maximum-Likelihood approaches, similar to the one proposed by Liu et al. [48]. The first one, Direct Intersection (DI), uses a Look-up- Table (LUT) and spheres inferred from the sensors delay measurements with error margin [49], while the other one, the Sonogram method, populates a 3D intensity map with probabilities to find likely positions of the asset. Since the approaches of the two methods are different, it is likely to expect different outcomes in accuracy, precision, computational complexity, and memory requirements.

System Overview
The system consists of a single acoustic transmitter, a multi-channel receiver, a power distribution board, and a central computer to analyze the recorded signals. Four microphones are placed equidistantly around the speaker and connected to the receiver board. The set-up is shown in Figure 1 as it was used for the experiment reported below.

Signal Waveform
Due to their auto-correlation properties and the ability to maximize the Signalto-Noise-Ratio (SNR) without increasing acoustic amplitude, swept-frequency cosine, i.e., frequency modulated chirp signals, perfectly fit our case-study [50]. Auto-correlated frequency-modulated chirps are able to provide compressed pulses at the correlator output, whose width in time space is defined as follows [51]: The frequency-modulated signal employed in our experiments, x Tx (t), is mathematically defined as follows: where A denotes the signal amplitude, f start is the start frequency, f end the end frequency, B = f end − f start the frequency bandwidth, T s is the pulse duration, and φ(t) the instantaneous phase. The chirp instantaneous frequency is defined as follows: Taking into account the hardware characteristics of our setup, we selected a linear up-chirp pulse with amplitude A = 1, T s = 5 ms, f start = 16 kHz, and f end = 22 kHz, which result in a time-bandwidth product of T B = 30. The frequency response of a chirp signal directly depends on the Time-Bandwidth (T B) product. For chirps with T B ≥ 100, the pulse frequency response is almost rectangular [52]. However, due to the hardware limitation of our setup, which do not allow a high (T B) product, the frequency response will be characterized by ripples. In order to mitigate the spectrum disturbances, we consider a window in the time domain the transmitted chirp pulse with a raised cosine window [52]. The frequency band, chirp length, and shaping window were chosen to minimize the system affecting persons and animals in hearing range. We implemented chirps, due to their property of spreading the signals energy over time compared to a single pulse to limit the maximal amplitude and resulting harmonics. While young and highly audio-sensitive people can in principle hear these frequencies, the short signal length of 5 ms compared to the repetition interval of 1000 ms further reduces the occupation of the low ultrasonic channel. Generally speaking, higher amplitudes and lower frequencies potentially increase the operation range of the system, but this comes at a health risk for humans and animals, which we seek to avoid.

Hardware Overview
To obtain 3D coordinates with static arrangement, a four-element microphone array is sampled, as well as a feedback signal. This array records the incoming echo wave with different time of arrival, depending on the incoming signal direction. Since unsuitable hardware can affect the system's performance [53], both the microphones and speaker were tested for correct signal generation and reception in an anechoic box.

Data Acquisition
Each microphone's signal was preconditioned before the digitization by the multichannel analog-to-digital converter, which was chosen to provide each channel with the identical sample-and-hold trigger flank before conversion. Each frame consists of the signal from each microphone and a feedback, which is recorded as an additional input to estimate and mitigate playback jitter. The first layer of digital signal processing is to compress the signal, extracting the reverberated acoustic amplitude over time and removing the empty room impulse response (RIR).

Channel Phase Synchronization
Initially, we calculate the convolution of the feedback channel signal s fb with our known reference signal s ref in its analytic form to obtain the RIR and retrieve the time of transmission from the compressed signal y fb , as shown in Equation (5), where j denotes the imaginary unit.
This compressed analytic form y fb of the feedback signal s fb (see Figure 1) ideally holds only a single pulse from the transmitted signal, if the output stage is impedance matched. Searching for the global maximum returns both time of transmission and the output amplitude.
In the following, we refer to the start time of a transmission as t 0 , all other channels' time scales are regarded relative to t 0 . Therefore, the signals of the microphone channels are truncated to remove information prior to the transmission. The ring-down of small office rooms is in the order of 100 ms, so the repetition interval of consecutive transmissions is chosen accordingly to be larger. This prevents leakage of late echos into the following interval, which would result in peaks being recorded after the following interval's line-ofsight. The remaining signal frames from all microphones are compressed with the same approach as the feedback channel, shown in Equations (5) and (6), to extract each channel's compressed analytic signal y i and line-of-sight detection time t i .

Baseline Removal
In the following, we refer to the acoustic channel response after the line-of-sight as the echo profile. An example of such echo profiles is shown in Figure 2. While the line-ofsight signal ideally provides the fastest and strongest response, large hard surfaces, like desks, walls, and floors return high amplitudes, which are orders of magnitude above a person's echo. For a linear and stable channel, we can reduce this interference from the environment by subtracting the empty room echo profile from each measurement, following the approach of [54]. This profile loses its validity if the temperature changes, the air is moving, or objects in the room are moved, e.g., an office chair is slightly displaced. A dynamic approach to create the empty room profile is updating an estimation, when no change is observed for an extended time or alternatively using a very low-weight exponential filter to update the room estimation. In this work, the empty office room was sounded N times directly before each test and averaged into an empty room echo profileȳ • i for each channel i as denoted in Equation (7), to assure unchanged conditions and reduce the complexity of the measurements. The removal itself is then, as mentioned above, the subtraction of the baseline from each measurement, as in Equation (8), under the assumption of coherence.ȳ

Time-Gating
For our approach we assume some features of the person, such as being closer to the observing system compared to the distant environment objects, like chairs, tables and monitors, while another area of reverberations is in the close lateral vicinity of the system, consisting, e.g., of lamps and the ceiling. This is exploited by introducing a time gate, which only allows for non-zeros values in the interval of interest as in Equation (9) (also compare Figure 2).ỹ Another assumption is that of a small reverberation area on the person. We assume the points of observation from each microphone to be sufficiently close on a person to overlap. The latter assumption introduces an error, which limits the precision of the system in the order of 10 cm [55], which we deem sufficient for presence detection, as a person's dimension is considerably larger in all directions. This estimation is based on the approximate size of a person's skull and its curvature with respect to the distance to the microphones and their spacing. The closer the microphones and the further the distance between head and device, the more the reflection points will approach each other. If we regard a simplified 2D projection, where a person with a spherical head of radius r H ≈ 10 cm moves in the y-plane only, the position of a reflection point R = (x R , z R ) on the head can be calculated by where x C and z C are the lateral and vertical center coordinates of the head and α R is the reflection angle. The latter is calculated through with the distance d M between the microphone and sender. The origin is set as the speaker position. By geometric addition, the distance between two such reflection points can be calculated and reach the maximum value if the head moves towards the center. In this case, the reflection points would be on the opposing sides of the head and result in a mismatch of 2 r h . The other extreme is laterally moving to a infinite distance, which increases the magnitude of x C , while the distance between microphone and speaker stays constant; therefore, the reflection points converge to a single point of reflection. In this work, the distance between head center and speaker remained above 120 cm, with a projected error distance of about 1.3 cm.

Echo Profile
During the experiment, the reflected signals from the floor, walls, tables, and chairs have a very high amplitude. This interference can lead to masking the echo from the target object. To reduce the effect of the interference, the empty room profile is used to subtract the target impulse response from the input impulse response. If we define the reflection from objects other than the target object as noise, we can increase the signal-to-noise ratio with this method. The empty room impulse response is also called empty room echo profile in this work. In Figure 2, the upper plot is the empty room impulse response, where the experiment room is cleared of most clutter. The middle plot is the room with single static object as target, shown in Figure 3. The lower plot shows the result of subtraction between the the second and first plot, and the scale is adjusted for clarity.

Distance Maps
Look-up tables are calculated before the experiment to estimate the travel distance of a signal from the speaker to each microphone under the assumption of a direct reverberation from a point at position x in the room and linear beam-like signal propagation. This grid is formed by setting the center speaker as origin and spanning up a 3-dimensional Cartesian coordinate system of points x through the room in discrete steps. We limit the grid to the intervals X 1 to X 3 in steps of 1 cm to decrease the calculational effort and multipath content under the prior knowledge of the rooms geometry as follows: The look-up table approach serves to minimize the processing time during execution. The distance maps provide pointers to convert from binary sampling points to distance points. Each sub-matrix contains the sum of distance between each point in the room to the corresponding ith microphone at the position x M,i and to the speaker at position x S , which cover the flight path of the echoes, as in Equation (13): Therefore, the resultant entries in matrices M depend on the geometric arrangement of speaker and microphones, and the matrix size corresponds to the area of detection, as in Equation (12).

Direct Intersection
The main assumption for this approach (Algorithm 1) is that the highest signal peak in the observation window of each channel indicates the position of interest, as visualized in Figure 2. Each channels' peak index defines the radius r i of a sphere around each microphone, which is contained in the point cloud L i . While ideally those spheres overlap in exactly the point of reverberation, in practical application, where noise, interference, and jitters are present, this is not the case. To compensate this error, we pad the sphere by ∆r additional points in the radius until all spheres overlap and the unity of valid estimation points U L is not empty. The sphere radius widening ∆r can be used as an indication of each measurement's quality, as a low error case will require little to no padding, while in high-error cases, the required padding will be large. Another approach is to use a fixed and small padding, which will ensure only measurements of high quality to be successful, but will fail for high error scenarios.

Algorithm 1: Direct Intersection Estimation.
Input :ỹ tg , observed intensity data frames, M i , distance maps, K, number of channels, ∆r max , maximum radius spreading distance. Output : x est , estimated 3D-position.

Sonogram
The Sonogram approach (Algorithm 2) leverages available memory and processing power to build a 3D intensity map. This approach utilizes the entire echo profile difference shown in Figure 2 (bottom) and maps them into the 3D distance map explained in Section 3.3.5, with the assumption that the highest peak corresponds to the source of reverberation. The multiplication of impulse amplitude that corresponds to the same coordinates is used as an indication of possible reverberation source. Therefore, the maximum result would have the highest likelihood of being the reverberation source location.

Set-Up
In the experiment, we use a mock-up representing a person's head as the experiment target. The hard and smooth surface of the object is intentional for the sake of usability and to remove unintended movements from our measurements at this early stage. In the set-up shown in Figure 3, the central speaker emits the well-known signal s tx , and the reflected echoes from the target s 1 to s 4 are recorded by the microphone array around the speaker. The depiction in Figure 3 is exaggerated for clarity. Table 1 shows the spherical coordinates, i.e., radial distance r, azimuth angle θ, and elevation angle φ of the target inside the room, with the center of the device as the reference point. The device is positioned on the ceiling, oriented downward. For each position, we measure the distance for the assumed acoustic path with a laser distance meter Leica DISTO TM D3a BT for reference. As mentioned above, the coordinate system's point of origin is set to the center of the device, the x-axis is set perpendicular to the entrance door's wall, and increasing towards the right, the y-axis is parallel to the line of sight from the door and increasing towards the rear end of the room, and the z-axis is zero in the plane of the device (upper ceiling lamp level) and decreasing towards the floor. The two-dimensional depictions are shown in Cartesian coordinates to provide clarity, while the detection results are done in spherical coordinates.

Room Properties and Impulse Response
In preparation for the later experiments, we sounded the room 100 times as described in Section 3.3.2 to record the baseline profiles shown in Figures 4 and 5. This recordings were taken one time and served as a reference for all later experiment runs. During the recordings, the room was left closed and undisturbed.  The room exhibits a different room response for each microphone, as illustrated in Figure 4. We divide the response into four parts: line-of-sight, free space transition, first order echoes, and higher order echoes, i.e., coda [60]. The signal remains in the room for more than 100 ms, before it drops below the noise floor. The definition of the reverberation time from Sabine requires a drop of the sound levels below −60 dB [61,62], for which the low signal-to-noise ratio of less than 24 dB does not suffice. Therefore, we adapted a fractional model and extrapolated the reverberation from a drop of 20 dB. The resulting mean reverberation time of the room is approximatelyT rev ≈ 445 ms, which corresponds to a dampening factor δ ≈ 15.5 s −1 and a Schroeder frequency of approximately f sch ≈ 230 Hz, which is far below the transmission band. In this work, we focus on the response in the parts-free space transition and first-order echoes to estimate a person's position. A close-up of the first three parts of the room response is shown in Figure 5.
The recordings still show significant variances in each channel at varying positions, e.g., in the uppermost subplot of Figure 5 from 15 to 16 ms. Below 8 ms, these intervals with increased variances do not occur, indicating a stable channel. The signals' interval close to zero contains strong wall and ceiling echos. Note the very strong reverberation peak at 12.5 to 13.5 ms that is caused by the floor. As our area of interest does not fall within this distance, we omit it for analysis as well. Hence, the time-gate limits as introduced in Section 3.3.3 are t min = 3 ms and t max = 8 ms.
If we transfer the room dimensions into the wavelength space, hence with c as the speed of sound and l the room dimensions in the respective Cartesian direction, we can draw an estimator from [63] for the number of modes below the reference frequency f g as This lets us calculate approximately 15 × 10 6 modes below 16 kHz and 40 × 10 6 modes below 22 kHz, which leaves about 25 × 10 6 modes in the sounding spectrum in-between. If we regard the number of eigenfrequencies below the Schroeder frequency, Equation (15) yields N sch ≈ 73 modes that strongly influence the sound characteristics of the room [64].

Direct Intersection
The localization by Direct Intersection from all 100 runs is shown for each of the four reference positions in Figure 6. While the statistical evaluation is performed in spherical coordinates due to the geometric construction during the estimation, this overview plots, as well as those for the Sonogram localization are drawn in Cartesian coordinates that allow for easier verification and intuitive interpretation. The lateral spread of the estimation point cloud in Figure 6 1 is misleading as the points are situated on a sphere around the origin. The projected lateral extent is almost entirely due to the angular errors. Positions 1 and 2 show a distance estimation deviation of σ r ≈ 10 cm, as well as azimuth and elevation angle errors of σ`≈ σ OE < 5°for both Direct Intersection and Sonogram localization (compare Tables 2 and 3). For positions 3 and 4 , which are situated closer to the desks, the deviation increases to almost 40 cm in distance and almost arbitrary azimuth angles with a σ`≈ 120°and more, but a far less affected elevation angle estimation with a σ`< 10°. The deviations are calculated around the mean estimator for each value. For simplicity of interpretation, the mean error for each dimension is shown in Section 4.2.3. The error distributions for each dimension are shown in Figure 7, where each column depicts one of the spherical dimensions (radius, azimuth angle, and elevation angle), while each row represents the results from the reference position indicated to the left of the plot. For the first two positions, the distributions are almost unimodal, but for the latter two, this does not hold true, making the mean value and standard deviation unsuitable estimators. The distribution of the error in the absolute distance between the estimated positions and reference positions (see Figure 8) is likewise a few dozen centimeters for the first two cases, but around 1 m for the latter two. If we recall the reference positions from Table 1, the true distances are between 1 and 2 m, which puts the error in the same order as the expected value. The Direct Intersection method allows for an investigation into the time variance of the detected maximum peak, which is depicted in Figure 9. In the first two cases, we observe unimodal distributions of around 10 samples in width, while the latter cases show detected peaks all over the interval.

Sonogram
The Sonogram localization on the same data as before in Section 4.2.2 is shown in Figure 10 for all four cases. The lateral distribution of the estimated locations is not following the spherical shape as closely as is the case for those by Direct Intersection estimations (compare, e.g., Figure 6 1 ). Figure 10. The same position estimation plot as in Figure 6 for positions 1 to 4 but by Sonogram. The reference position is given by the green diamond, the averaged estimation by the red cross, and each circle represents a single estimated position. The circles' infill is proportional to the observed intensity. Similar to before, the method performs well in the cases 1 and 2 , exhibiting small deviations (see Table 3), but far less precise with the largest deviation increase in the azimuth angle as well. The corresponding mean errors to the reference positions are listed in Table 4.  The cases 3 and 4 display two larger clusters of estimated positions, which leads to the bimodal error distributions in Figure 7.

Direct Intersection
The absolute error is similarly distributed around lower values for the former two cases and widely spread for the two latter cases (see Figure 8). Note that the error distribution plots for the Sonogram are of slightly different horizontal scale, as no errors below 20 cm were observed, while the observed maximal error exceeds 200 cm.
Lastly, the performance of both algorithms with regard to execution time is listed in Table 5 and mean required memory in Table 6. The distribution of those measures is shown in Figures 11 and 12. The Direct Intersection method requires roughly 2.4× less memory than the Sonogram localization. With a best-case mean execution time of 0.66 s, the former algorithm is almost 1.7× faster than the best case mean of the latter method, while the worst-case mean-almost unchanged for the Sonogram approach-is with a factor of 7.1 for the Direct Intersection by far slower than the worst case mean execution time of the Sonogram method.   The Direct Intersection execution time varies strongly, as we observe it anywhere between 0.25 and 25.0 s; thus, without further limitations, it does not allow for a wellconfined prediction of the localization algorithm's execution time.

Localization
The localization methods discussed in Section 4 are based on the time of arrival of the line-of-sight reflection from the target. This is possible because the frequency-modulated signal in our experiments is significantly higher than the Schroeder frequency of the room. The Direct Intersection method provides throughout all cases distance estimations that are too short, while the Sonogram-based localization returns distance estimations that are longer than the reference (compare Figure 7). Regarding the absolute error distribution, we observe that the Direct Intersection method performs more accurately, especially in the better cases 1 and 2 , as well more precise in the first three of the four observed cases, as drawn from Figure 8. The possible cause of the degradation of both methods performance for cases 3 and 4 is in the peak detection algorithm, as Figure 9 shows a wide error range of detected possible peaks. While this was observed specifically for the Direct Intersection method, this also implies the low signal-to-noise ratio of the underlying echo profile, and consequently also affects the Sonogram estimation. Interestingly, the lower estimation errors for cases 1 and 2 implicate a better performance for the larger distances than the closer ones, which is counter-intuitive from a power perspective, but if we recall the empty room impulse responses shown in Figure 5, where noise is included as the curves' variance, and compare it to the magnitudes of a person's signal in Figure 2, the difference in magnitude is in the same order. For higher distances, the variance increases, as fluctuations in the speed of sound cause phase distortions, but for lower distances, interference effects dominate. The frequency band of the chirp between 16 and 22 kHz sets the wavelength range to approximately 2.2 to 1.6 cm, which is close to the distance between reflection points on a person's head, as shown above in Section 3.3.3. Proximity to objects increases interference as well, which explains the lower performance in the closer positions 3 and 4 , where the projected distance onto the sensor system's aperture between the person and the wall, screen, and desk is reduced. If we regard the error distributions of each position in Figure 7 again, the angles and distances roughly fit non-line-of-sight paths, especially for the Sonogram method.

Performance
The Direct Intersection method requires less than half the memory for its computations compared to the Sonogram method, as the information is very early condensed in the peak selection part of the algorithm. The index look-up is in itself a cheap operation, but due to the sphere-spreading loop to decrease the probability of the algorithm not returning any valid position at all, comes at higher execution duration. The observed worst case for Direct Intersection is with 25 s so high that no real-time tracking is possible anymore. If we look closer at Figure 6 3 , we see that the estimation point gray scale infill is proportional to the inverse spreading factor, so darker colors mean less radial spread before intersecting points could be found. The notion that including strong outliers by allowing the sphere thickness to be spread so far is not confirmed if we consider Figure 6 4 .

Conclusions
Both methods show mean distance estimation errors ranging between approximately 0.3 and 0.9 m for objects in distances between 1.2 and 1.7 m, with angular errors between 2°a nd 138°in azimuth, 1°and 7°in elevation. The Sonogram Estimation allows for analysis of room response in more detail, and the results are more accurate (i.e., average error) in three out of four observed cases, but inversely, the precision (i.e., error variance) of the Direct Intersection is higher in three of the cases. The Direct Intersection method allows for less expensive computation by reducing maximum radius spreading, while the Sonogram method's cost can be reduced effectively by limiting the vertical search interval, e.g., to the clutter free area above the desks. For a full-range sounding of the room, we observed that the locations close to the clutter area are estimated worse regarding both accuracy and precision. For a pragmatic operation on hardware with higher memory limitations the Direct Intersection method will perform faster and with similar precision and accuracy, and can be limited in execution time by restricting the sphere radius spreading at the cost of not being able to estimate the position for several intervals. We esteem further investigation into limiting the degradation of the estimation process by single unreliable channels as most promising for improving passive acoustic indoor localization.

Conflicts of Interest:
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: