Binaural Modelling and Spatial Auditory Cue Analysis of 3D-Printed Ears

In this work, a binaural model resembling the human auditory system was built using a pair of three-dimensional (3D)-printed ears to localize a sound source in both the vertical and horizontal directions. An analysis of the proposed model was first conducted to study the correlations between the spatial auditory cues and the 3D polar coordinates of the source. Apart from the estimation techniques via interaural and spectral cues, a property of the combined direct and reverberant energy decay curve is also introduced as part of the localization strategy. The preliminary analysis reveals that the latter provides a much more accurate distance estimation than approximations via the sound pressure level approach, but alone is not sufficient to disambiguate front-rear confusions. For vertical localization, it is also shown that the elevation angle can be robustly encoded through the spectral notches. By analysing the strengths and shortcomings of each estimation method, a new algorithm is formulated to localize the sound source, which is further improved by cross-correlating the interaural and spectral cues. The proposed technique has been validated via a series of experiments where the sound source was randomly placed at 30 different locations in an outdoor environment up to a distance of 19 m. Based on the experimental and numerical evaluations, the localization performance is significantly improved, with an average distance estimation error of 0.5 m and a considerable reduction of the total ambiguous points to 3.3%.


Background
In the fields of acoustics and robotics, a minimum of three microphones is required to triangulate a sound source in a two-dimensional (2D) space [1,2]. With only two omnidirectional microphones, the microphone fields would intersect at two points, implying that there could be two possible locations from which the sound could originate. Adding a third microphone reveals the unique position of the sound source by eliminating the other possibility. Despite only having two ears, i.e., binaural hearing, humans and animals are able to localize a sound source not only in a 2D space, but also in a three-dimensional (3D) space by analyzing different auditory cues. The brain deciphers the audio cues to predict the direction and distance of the sound source [3]. The job of the ears is to capture and send natural acoustic signals to the brain for processing. The shapes of the ear and head additionally play a role in localizing the sound source by reflecting and diffracting the sound to help the brain identify the direction [4]. The gaming and recording industries have begun using ear-shaped recording devices to make binaural recordings, giving a more natural hearing experience to listeners [5]. In gaming, binaural audio enables the player to identify where a sound source is coming from in the game, giving the listener a virtual sense of space [6].
Acoustic triangulation is based on the physical phenomenon that sound waves are longitudinal in the far field and, if the source is not exactly at the center of all microphones, there will be a time delay between the first microphone and the subsequent microphones [7]. In binaural hearing specifically, this is known as the Interaural Time Difference (ITD) [8]. The ITD is the difference in the arrival time of a sound between the two ears. It is crucial in the localization of sounds, as it provides a cue to the direction or angle of the sound source relative to the head. The brain registers the time lag and informs the listener of the direction of the sound [9]. ITD analysis is one of the techniques used to predict the angle of arrival of the sound source with respect to the receiver on the azimuth plane.
The Interaural Level Difference (ILD) is another spatial auditory cue that helps a human localize a sound source. The ILD is defined as the difference in amplitude between the two ears [10]. When a sound source is closer to one ear, the sound level will be louder in that ear than in the other, as sound is attenuated by distance and also by the head. The direction of the sound source can be localized by comparing the level difference between the two ears. The ITD is primarily used for low-frequency localization under 800 Hz. Head shadow effects increase with frequency, and therefore loudness differences are the primary horizontal localization cues for frequencies above 1500 Hz [11]. In many applications, ILD and ITD are used in tandem for a more accurate position estimate on the horizontal plane.
Vertical localization is essential if one is to estimate the position of a sound source in a 3D space, and for binaural auditory systems, this can only be realized with the existence of the ears. The shape of the ears is such that the amplitude and frequency responses change depending on where the sound source is located relative to the head. The pinna, which is the outer part of the ear, acts as a filter that attenuates certain frequency ranges and plays a major role in helping the human auditory system localize the angle and distance of the sound source [12]. Since the shape of the pinna is very complex and asymmetrical, different pinna resonances become active both vertically and horizontally, depending on the location of the source. These resonances add direction-specific patterns to the frequency response of the ears, which are then recognized by the auditory system for direction localization [13].

Related Work, Motivation and Contributions
In a binaural system, there exists a set of points that are equidistant from the left and right ears, which results in nearly identical ILD values and creates a zone called the cone of confusion. This sound ambiguity, typically referred to as "front-rear" confusion, commonly occurs when localizing a sound source with binaural audition, where it is difficult to determine whether the source is behind or in front of the receiver [14]. One way to resolve this is by introducing dummy heads, such as the Head and Torso Simulator (HATS), with a microphone placed inside each ear canal to allow for the creation of various acoustic scenes [6]. Via this strategy, several researchers proposed Head-Related Transfer Function (HRTF) estimations, where the transfer functions based on the left and right ears were preliminarily analyzed and constructed before the binaural model was reproduced [15]. However, some HRTF-based techniques have been experimentally outperformed by another approach in [16], which used artificial pinnae made of silicone to remove the ambiguity by comparing the mean intensity of the sound source signal in a specific frequency range against a threshold value.
For binaural localization in targeted rooms, statistical relationships between sound signals and room transfer functions can be analyzed prior to real-time location estimation, such as in the work presented in [17]. The accuracy can be further enhanced by jointly estimating the azimuth and the distance of binaural signals using artificial neural networks [18,19]. Another approach utilizing the room's reverberation properties has been proposed in [20], where reverberation weighting is used to separately attenuate the early and late reverberations while preserving the interaural cues. This allows the direct-to-reverberant energy ratio (DRR) to be calculated, which contains the information for performing absolute distance measurement [21].
The interest in most of the aforementioned work nonetheless gravitates toward sound source localization and disambiguation on the horizontal or azimuth plane, with greater focus on indoor environments. In order to estimate the vertical direction of the sound source, spectral cue analysis is required, as the direction heavily depends on the spectral composition of the sound. The shoulders, head, and especially the pinnae act as filters interfering with incident sound waves through reflection, diffraction, and absorption [22,23]. Careful selection of the head dampening factor, the materials for the ears/pinnae, and the location of the microphones is equally important for a realistic distortion of the sounds [24]. In [25], artificial human ears made from silicone were used to capture binaural and spectral cues for localization on both the azimuth and elevation planes. To enhance the localization and disambiguation performance while retaining the binaural hearing technique and structure, a number of recent works have proposed using active ears, inspired by animals such as bats that are able to change the shape of their pinnae [26][27][28]. In this regard, the ears act as actuators that can induce dynamic binaural cues for a better prediction.
While reproducing the auditory model of binaural hearing may be a challenging problem, the past decade has seen a renewed interest in binaural approaches to sound localization, which have been applied in a wide area of research and development, including rescue and surveillance robots, animal acoustics, as well as human-robot interactions [29][30][31][32][33][34][35]. Unique or predetermined sound sources, for instance, can be embedded in search and rescue robots for ad-hoc localization in hazardous or cluttered environments, as well as for emergency signals in remote or unknown areas [36,37]. This approach is particularly useful when vision-based searching is occluded by obstacles that still allow sound to pass through [38].
Inspired by the intricacies of human ears and how they can benefit a plethora of applications if successfully reproduced, this research aims to build a binaural model that is similar to the human auditory system for both vertical and horizontal localization using a pair of ears 3D-printed out of Polylactic Acid (PLA). Unlike the silicones mostly used by past researchers for binaural modelling [39], PLA is generally more rigid and a more common material in domestic 3D printers. Using 3D-printed PLA ears will also allow for cheaper and quicker replication of this work in future studies. The ears, anatomically modeled after an average human ear, were additionally mounted on a Styrofoam mannequin head to approximate the shape and size of an average human. The purpose is to build a binaural recording system similar to the HATS that is able to capture all the different auditory cues, and for the system to have a head shadow that changes the spectral components. The HATS replica may not be as good as the actual simulator, but it provides a cheap and quick alternative for simple measurements.
In this work, an analysis of the proposed model was first conducted to study the correlations between the spatial auditory cues and the 3D polar coordinates (i.e., distance, azimuth, and elevation angles) of the targeted sound source. Apart from the techniques via interaural and spectral cues, the time for the sound pressure level (SPL) resulting from the combined direct and reverberant intensities to decay by 60 dB (hereafter denoted as DRT 60 ) is also introduced as part of the localization strategy. The preliminary analysis reveals that the latter provides a much more accurate distance estimation than the SPL approach, but alone is not sufficient to disambiguate the front-rear confusions. For vertical localization, it is also shown that the elevation angle can be robustly encoded through the spectral notches. By analysing the strengths and shortcomings of each estimation method, a new algorithm is formulated to localize the sound source, which is further improved with induced secondary cues via cross-correlation between the interaural and spectral cues. The contributions of this paper can thus be summarized as follows: (a) auditory cue analysis of ears 3D-printed out of cheap off-the-shelf materials, which remains underexplored; and (b) a computationally less taxing binaural localization strategy with DRT 60 and an improved disambiguation mechanism (via the induced secondary cues). This work is motivated by recent studies on binaural localization for indoor environments that utilized ITD in a 3D space [40], both ILD and ITD in a 2D space [41], and DRR for distance estimation of up to 3 m [42]. The aforementioned literature, however, did not use pinnae for front-rear disambiguation, hence requiring either a servo system to integrate rotational and translational movements of the receiver, or other algorithms to solve for unobservable states in the front-rear confusion areas until the source can be correctly localized.
Nevertheless, instead of targeting indoor environments with additional control systems to reduce the estimation errors, the focus of this work is on an outdoor environment with a relatively larger 3D space. As both the DRR and the reverberation time (RT) change with the distance between the source and the receiver [43], particularly in outdoor spaces [44,45], this property has been exploited and correlated with the distance to further improve the estimation accuracy. The proposed technique has been validated via a series of experiments where the sound source was randomly placed at 30 different locations, with source-receiver distances of up to 19 m.

Sound Ambiguity
Human binaural hearing is able to approximate the location of a sound source in a spherical or 3D coordinate system (i.e., on the azimuth and elevation planes). This is achieved by the shape and position of the ears on the head, and the auditory cues interpreted by the brain. As illustrated in Figure 1, the azimuth angle is represented by θ, while the elevation angle is represented by φ. Moving the sound source from left to right changes θ, and moving it up and down changes φ; each varies from 0° to 360°. With only two microphones in a binaural system, there will be localization ambiguity on both the azimuth and elevation planes. With regard to the azimuth plane, for every measurement there will be an ambiguous data point located at the mirrored position along the interaural axis where the two microphones are placed, as illustrated in Figure 2a.
Localizing the sound source on the elevation plane is relatively more difficult, as there will be an infinite number of ambiguous positions, as depicted in Figure 2b. This paper looks into finding the actual sound location by using auditory cues. When the distance between a sound source and the microphones is significantly greater than the distance between the microphones, we can consider the sound a plane wave and the sound incidence reaching each microphone as parallel. For the elevation angle φ, when the sound source is located at θ = 0°, the ITD and ILD for each ear would theoretically be the same. For an omnidirectional microphone, this would be impossible to solve, as there is an infinite number of possibilities for where the actual sound source could be located, there being no difference in the measured values. Nevertheless, with the addition of the head and ears, sound is reflected and attenuated differently as the source is moved around the head. Attenuation happens in both the time and frequency domains for different frequency ranges. This work aims to localize a sound source on the azimuth and elevation planes while retaining only the actual location by removing the ambiguous points. The proposed method is based on the analysis of different auditory cues and the characterization of their properties in order to estimate the location of the sound source relative to the receiver.

Materials, Methods and Analysis
The experimental setup in this work follows the HATS, where the geometry is the same as that of an average adult. The ear model, 3D-printed out of PLA, was scaled to the dimensions shown in Figure 3a to fit the application (for the pinna shape, we referred to https://pubmed.ncbi.nlm.nih.gov/18835852/, which provides a database containing 3D files of human parts for biomedical research). A microphone slot for each side of the head model was also designed, as depicted in Figure 3a,b, and the two microphones were connected to a two-channel sound card (Focusrite Scarlett 2i2) for simultaneous recording (Figure 3c). The hardware consists of two parts, namely the microphone bias and the sound card, as shown in Figure 3d. The gain of each channel was adjusted to balance the left and right ears. The ears were also polished with a solvent to smooth out the plastic parts, and treated with the fumes from the solvent to smooth out the internal parts. A mechanical mesh was then placed on top of each microphone when assembling the 3D-printed ear to act as a filter. For printing, a DreamMaker Overlord 3D printer was used. Details on the printing parameters are listed in Table 1 (the printer is available at https://www.dfrobot.com/product-1299.html, while the STL file is available from the Supplementary Materials). The total cost for this setup is approximately USD 175 (i.e., USD 11.40 for the 3D-printed ears, USD 2.89 for the Styrofoam head, USD 156.50 for the sound card, and USD 4.20 for the bias circuit).

Figure 4 shows the binaural processing chain within the device under test (DUT) used to localize the sound source. The first stage is the data acquisition from the left and right microphone inputs. To ensure no sound leakage, each microphone is sealed to the printed ear with silicone. The next stage is the amplifier stage, which biases the signal to 1.5 V.
The microphones used are rated to 3.0 V, so a standard lithium battery was enough to prevent the microphones from saturating at the reference point. Following the amplifier (after the analog-to-digital converter (ADC)) is the filtering stage, which consists of a bandpass filter with a cut-off frequency f c = 3.5 kHz and a bandwidth BW = 1.2 kHz to attenuate the effects of environmental noise. The sound source considered has a frequency range from 2.8 kHz to 4.0 kHz, so frequencies beyond this range can be filtered out.
The following block is the analysis of the auditory cues, split into four categories: spectral cues (SC), DRT 60 (explained further in Section 3.2), SPL, and ITD. A Fast Fourier Transform (FFT) is performed on the filtered signal in order to find the spectral components of the audio source in the frequency domain, which are used for the SC. DRT 60 , SPL, and ITD, on the other hand, are measured in the time domain. The ITD and SC are essential for the azimuth and elevation angle estimations, respectively. In order to estimate the Euclidean distance between the center of the DUT and the sound source, both DRT 60 and SPL are used. Ambiguous data points are then filtered before the sound source is localized.

Before the actual experiment was conducted, a preliminary analysis was performed to ensure its feasibility. To observe the SPL and frequency responses with respect to the azimuth and elevation angles, a sound source was placed at d = 110 cm from the receiver and positioned on a rotating jig, as depicted in Figure 5. A Bluetooth speaker was used to play a recording of the sound source intended for the actual experiment. Figure 6 shows how θ and φ are measured on their respective planes. During the pilot testing (a controlled environment with a noise floor below −60 dB was selected), the sound source was rotated along the azimuth and elevation planes at a step of 15°. The jig was used to ensure consistent angle increments and to keep the sound source at a fixed distance. In this test, both left and right audio were captured simultaneously, and each test was repeated three times to analyze the consistency of the measurement setup. Figure 7 shows the polar plots of the SPL measured at the left and right ears for the three trials on both the azimuth and elevation planes. For every instance of azimuth and elevation angle, the FFT was applied to the signal and the peak at each desired frequency point was measured.
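As a sketch of this front end, the band-pass filtering and FFT stages can be approximated in a few lines. This is a minimal illustration, not the authors' implementation: the filter order and the use of a Butterworth design from SciPy are assumptions, while the center frequency, bandwidth, and sampling rate follow the values stated in the text.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 44_100    # sampling rate used in this work (samples per second)
FC = 3_500.0   # band-pass centre frequency (Hz)
BW = 1_200.0   # band-pass bandwidth (Hz)

def bandpass(signal, fs=FS, fc=FC, bw=BW, order=4):
    """Attenuate components outside the source's 2.8-4.0 kHz range."""
    sos = butter(order, [fc - bw / 2, fc + bw / 2],
                 btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

def spectrum_peaks(signal, fs=FS):
    """One-sided magnitude spectrum of the filtered signal for the SC analysis."""
    mag = np.abs(np.fft.rfft(signal)) / len(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    return freqs, mag
```

Feeding the filtered signal to `spectrum_peaks` yields the frequency/magnitude pairs from which the peaks at the desired frequency points can be read off.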
Figure 8 illustrates the frequency responses of the spectral components of the sound source. Figure 9 illustrates the variations of the SPL and the frequency response in a 3D Cartesian plane, where x0, y0, and z0 correspond to d cos θ, d sin θ, and d sin φ, respectively. Based on the SPL response, it is observed that the variations of the amplitude are relatively much smaller on the azimuth plane (i.e., y vs. x) than on the elevation plane (i.e., z vs. y). With regard to the frequency response, it can be seen that the amplitude on the azimuth and elevation planes changes significantly enough to be distinguishable from other coordinates. This is due to the shape of the ears, as well as reflections around the head, which induce notches into the spectrum. This signifies the suitability of these cues as part of the techniques for horizontal localization, and of the notches in the frequency response for vertical localization. The following sections will describe in greater detail how these properties, along with ITD, DRT 60 , and SC, will be exploited in order to localize the sound source.

Interaural Time Difference (ITD)
In order to estimate the direction of the sound source, the angle of the incident wave with respect to the DUT, also known as the angle of arrival (AoA), needs to be found. This is done by comparing the delay between the sound signals of the two microphones, which is termed the ITD in the context of binaural hearing. To this purpose, let the ITD be written as τ d = |t R − t L |, where t R and t L refer to the times of arrival of the sound at the right and left microphones, respectively. Let D denote the distance between the two microphones, which is 0.20 m, and ν s the speed of sound, i.e., 343 m/s. From the illustration shown in Figure 10, it can be intuitively seen that the wave front will arrive at Mic L later than it does at Mic R. The AoA, β, as seen by Mic L relative to Mic R can be calculated using Equation (1) below.
β = sin −1 (ν s τ d /D) (1)

where D = 0.20 m is the distance between the two microphones. To quantify the phase shift, cross-correlation was applied in order to measure the ITD between the two signals. The cross-correlation in Equation (2) is used to calculate the ITD, where N = 44,100 refers to the total number of observations, m a (i) is the signal received by Mic R, and m b (i) is the signal received by Mic L. The notations m a and m b denote the means of m a (i) and m b (i), respectively. The cross-correlation coefficient R ab at lag k can then be calculated as follows:

R ab (k) = ∑ i (m a (i) − m a )(m b (i − k) − m b ) / √(∑ i (m a (i) − m a ) 2 ∑ i (m b (i) − m b ) 2 ) (2)

which returns a value ranging from −1 to 1. Audio was captured at a sampling rate, f s , of 44,100 samples per second. The lag at which the cross-correlation coefficient peaks denotes how many samples apart the two waveforms are. It is worth noting that the ILD can also be used as a means of measuring the AoA by comparing the ratio of attenuation between the two ears. The amount of attenuation and the left-to-right ratio were characterized by placing the sound source at θ = 90° and θ = 270°. The ILD is able to capture the AoA by comparing the attenuation, but it is not as accurate as using the ITD. As an example, when the audio source is closer to the left ear at θ = 45°, the amplitude is higher on the left than on the right, and vice versa. When the sound source is at θ = 0°, the amplitudes are at roughly the same level. Estimating the angle from the ILD is inaccurate and unreliable compared to the cross-correlation of the ITD: many factors affect the attenuation of sound, such as the environment, the distance from the sound source, and reflections, which can make the estimate temperamental. Since cross-correlation looks at the similarity of the audio signals between left and right, it is more robust and less susceptible to interference. In this work, the cross-correlation-based ITD was more consistent at determining the AoA than the attenuation-ratio method based on the ILD.
From the testing, the ILD estimation method using the attenuation ratio has an error of ±20°, while the ITD has an error of ±10°. Although the ILD is not directly used in the estimation of the angle in this study, the SPL at each ear is instrumental for distance estimation and front-rear disambiguation. The subsequent sections will present the analyses of DRT 60 and SPL, along with the proposed methods for estimating the distance and direction of the sound source.
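The ITD-based AoA estimation described above can be sketched as follows. This is a minimal illustration with hypothetical helper names: the lag is taken at the peak of the normalized cross-correlation, and the angle follows the arcsine mapping of Equation (1) with D = 0.20 m and ν s = 343 m/s.

```python
import numpy as np

FS = 44_100   # sampling rate (samples per second)
D = 0.20      # distance between the two microphones (m)
V_S = 343.0   # speed of sound (m/s)

def itd_lag(mic_r, mic_l):
    """Sample lag between the channels at the peak of the normalized
    cross-correlation (a coefficient in [-1, 1])."""
    a = mic_r - mic_r.mean()
    b = mic_l - mic_l.mean()
    corr = np.correlate(a, b, mode="full")
    corr /= np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
    return np.argmax(corr) - (len(b) - 1)

def angle_of_arrival(mic_r, mic_l, fs=FS):
    """AoA in degrees via beta = arcsin(v_s * tau_d / D)."""
    tau_d = abs(itd_lag(mic_r, mic_l)) / fs
    return np.degrees(np.arcsin(np.clip(V_S * tau_d / D, 0.0, 1.0)))
```

The `np.clip` guards against lags slightly exceeding the physically possible maximum of D/ν s , which would otherwise push the arcsine out of its domain.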

Direct and Reverberant Energy Fields
While the RT is predicted to be constant by Sabine's equation in many enclosed acoustical environments, it has been shown in [43] that it can vary with the distance between the sound source and the receiver under certain circumstances, thus contributing to the variation of the DRR with distance. The dependency of the RT on distance is also more prominent in outdoor spaces, as reported in [44,45]. As a consequence, the SPL measured at the receiver is usually a combination of energies from both the direct and reverberant fields, which is consistent with the theoretical conclusion in [21]. Hence, depending on the application, considering the combined pressure level is relatively more practical, given the observed dynamics of both the DRR and the RT in past studies.
In this work, a car honk was used as the targeted sound source, as it creates distinctive acoustic characteristics that are suitable for outdoor spaces. The impulse-to-noise ratio (INR) for this sound is above 44.2 dB, which, according to ISO 3382-2, is sufficient for accurate RT measurement in outdoor spaces within a 50 m range [45]. Its unique identity was represented by its frequency components, where the range varied from 2.9 kHz to 4.0 kHz with peaks at every 200 Hz interval. In this analysis, where the setup was outdoors, the sound source was initially placed at the front of the DUT on the azimuth plane (i.e., θ = 0°, φ = 0°), and data were captured with the source located at varying distances ranging from 1 m to 19 m. Figure 11a shows the time response of the measured sound amplitude after the source was abruptly switched off at different distances. In order to calculate the DRT 60 , which refers to the time for the combined direct and reverberant energy level to decay by 60 dB, the perceived signal was first band-passed to the desired frequency range of 2-4 kHz. Considering E(t) = ∫ t ∞ h 2 (τ)dτ as the energy decay curve from time t, where h(t) is the impulse response from the band-passed signal, a linear regression was performed to estimate the slope, S, between the −5 dB and −25 dB levels (similar to RT estimation via the T20 method: https://www.acoustics-engineering.com/files/TN007.pdf). The DRT 60 can then be estimated as −60/S. The corresponding DRT 60 against distance is depicted in Figure 11b, which shows the average DRT 60 of five trials along with the error bars. In order to analyze the variation of DRT 60 further, the same test was conducted with θ varied from 0° to 360° at a step of 45°. Figure 12 shows the DRT 60 against the azimuth angle. The measured DRT 60 , however, did not reveal any distinctive trend, showing only small deviations at different angles.
Comparing Figures 11 and 12, it can be concluded that the DRT 60 value changes most significantly against distance, and the variation against θ is negligibly small. The next section will explain how the DRT 60 response will be used along with the SPL in order to estimate the distance and treat the ambiguity issue.
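The DRT 60 computation described above (backward energy integration followed by a T20-style regression between −5 dB and −25 dB, extrapolated to a 60 dB decay) can be sketched as follows; a minimal illustration under the stated assumptions, not the authors' code.

```python
import numpy as np

def drt60(h, fs=44_100):
    """DRT60 in seconds from a band-passed impulse-like response h(t):
    backward-integrate the energy, fit a line to the decay curve between
    -5 dB and -25 dB, and extrapolate the slope S to 60 dB as -60/S."""
    energy = np.cumsum(h[::-1] ** 2)[::-1]   # E(t) = integral_t^inf h^2(tau) dtau
    db = 10.0 * np.log10(energy / energy.max())
    t = np.arange(len(h)) / fs
    mask = (db <= -5.0) & (db >= -25.0)      # T20-style evaluation range
    slope, _ = np.polyfit(t[mask], db[mask], 1)   # S in dB per second
    return -60.0 / slope
```

The backward cumulative sum is the discrete counterpart of the Schroeder integral E(t), so the fitted slope is taken on the smoothed decay curve rather than on the raw squared signal.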

Ambiguity Elimination and Distance Estimation
Apart from the DRT 60 test, another test to investigate the variation of the SPL was also conducted. Figure 13 shows the variations of the average SPL from both ears against distance when the sound source was located at the front (blue line) and back (orange line) positions. The average amplitude and error bars are represented by the curve and vertical lines, respectively. Theoretically, the sound intensity changes with distance following the inverse square law, as represented by the yellow line in the figure. A large difference can be seen between the theoretical and measured SPL curves due to the existence of the ears and head, as well as environmental effects. The amplitude attenuation is also relatively higher when the sound source is located at the back of the head than at the front. Based on the SPL measurements, a correlation between the average SPL and the distance, parameterized by the coefficients (p j , q j , r j ) for j ∈ { f , b}, can be derived as Equation (3), where α f and α b represent the average SPL for the front and back positions, respectively. Via curve fitting techniques, one obtains (p f , q f , r f ) = (−5.2, 0.4689, 7.085) and (p b , q b , r b ) = (−7.7, 0.4599, 19.989). It is worth noting that the sound amplitude alone is insufficient to determine both the distance and the direction. To treat this issue, the DRT 60 is used together with the SPL attenuation to eliminate the ambiguity of the sound's location, since the DRT 60 value is relatively consistent for all values of θ and φ. Via regression, Equation (4), a quadratic fit which provides a lower mean squared error than other polynomials, can be derived:

τ R = p R d 2 + q R d + r R (4)

with (p R , q R , r R ) = (0.01693, 8.3494, 204.1312), which represents the correlation between the DRT 60 (denoted by τ R , in milliseconds) and the distance d.
Hence, the inverse function of Equation (4) can be attained as follows:

d R = (−q R + √(q R 2 − 4p R (r R − τ R )))/(2p R ) (5)

which returns the distance estimated from the value of τ R measured from the received signal. Likewise, the estimated distances based on the SPL measurements can be obtained in a similar manner from Equation (3), which leads to Equation (6), where α is the SPL, and d f and d b are the predicted distance values for the front and back locations. In order to eliminate the sound source ambiguity, two parameters need to be observed: the first is the difference between d j and d R , and the second is the elevation angle φ (the method to estimate this is presented in Section 3.4). For the first, the values of d b and d f are compared against the value of d R ; the one with the closer value returns the estimated distance and direction based on the SPL, denoted by d α , and the other is the ambiguity to be eliminated. With regard to the second parameter, two sets of angles are first defined in Equation (7), where Ω f and Ω b refer to the yellow and blue areas in Figure 6b, respectively. The ambiguity checker can then be written as Equation (8), where η = 1 and η = 0 indicate whether the sound source is located at the front or back position with respect to the DUT, respectively. As there are now two methods for estimating the value of d (i.e., via DRT 60 and via SPL), the following combination is proposed:

d̂ = νd α + (1 − ν)d R (9)

with ν representing a weighting parameter that varies between 0 and 1. To find the optimal value of ν, a further analysis was conducted based on 16 datasets, as presented in Table A1 (in Appendix A), where half of them refer to the case when φ ∈ Ω f , while the other half refer to the case when φ ∈ Ω b . In this analysis, the distance between the DUT and the source varied between 6 m and 19 m. The cumulative distance error, which reads

E cum = ∑ k e k ; e k = |d − d̂| (10)

with d being the actual distance, is considered. Figures 14 and 15 show the corresponding plots when φ ∈ Ω f and when φ ∈ Ω b , respectively.
By observing the value of ν at which E_cum is minimum, it is found that the distance error is minimized with ν = 0.37 when d_α = d_f, and ν = 0 when d_α = d_b. The latter indicates that the distance estimated from DRT60 is generally much closer to the actual value when the sound source is located at the back of the DUT; thus, only d_R is considered in this scenario. Combining Equations (8) and (9) with the solutions from Figures 14 and 15, the distance estimation along with the ambiguity elimination can be further derived, as follows:
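A minimal sketch of the resulting combined estimator, assuming the closeness test against d_R and the fitted weights ν = 0.37 (front) and ν = 0 (back) described above:

```python
def estimate_distance(d_f, d_b, d_R):
    """Combined distance estimator (sketch): choose the SPL-based estimate
    (front or back) closer to the DRT60-based estimate d_R, then apply the
    weighting found from Figures 14 and 15 (nu = 0.37 front, nu = 0 back)."""
    if abs(d_f - d_R) <= abs(d_b - d_R):
        d_alpha, eta, nu = d_f, 1, 0.37   # front hypothesis retained
    else:
        d_alpha, eta, nu = d_b, 0, 0.0    # back hypothesis: rely on DRT60 only
    d_hat = nu * d_alpha + (1.0 - nu) * d_R
    return d_hat, eta
```

The returned flag corresponds to the ambiguity checker η: the back case degenerates to d̂ = d_R, as found in the analysis above.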

Spectral Cues (SC)
The cues to sound location that come from sound frequency are called spectral cues; they derive from the acoustical filtering performed by an individual's auditory periphery. Since the angle and distance on the azimuth plane can be calculated using ITD, SPL, and DRT60, but the elevation angle φ cannot, the spectral cues are vital in determining the elevation of the sound source. The ambiguous data points in the cone of confusion can be reduced using mathematical estimation; this work addresses the cone of confusion by characterizing the attenuation of different frequency components against φ. Figure 16 depicts the amplitude (A_p) at each peak frequency, f_p, when the sound source was placed at φ = 0° (blue line), φ = 90° (orange line), φ = 180° (yellow line), and φ = 270° (purple line). The data were also captured at three different distances: d = 6 m (a), d = 13 m (b), and d = 19 m (c).
In order to characterize the amplitude response against φ at each peak frequency, a linear regression was performed based on the average values of A_p in Figure 16, which led to the piecewise statement in Equation (13), defined via Equation (14) for peak frequencies up to f_p = 3.9 kHz and undefined otherwise, where α_0 ∈ R⁻ is the amplitude (in dBFS) of the received signal, and α_i, c_i ∈ R⁺ and b_i ∈ R⁻ are coefficients that depend on α_0. By measuring A_p and α_0 from the spectral components of the incoming signal, the angle φ_i can then be calculated by solving the inverse function of Equation (14), which reduces to Equation (15). In order to obtain the estimated φ when the sound source is placed at a particular location, the calculated angle is averaged over all peak frequencies. Figure 17 shows the results of a simple test in which the source was placed at φ = (0°, 45°, 90°, 135°, 180°). The left plot corresponds to the case where the source was 6 m away from the DUT, and the right plot to the case where it was 13 m away. From the test, it was found that the magnitude of the error only varied between 1.34° and 6.22°, which can be considered small, as the average error is less than 3.5%. Thus, the close relationship between the spectral cues and the elevation angle allows the vertical direction of the source to be robustly localized.
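A hypothetical sketch of the inversion-and-averaging step follows. The linear models and their coefficients below are invented for illustration; the paper's b_i and c_i depend on α_0 and are not reproduced here:

```python
import numpy as np

# Hypothetical per-peak-frequency linear models A_p = b_i * phi + c_i
# (illustrative coefficients only; in the paper they also depend on alpha_0).
MODELS = {
    1300.0: (-0.05, -20.0),   # f_p (Hz): (b_i, c_i); b_i < 0 in dBFS per degree
    2600.0: (-0.08, -18.0),
    3900.0: (-0.06, -22.0),
}

def estimate_elevation(peaks):
    """Invert each linear model, phi_i = (A_p - c_i) / b_i, and average the
    per-frequency angles over all peak frequencies, as described in the text."""
    phis = [(a_p - MODELS[f_p][1]) / MODELS[f_p][0] for f_p, a_p in peaks.items()]
    return float(np.mean(phis))
```

The averaging over peak frequencies is what smooths out the per-frequency estimation noise that a single spectral notch would leave.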

Binaural Localization Strategy
In summary, the direction of the sound source on the azimuth plane can be calculated using the ITD cue via cross-correlation of the incident signals. The resulting AoA can then be used to estimate the value of θ. To predict the actual distance of the source from the DUT, the properties of the SPL cues can be exploited. Nevertheless, due to the structure of the head and the 3D-printed ears, estimation via SPL alone is not sufficient; thus, estimation via the DRT60 auditory cue, which varies less with angle, is needed together with the weighting parameter derived in the preceding section to remove ambiguous data points. With regard to the elevation angle, the spectral cues will be exploited by finding the amplitudes and peak frequencies from the signal's spectral components.
To improve the performance during real-time experiments, induced secondary cues are introduced based on the estimated distance and elevation angle, represented by η and µ, respectively. Specifically, η = 1 when the sound source is estimated to be at the front side of the DUT (based on the SPL), and µ = 1 when the estimated φ is within Ω_f. Hence, the combined parameter will be unity when both η and µ are one, which corresponds to Equation (12). This constitutes the first stage of the ambiguity elimination technique. To further treat the front-rear confusion on the resulting azimuth angle, the values of η and µ are cross-checked at the second stage; i.e., if η = 0 and µ = 1, then the sound source is expected to be at the mirrored position along the interaural axis (i.e., the front side). This was formulated based on the idea that the prediction based on µ is more accurate, owing to the small position errors presented in Section 3.4. However, an exception is imposed for the border case: estimated angles within the margin areas, i.e., (85°, 95°) and (265°, 275°), remain unchanged. The whole procedure for binaural localization with ambiguity elimination, partitioned into two stages, is summarized in Algorithm 1. For clarity, θ̂, φ̂, and d̂ will be used to denote the estimated values of θ, φ, and d, respectively.
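The second-stage cross-check can be sketched as follows; the mirroring convention (θ ↦ 180° − θ about the interaural axis) is an assumption for illustration, not necessarily the paper's exact coordinate convention:

```python
def resolve_front_rear(theta0, eta, mu):
    """Stage-2 cross-check (sketch): if the SPL-based cue (eta) and the
    elevation-based cue (mu) disagree, trust mu and mirror the azimuth
    about the interaural axis, unless theta0 lies in the border margins."""
    in_margin = 85.0 < theta0 < 95.0 or 265.0 < theta0 < 275.0
    if eta != mu and not in_margin:
        return (180.0 - theta0) % 360.0   # mirrored position (assumed convention)
    return theta0
```

When η and µ agree, or when the estimate sits inside the (85°, 95°) or (265°, 275°) margins, the Stage-1 azimuth is kept unchanged.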

Experiments and Performance Evaluations
This section presents the results from real-time experiments in which the sound source was placed at 30 different locations in 3D space. The tests were conducted in a car park area with the model placed on the road, as shown in Figure A1 (in Appendix A); the road has existing linear markers that allow for accurate distance and direction measurements. Three different distances, i.e., d = 6 m, d = 13 m, and d = 19 m, with various sets of θ and φ, were randomly selected for the performance evaluations. Without loss of generality, measurements for θ and φ were taken by rotating the receiver instead of the sound source, as this was relatively easier to control.
The values of d̂, θ̂, and φ̂ when Algorithm 1 was applied are presented in Table 2, partitioned according to the values of d. All of the captured data, including the secondary cues η and µ that were used for ambiguity elimination, can be found in Table A3 in Appendix A. For clarity, the variable k is used to represent the experiment number for each distance considered. Figure 18 shows the estimated and actual locations of the sound source with respect to the DUT in a 3D Cartesian plane, also plotted according to the values of d, i.e., (a) d = 6 m, (b) d = 13 m, and (c) d = 19 m. The actual coordinates are represented by the colored circles, while the corresponding predicted coordinates are represented by the diamonds of the same color. The numbers next to the circles denote the values of k from Table 2. By observing the plots, all of the coordinates considered were correctly localized with small position errors, except for k = 6 in (a). This was caused by the value of η, which was supposed to be 1 instead of 0; hence, the estimated azimuth was interpreted at the mirrored position of the captured angle, which explains the large difference. Nevertheless, when comparing with the results without the application of Algorithm 1 from Table 3 (complete individual data in Table A3), we can see that the total number of ambiguous points (AP) is 9. This demonstrates that the proposed method has significantly reduced the total number of AP. In order to evaluate the localization performance, the following errors are defined:

e = actual value − estimated value,

which calculates the deviation of the estimated from the actual values (for θ, φ, and d), and

E_av = (1/N) Σ_k |e_k|,

which is the average of the absolute errors over the N experiments. Figure 19 shows the plot of e(d) (represented by the blue line), which is also compared against the corresponding errors when d_b, d_f, and d_R are used as the estimated distance.
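A minimal sketch of the error metrics defined above, assuming e is the elementwise deviation and E_av averages |e| over the experiments:

```python
import numpy as np

def deviation(actual, estimated):
    """e = actual - estimated, applied elementwise (for theta, phi, or d)."""
    return np.asarray(actual, dtype=float) - np.asarray(estimated, dtype=float)

def average_abs_error(actual, estimated):
    """E_av: mean of |e| over all experiments at a given distance."""
    return float(np.mean(np.abs(deviation(actual, estimated))))
```

Applying `average_abs_error` separately to the d = 6 m, 13 m, and 19 m groups and averaging the three results reproduces the overall E_av used in Table 3.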
From the plot, it is observed that the proposed method successfully keeps the error at a minimum for all experiments when compared to the other three methods. With regard to the accuracy of the estimated angles, Figure 20, which shows the plots of e(θ) and e(φ), also includes a comparison against the error before the azimuth angle was amended in Stage 2 of Algorithm 1, i.e., θ_0. The large peaks in the orange plots correspond to the results from the ambiguous data points where the mirrored positions of the source were not corrected using the secondary cues of the proposed method. Otherwise, it is observed that e(φ) is consistently close to zero for all experiments, which is also a contributing factor to the success of the ambiguity elimination technique. The overall average errors from both figures are summarized in Table 3, where E_av = (E_av,d=6 + E_av,d=13 + E_av,d=19)/3. From the data presented, the proposed method has significantly improved the performance by reducing the errors in the distance and angle estimations. It is also worth noting that, without the DRT60 and SC measurements as well as the secondary cues, the estimated sound source locations on the azimuth plane would be 100% ambiguous. In particular, with only Stage 1 of Algorithm 1, which relies heavily on the ITD method (refer to θ_0 in Table 3), the total AP was reduced to 30%; when combined with Stage 2 (refer to θ̂ in Table 3), the total AP was considerably reduced to 3.3%. Table 3 also shows that, due to the large number of AP from θ_0, the average error E_av is approximately 28.3°, which is significantly higher than when the complete Algorithm 1 is applied, which gives an average error of only 9.6°.

Discussion
The results presented in Table 3 demonstrate significant improvements in the distance and angle estimations, thus showing that using PLA-based 3D-printed ears is practical, particularly for front-rear disambiguation in outdoor environments. While this might work in several other environments, modifications to the strategy may be needed if there is a sudden or drastic change in the acoustic scene. Thus, to detect as well as identify such changes, machine learning can be used, and the resulting mechanism can be embedded into the system so as to ensure the proposed strategy adapts to the changes. Apart from that, as the reverberation properties of outdoor spaces can be modeled according to the sound source frequency as well as the nature of the spaces, the DRT60-based distance estimation technique in Section 3.2 can always be tuned to make it applicable to other environments.

Conclusions and Future Work
This paper contributes its findings to binaural localization using auditory cues. Instead of using a HATS (which costs approximately USD 20k to purchase, or USD 120 for a daily rental) or an ear simulator, this work uses a pair of inexpensive PLA-based 3D-printed ears with mechanical acoustic dampers and filters covering the microphones. The analysis obtained in this work shows that it is feasible to use cheap 3D-printed materials to simulate an actual ear. Other benefits of using 3D-printed ears include the ability to quickly replicate this work and to modify the existing design to study how different shapes would affect the results.
From the conducted experiments, it has been demonstrated that the proposed strategy can considerably improve the binaural localization performance, with average errors of 0.5 m for distance, 9.6° for the azimuth angle, and 10.3° for the elevation angle, and, most importantly, a significant reduction of the total ambiguous points to 3.3%. The results also reveal that the proposed model and methodology can provide a promising framework for the further enhancement of binaural localization strategies.
Having dynamic cues, in addition to what this work has presented, can help enhance the accuracy, particularly when there is a drastic change in the acoustic scene or when the targeted sound source is moving. Tracking a moving source or multiple sources is significantly more complex, as Doppler effects come into play, and the spectral cues thus have to account for this phenomenon. Dynamic cues are useful for further improving how the receiver perceives sound by essentially gathering more sets of data. As discussed in Section 5, the method can be paired with advanced algorithms, such as deep learning, in future work to help improve the detection of acoustic cues in different situations.

Notations/Acronyms and Descriptions
θ, φ, d: actual azimuth angle, elevation angle, and Euclidean distance of the sound source from the DUT
θ̂, φ̂, d̂: estimated azimuth angle, elevation angle, and Euclidean distance of the sound source from the DUT
β: notation for AoA
θ_0: estimated azimuth angle before correction
µ, η: induced secondary cues (as described in Algorithm 1)
x, y, z: estimated coordinates of the sound source in 3D space
τ_d: notation for ITD
d_f, d_b: estimated distance based on the SPL regression curve when the sound source is in the front/back of the receiver
d_R: estimated distance based on DRT60
τ_R: notation for DRT60 (in milliseconds)
Ω_f, Ω_b: sets of elevation angles defined in Equation (7)
ν: weighting parameter for the estimated distance
e: deviation of the estimated from the actual values (for θ, φ, and d)
E_cum, E_av: cumulative error; average of absolute errors
R: field of real numbers
R⁺, R⁻: positive real numbers; negative real numbers

Appendix A
Table A1. Datasets for the analysis in Section 3.2.