Head-Related Transfer Functions for Dynamic Listeners in Virtual Reality

In dynamic virtual reality, visual cues and motor actions aid auditory perception. With multimodal integration and auditory adaptation effects, generic head-related transfer functions (HRTFs) may yield no significant disadvantage to individual HRTFs regarding accurate auditory perception. This study compares two individual HRTF sets against a generic HRTF set by way of objective analysis and two subjective experiments. First, auditory-model-based predictions examine the objective deviations in localization cues between the sets. Next, the HRTFs are compared in a static subjective (N=8) localization experiment. Finally, the localization accuracy, timbre, and overall quality of the HRTF sets are evaluated subjectively (N=12) in a six-degrees-of-freedom audio-visual virtual environment. The results show statistically significant objective deviations between the sets, but no perceived localization or overall quality differences in the dynamic virtual reality.


Introduction
Virtual sounds, rendered to headphones, often fail to localize in space or externalize from within the listener's head. Sound localization in a 3-dimensional space relies on acoustic cues described by the head-related transfer functions (HRTFs) [1]. Using the listener's own HRTFs results in the best localization performance in static listening, but measuring or modeling individual HRTFs remains a challenging process. With head tracking and multimodal environments, such as in virtual and augmented reality (VR, AR), generic, non-individual HRTFs may still enable accurate auditory perception comparable to individual sets. This paper presents an objective analysis and two subjective experiments to compare two different individual HRTF sets with a generic set in 6-degrees-of-freedom (6-DoF) VR, where rotational movement around the x, y, and z axes (pitch, yaw, and roll), and translational movement along the axes (surge, strafe, and elevation) are possible.
Static, non-head-tracked listening with generic HRTFs results, most evidently, in localization errors in elevation and in front-back or quadrant confusions when compared to free-field listening in an anechoic chamber or with individual HRTFs [2][3][4]. Similar findings of increased elevation and quadrant errors with generic HRTFs during static reproduction have also been recently reported [5]. Additionally, externalization may be degraded when listening with generic HRTFs compared to individual HRTFs, as noted in a recent review [6]. However, the results regarding modified spectral cues are mixed, and other cues, such as reverberation, movement, and vision, may be more critical factors for robust externalization.
Studies going beyond static scenarios have found a reduction in front-back confusions when head movements are allowed [7][8][9] both with and without headphones. Significant decreases in elevation errors have also been identified when moderate head motions are Hypothesis 2. The individual HRTF sets are objectively similar to each other on average.

Hypothesis 3. Static localization accuracy is higher with individual HRTFs compared to generic HRTFs.
Hypothesis 4. The HRTF sets are equally preferred in localization, timbre, and overall quality in 6-DoF VR.

HRTF Sets
Three different HRTF sets were obtained for the experiment: a measured, modeled, and a generic set. The measured and modeled HRTFs were produced for the same group of people (N = 12), who also participated in the subjective experiments. A fourth HRTF set was derived from the generic HRTF set by keeping only the measurement points located in the horizontal plane and utilized in the second subjective listening experiment.
Individual HRTFs were measured in a semi-anechoic chamber at 1.2 m distance with a procedure described in [28]. The measurement setup comprised an array of 64 loudspeakers mounted on a semi-circle in elevation direction. A multiple exponential sweep was used to allow overlapping signal playback from multiple loudspeakers in parallel. The spatial resolution was 5 • in azimuth and 2.5 • in elevation covering the full azimuth circle and elevations from +90 • to −70 • resulting in Q = 4608 measurement positions with a filter length of 386 samples. The measurements were made at the entrance to the blocked ear canal. The floor reflection was cut out from the measurements in post-processing, and a low-frequency boost was applied to compensate for the non-flat magnitude response of the small drivers in the measurement setup.
The modeled HRTFs were based on high-precision 3D scans of the participants' pinnae and a set of anthropometric measurements of the head size and ear position. The measurements were used to scale the shape and dimensions of a scanned generic head-andtorso model, onto which the individual pinna models were stitched. A boundary element method numerical simulation then produced the modeled HRTFs [29]. Each modeled HRTF set had the same distribution and distance of measurement points and the same filter length as the measured HRTF set.
The generic binaural head (Neumann KU100) HRTFs were obtained from the spatial audio for domestic interactive entertainment (SADIE) database (https://www.york.ac. uk/sadie-project/index.html (accessed on 29 May 2019)). These measurements comprise Q = 1550 measurement positions at 1.5 m distance with azimuthal increments of 5 • and elevation increments of 10 • with added measurement points to satisfy specific Ambisonics reproduction setups. The head-related impulse response length is 256 samples. Notably, the KU100 binaural head does not include a torso, removing the shoulder reflection effect from the generic HRTFs. The measured, modeled, and generic HRTF set signal processing chains were level aligned to 68 dB A sound pressure level at 1 m distance using pink noise and a binaural head with a virtual sound source rendered directly in front.
Following [4,30], we employed a decoupled equalization technique using a reference sound field, in our case the diffuse field, to equalize the measured HRTFs and the headphones. This technique is opposed to a non-decoupled equalization, where a headphone transfer function (HPTF) is measured with the same measurement setup (including the head and microphones) as for the recording of the HRTFs. The decoupled equalization technique is found to offer the same degree of fidelity as using the HPTF approach, but with the added flexibility of equalizing the recording system and reproduction system in independent sessions using different equipment. The HRTFs were diffuse-field equalized for this experiment by calculating the average magnitude over all directions, inverting it, and creating a minimum-phase filter, which was then convolved with the HRTFs. The KU100 binaural head readily produces diffuse-field equalized measurements.

Objective Analysis
The objective analysis used two different auditory models, both implemented in the Auditory Modeling Toolbox [31], to gain objective metrics on the deviations in azimuth and elevation localization cues between the measured, modeled, and generic HRTF sets. We used a binaural azimuth estimation model [32] to obtain interaural time and level difference (ITD; ILD) estimates for each participant (N = 12) and the KU100 generic binaural head. The equivalent rectangular bandwidth (ERB) center frequency was set to 1000 Hz for ITD modeling and 4300 Hz for ILD modeling. Figure 1 displays the auditory-model-based estimates for the ITD (upper panel) and ILD (lower panel). Focusing on the maximum ITDs, the φ = 90 • azimuth horizontal plane mean ITD for the set of measured HRTFs is 723 µs (SD = 27 µs) and for the modeled HRTFs 669 µs (SD = 26 µs). For the KU100 generic head, the modeled ITD at 1 kHz is 786 µs. The difference in mean ITDs between the individual sets and compared to the generic set are above the just noticeable difference value of 20 µs found in previous literature [33]. Additionally, a paired sample Wilcoxon signed rank test showed that there is a statistically significant difference between the ITD populations (p = 0.002) and one-sample Wilcoxon signed rank tests showed significant median differences compared to the generic head ITD (measured p < 0.001, modeled p < 0.001).  5  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80  85  90  95  100  105  110  115  120  125  130  135  140  145  150  155  160  165  170  175  The deviations in ILD are more difficult to summarize. Still, a consistent ILD discrepancy can be observed in the azimuth range φ = {30, ..., 70} • , where the measured and generic HRTFs yield larger ILD values compared to the modeled HRTFs. Further to the side the ILD predictions become more varied between the HRTF sets, before converging to similar values beyond φ = 160 • behind the head.
For localization estimates in elevation, we used a sagittal-plane sound localization model [34]. This model assumes the human brain holds an internal template set of learned HRTFs, which can be compared with an incoming sound filtered with a target HRTF that can either be the person's own HRTF or a generic HRTF with some degree of deviation from the template. The model predicts the percentage of quadrant errors (θ > 90 • ) and the root-mean-square (RMS) local elevation errors in degrees (θ ≤ 90 • ) based on a comparison process between the target representation and template HRTFs. Here, the prediction was conducted for the elevation angles from −70 • to +250 • in 2.5 • increments in the 0 • azimuth sagittal plane. Table 1 displays the prediction results averaged over all elevation angles for the cases: (I) template and target are the modeled HRTFs; (II) template and target are the measured HRTFs; (III) template are the measured HRTFs, target the modeled HRTFs; (IV) template are the modeled HRTFs, target generic HRTFs; and (V) template are the measured HRTFs, target generic HRTFs. The results are further depicted graphically in Figure 2. The RMS local elevation errors for matching template and target HRTFs are significantly lower than the RMS elevation error averages when combining measured templates with modeled target HRTFs. Similar large elevation error values are observed when combining generic target HRTFs with the modeled or measured HRTF templates.  Comparable trends are observed with the predicted quadrant errors percentages, where the matching modeled HRTFs show the least quadrant errors followed by the matching measured HRTFs. Mixing measured template HRTFs with modeled target HRTFs yields a significantly higher errors percentage. A similar range of quadrant errors is observed for the generic target HRTFs combined with modeled or measured template HRTFs.

Subjective Experiments Overview
There were two sessions of subjective experiments: static localization (Section 4) and dynamic 6-DoF virtual reality (Section 5). The experiments were conducted starting with the dynamic VR experiment followed by the static localization experiment, but reported here in reversed order for readability. There was a gap of more than 12 months between the respective experiments. The dynamic VR experiment was further divided into a main experiment and a follow-up, where two additional scenes focusing on localization in elevation were evaluated a month after the main experiment. The following paragraph provides an overview of the participants in each of the subjective sessions.
The sample size for the subjective experiments was 12 participants, limited only due to the difficulties in ascertaining high fidelity measured and modeled individual HRTF sets. Subgroups of the subject pool with slight variations in numbers participated in the two experiment due to availability. For the static localization experiment, 8 people took part (1 female). Their average age is 37.4 years (SD = 11.7). In total, 12 people (2 female) took part in the main dynamic 6-DoF VR experiment. Their average age is 38.6 years (SD = 9.8). In the follow-up dynamic 6-DoF VR experiment, the two scenes focusing on localization in elevation, Above and Below, were evaluated on a separate day, where a subset of 8 participants (1 female) from the full pool did the test. This group had an average age of 37.5 years (SD = 9.3).
The participants were employees of Fraunhofer IIS or the International Audio Laboratories Erlangen at the time of the experiments, and can be considered expert listeners in audio quality evaluations [35]. However, they were not trained in VR-specific listening tests, and they were naive about the specific conditions under examination here. All of the participants reported no hearing impairments and had a normal or corrected-to-normal vision by self-report. All participants provided written informed consent to participate in each session of the experiment.

Stimuli
The experiment was constructed using an audio-visual virtual environment created in Max 8. Head-tracking, visual display, and user interaction system were provided by the HTC Vive Pro head-mounted display (HMD), using the SteamVR Version 2.0 tracking system. A visual interface in VR showed an aiming cross-hair and three concentric guide circles around the head, denoting the median, frontal, and horizontal planes. The cross-hair was coupled to the head direction. A starting location marker for each trial was drawn in VR as a circle with 2 • radius as seen from the central viewpoint at (0, 0) degrees. The stimulus sound was pulsed pink noise with a total duration of 3 s, a pulse length of 400 ms with pauses of 100 ms. The pulsed noise was used to provide onset and offset cues for auditory localization [36]. For each trial, the pulse train was convolved with HRTFs from either the measured or modeled HRTF database, matching the participant's own HRTFs, or using the generic HRTFs. Room acoustic cues were not modeled in this localization experiment, to focus only on the localization cues provided by the HRTFs. Target locations were selected as follows for the azimuth angles: φ = {0, 15, 75, 90, 105, 165, 180} • ; and elevation angles: θ = {−40, −20, 0, 20, 40} • . All combinations of the target azimuths and elevations were presented to the participants once in random order with random assignment to either the left or right hemisphere, assuming a left-right symmetry in auditory localization. The HRTFs were randomly interleaved from the three databases, i.e., the participant was unaware of the HRTF condition in each trial. The audio was played back at 68 dB A sound pressure level through an RME Fireface UCX audio interface and Beyerdynamic DT770 Pro headphones, diffuse-field equalized by the manufacturer.

Procedure
The starting location marker turned from red to green when the participant's head orientation was within the circle to indicate the next stimulus could be started. The participant was required to press a button on a hand-held controller to start the stimulus playback and keep their head static during the 3 s stimulus. Once the sound stopped, they should rotate their head orientation, thus the aiming cross-hair towards the perceived location of the sound event. Dynamic audio rendering was disabled during the stimulus presentation to ensure only static cues were available. Another button press by the participant was used to signal the perceived location of the sound event as indicated by head orientation. At the end of the trial, the participant was required to resume facing the starting marker. Four familiarization trials were included where a visual cue was spatially matching the sound event location. The practice targets were located in the horizontal plane at ±20 • and ±90 • azimuth. After the familiarization session, there were 105 trials (7 azimuths × 5 elevations × 3 HRTF sets). Figure 3 collects the local azimuth and elevation unsigned errors (panels a to d) and quadrant errors percentage (panel e). The quadrant errors percentages display a tendency for lower errors for the modeled and measured HRTFs compared to the generic HRTFs, but the effect is not significant (Friedman test: χ 2 (2,N=8) = 4.96, p = 0.084, Kendall's W = 0.31 (moderate effect size)). The local azimuth and elevation error angles are calculated after the quadrant errors (>90 • ) are filtered from the data. The azimuth errors are collected across all elevation angles and elevation errors across all azimuth angles. In the front, for azimuth errors, no large differences can be observed apart from the modeled HRTFs yielding lower errors at the φ = 15 • angle compared to the two other HRTFs. On the side, the measured HRTFs result in lower azimuth errors than the modeled or generic HRTFs, while at the back orientations, no differences are observed. The average azimuth angular errors collated across all azimuth angles are observed to be 17 • (Measured), 21 • (Modeled), and 20 • (Generic), with a non-significant main effect of the HRTF set (Friedman test: χ 2 (2,N=8) = 4.00, p = 0.135, Kendall's W = 0.25 (small effect size)). For elevation errors, negative elevation angles below the horizontal plane tend to yield greater unsigned localization errors than positive elevations. A tendency for the modeled HRTFs to yield larger errors than the other HRTF sets in the horizontal plane and below can be observed. The average elevation angular errors collated across all elevation angles are observed to be 21 • (Measured), 26 • (Modeled), and 21 • (Generic), with a non-significant main effect of the HRTF set (Friedman test: χ 2 (2,N=8) = 4.00, p = 0.135, Kendall's W = 0.25 (small effect size)). Average duration to indicate a target was found after the stimulus playback was 5.9 s (SD = 2.0 s).

Virtual Reality Environment
A platform for real-time evaluation of binaural renderers has been developed, allowing participants to switch between conditions with no interruption to audio-visual sensory input. The basic structure is presented in this section; for a thorough walk-through, please see [37]. The platform may be broken into three components: (1) VR device, (2) graphical rendering engine, and (3) audio rendering engine. For Component 1, the HTC Vive Pro HMD and SteamVR Version 2.0 tracking system were used for positional tracking, visual presentation, and control interface. For Component 2, the game engine Unity was used for the graphical rendering, and hosted positional information for all audio objects and participant's position and orientation using the SteamVR asset. All relevant positional and rotational data were then sent (via an Open Sound Control (OSC) data package at a 10 ms interval) to Max 8. In Component 3, binaural renderers, each equipped with a unique HRTF set, were hosted in Max 8 and were fed, in parallel, the positional and rotational information received from Unity.
The interface was designed such that it could be instantiated anywhere in the VR scene. A semi-transparent panel appeared at eye level in the participants' field of view by pressing a button on a hand-held controller. Pressing the button again hid the panel, allowing the user to explore the environment fully. A virtual laser pointer was used to switch between renderers, assign ratings, and move on to the next scene.

Stimuli and Conditions
Five conditions were created for each of the 6-DoF VR scenes based on the HRTF sets: Measured, Modeled, Generic, Generic horizontal, and No HRTF. The first three conditions were the same as in Experiment II. The Generic horizontal condition was created by retaining only the measurement points on the horizontal plane from the generic HRTF set. Lastly, condition No HRTF was created by disabling the binaural rendering and playing back a mono-downmix of the scene audio that did not react to listener movements.
There were four scenes focusing on different aspects of localization accuracy (In scene, Handheld, Above, and Below), two scenes focusing on timbre (Cello and Castanets), and two scenes focusing on overall audio quality (Park and Fountain). The scenes, task in the scene, and audio content are summarized in Table 2. Visually, all of the scenes presented the participants with a large empty space where a white sphere (r = 10 cm) denoted the location of the audio object. The environment was purposefully minimalistic in appearance to provide only the necessary cues to support audio-visual sensory integration, and not distract the participants with too much visual information.
The audio object was either placed in the scene (In scene) at 1.5 m height from the floor level or attached to the hand-held controller (Handheld), allowing the participant to move it around their head freely. In the scene Above, the audio object was placed 3 m above the floor level, as opposed to being on the floor in the scene Below. In these four scenes the audio content was pink noise. In the two scenes focusing on timbre, the audio object was placed at 1.5 m height, and the participant was instructed to stand on a marker 1 m in front of the audio object. These scenes featured anechoic recordings of a cello or castanets.
The overall quality scenes were composed of two spheres denoting two audio objects placed 3 m from each other at 1.5 m height. In Park, the two audio objects were a field recording of ducks and another field recording of shrub pruning with scissor clipping sounds and the rustling of leaves. In the Fountain scene, the audio objects were a dry studio recording of piano performance and a field recording of a water fountain.
In all of the scenes, the participant's movement was tracked in 6-DoF, and the audio rendering was updated dynamically to reflect the audio object's position in relation to the listener. The update was performed by choosing the nearest HRTF pair without interpolation. The source signals were convolved with each of the HRTF sets under study in parallel in real-time. To enable accurate comparison of the HRTFs, the acoustics of the virtual space were not modeled. The omitted early reflections and reverberation remove any supporting localization cues apart from the directional information in the actual HRTFs under test. Due to the lack of virtual acoustics, the direct-to-reverberant ratio cue was not modeled. Thus, distance rendering relied only on the intensity cue realized by applying the inverse square law with a maximum level reached at 0.1 m from the sound object. Auditory near-field effects were not modeled. Similar to the static localization experiment, the audio was played back through an RME Fireface UCX audio interface and Beyerdynamic DT770 Pro headphones. Overall audio quality Two audio objects: Piano and fountain

Procedure
The scenes Above and Below were evaluated on a separate day in a follow-up experiment by a subset of 8 participants. The rest of the scenes were evaluated in one main experiment, one after the other by all of the 12 participants. The presentation order was fixed to start with localization, followed by timbre, and finishing with overall quality. The participant was presented with a multiple stimuli quality scaling task in VR, where the localization accuracy, timbral fidelity, and overall audio quality were to be rated on a 100-point continuous scale, depending on the task. There were verbal labels marking the regions of the scale as bad, poor, fair, good, and excellent in 20-point intervals. The five HRTF conditions were presented side-by-side in the VR user interface, and the participant was allowed to switch between them. The HRTF conditions and interface buttons were randomized for each scene. The interface is displayed together with the sphere denoting audio object location in Figure 4. For the localization and overall quality tasks, the participants were instructed to freely move and teleport around in the 6-DoF scene and focus on either the localizability or general audio aspects of the experience. In these scenes, there was no audio reference to compare against; instead, the participants were assumed to possess an internal reference obtained from the day-to-day experience of listening in the real world. The internal reference should inform the participant how self-movement in relation to visual cues should be reflected in the audio reproduction. In the timbre scenes, the participant was instructed to stand on a marker 1 m away from the audio object and face towards the sound. In these scenes, the No HRTF condition was defined as the timbre reference, and all the ratings are relative to this condition. The No HRTF was included in the set of conditions as a hidden reference.

Results
The data from the subjective experiments are analyzed by non-parametric Friedman tests, with the participant defined as a random effect repeated across measurements. The effect sizes are calculated using the Kendall's W metric, which describes the level of agreement among the participants. To interpret the effect size, we refer to the guidelines laid out by Cohen [38]: 0.1 ≤ W < 0.3 (small effect), 0.3 ≤ W < 0.5 (moderate effect), and W ≥ 0.5 (large effect).
The localization accuracy was found significantly different for the different HRTF sets, when the audio object was placed in the scene, χ 2 (4,N=12) = 29.6, p < 0.001, or the audio object was hand-held χ 2 (4,N=12) = 37.6, p < 0.001. The corresponding effect sizes are W = 0.62 and W = 0.78, respectively. Similarly, the localization accuracy was found significantly different for the different HRTF sets, when the audio object was above the listener, χ 2 (4,N=8) = 24.8, p < 0.001, or when the audio object was below the listener χ 2 (4,N=8) = 22.5, p < 0.001. The corresponding effect sizes are W = 0.77 and W = 0.70, respectively. Post-hoc pairwise permutation tests revealed no significant differences between the measured, modeled, and generic HRTFs in any of these scenes. These results are displayed graphically in the first four panels a, b, c, and d, in Figure 5.
The timbral fidelity was found significantly different for the different HRTF sets, when listening to an anechoic cello sample, χ 2 (4,N=12) = 29.8, p < 0.001, or when listening to an anechoic castanets sample χ 2 (4,N=12) = 35.4, p < 0.001. The corresponding effect sizes are W = 0.62 and W = 0.74, respectively. These results are shown in Figure 5 panels e and f. Post-hoc tests revealed all the other conditions to be rated significantly lower than the hidden reference No HRTF in both scenes. In the Cello scene, the generic HRTFs were rated significantly higher than the measured or modeled HRTFs. In Castanets, the measured HRTFs were rated significantly lower than all the other conditions.
The overall audio quality was found significantly different for the different HRTF sets, when experiencing the park scene, χ 2 (4,N=12) = 27.3, p < 0.001, or the fountain music scene χ 2 (4,N=12) = 29.6, p < 0.001. The corresponding effect sizes are W = 0.57 and W = 0.62, respectively. The last panels g and h, in Figure 5 display these results. In Park, all of the HRTF sets, even the Generic horizontal, were rated equally and above the No HRTF condition according to post-hoc tests. In Fountain, no difference was found between measured, modeled, and generic HRTFs, but the Generic horizontal was significantly lower than Generic.

Objective Deviations
From the analysis in Section 2, the generic HRTFs were found statistically different from either of the individual sets based on the auditory model predictions, which supports our first hypothesis. However, the measured and modeled HRTF sets were found to be objectively significantly different also from each other. If the spectral cues in measured and modeled HRTF sets were similar, we would assume the elevation error predictions, in Figure 2, for the measured template and modeled target HRTF case to be close to the matching template and target errors, as shown in [34] for the own versus other HRTF cases. This, however, is not the case; the local elevation error and quadrant errors for this mixed prediction are closer to the cases with generic HRTFs. Similarly, the ITD and ILD analysis in Figure 1 showed significant differences between all three sets. Therefore, we reject our second hypothesis about the similarity of these sets. Recent work has found corresponding results regarding deviations between simulated and measured HRTFs, where significant differences were found in perceived coloration and localization between the sets [39]. There, Brinkmann et al. [39] suspected the absorbing effect of hair and clothing, missing from the modeled HRTFs, to at least partially explain the differences. However, the absorbing effect does not explain the relatively large deviation found in ITD values at lateral positions in our analysis. Here, the anthropometric measurement process may be a source of error.

Static Localization
In Experiment I, no clear improvement was found in static localization accuracy when using individual HRTFs. Our third hypothesis receives partial support from the non-significant (p = 0.083), but moderate effect size, of reduced quadrant errors percentage with individual HRTFs (measured 19%, modeled 18%) compared to the generic HRTFs (22%). In previous literature, using static localization tasks, Middlebrooks [4] showed a reduction in quadrant errors from 21% to 8% when switching from generic to individual HRTFs, Wightman and Kistler [40] a reduction from 11% to 6% when switching from individual HRTFs to free field, and finally Wenzel et al. [2], moving from generic HRTFs to free field conditions, a change from 31% to 19% of front-back confusion percentage. However, some studies show no effect in quadrant errors rate between individual and generic HRTFs, for example, Begault et al. [8] 59% and Bronkhorst [3] 28% for both HRTF types. There, especially the level of participant training in localization tasks has been suspected to mask the differences in HRTF cues. Considering a commercial scenario, most users are likely to be untrained, non-expert listeners, which might further minimize the localization benefits of individual HRTFs. Although not showing a significant reduction in quadrant errors percentage, our results appear to be in line with what may be expected to be the range of quadrant errors in static localization tasks using individual or generic HRTFs.
After cleaning the data from quadrant errors, the azimuth error angles should be approximately the same between HRTFs and free-field listening based on previous research [2,3,40]. For headphone localization tasks using either generic or individual HRTFs, the values found in previous research range from 15 • to 22 • [4,8]. These values may be contrasted with free field localization results, which yielded azimuth errors of 11 • on average in [4]. The results for average local azimuth errors in our static localization experiment varied between 17 • and 21 • depending on the HRTF set, which is in line with previous results. Similar to previous work, we found no statistically significant difference between the HRTF sets in the average azimuth error.
However, previous studies on static localization comparing individual and generic HRTFs have found significant differences in elevation localization accuracy in favor of the individual HRTFs [3,4], but also a lack of an effect is reported [8], where an average elevation error value of 18 • was found for both individual and generic HRTFs. In our static localization experiment, no significant reduction in unsigned elevation errors was observed when switching from generic HRTFs to either of the individual sets, which is in contrast with the majority of previous results. The values we found for average unsigned elevation errors ranged from 21 • to 26 • for the different HRTF sets. Some exemplary elevation error values found in the literature for static localization are 10 • or 21 • in the free field [4,9], 11 • , 29 • , or 37 • using individual HRTFs [4,9,41], and 42 • using generic HRTFs [4]. Comparing absolute localization accuracy results between studies is challenging due to different stimulus signals, target locations, pointing methods, visual settings, and participant training. Despite the lack of reduction in relative unsigned elevation errors between the HRTF sets, the values found in our study are in line with the previously found error values and more closely match the results found in free field or individual HRTF listening. One potential reason for the lack of significant results here could be an adaptation to the generic HRTFs since the static localization experiment was conducted after the dynamic 6-DoF VR experiment, although there was a lengthy gap between the experiments (12 months).

Dynamic Results
The lack of significant differences between the individual and generic HRTFs in dynamic localization accuracy and overall audio quality in the 6-DoF VR experiment supports our fourth hypothesis regarding the similarity of the HRTFs in this setting. This finding is corroborated by previous research in dynamic localization, where there are often no differences found between individual and generic HRTFs [5,8].
However, a few recent studies also show the contrary. Ben-Hur et al. [12] utilized a constricted movement range where head rotations were allowed, but the sound playback would stop outside of the range. The audio targets were placed beyond the allowed movement range so that dynamic cues could be incorporated in the localization process, but the localization accuracy could be tested for source directions other than only in front.
Ben-Hur et al. [12] found a significant localization improvement with individual HRTFs over generic HRTFs, which is in contrast to the findings of our study. Their experimental setup did not provide visual cues or motor cues beyond the limited movement range, possibly explaining part of the differences.
Another study using audio-visual VR showed that individual HRTFs were preferred in localizability and realism over a generic HRTF set when a target object was a drone flying overhead along the sagittal plane in a pre-programmed animation [13]. This finding is in contrast to our dynamic experiment results. One reason for the contrasting results could be the dynamic audio object and seated participant employed by Jenny and Reuter [13] in contrast to emphasizing the movement of the listener. A dynamically moving audiovisual object has been found to reduce listener movement in 6-DoF VR audio quality evaluations [42], which might lessen the adaptation to generic HRTF cues, but this assertion will require further studies.
The lack of perceptual differences in dynamic localization accuracy is surprising, keeping in mind the different number of measurement points in the individual sets (Q = 4608) compared to the generic set (Q = 1550). The difference stems from the elevation resolution since the azimuth resolution was the same for all sets (5 • ). In the localization scene with the audio object in the scene at 1.5 m height, even the Generic Horizontal condition with Q = 72 measurement points could not be statistically differentiated from the Generic set or the Modeled set, further highlighting the strength of visual influence on auditory localization. On another note, in dynamic localization without visual cues, a sparse HRTF set with just Q = 36 measurement points covering the whole sphere combined with linear interpolation for intermediate locations was shown to yield comparable localization accuracy with a high-resolution HRTF set with Q = 612 [12]. Our experiments did not employ an interpolation strategy, which might reduce possible differences between the HRTF sets with different numbers of measurement points. Using the 100-point localization accuracy scale with intermediate quality labels, the individual and spherical generic HRTF sets were evaluated in the good (60-80 points) and excellent (80-100 points) localizability range. Although the HRTF sets are to some extent affecting each others' relative scores in the absence of an auditory reference, it may be assumed that the visual and motor cues provided sufficient reference conditions for an absolute judgment of the localization accuracy for the experienced listeners in our experiments. This result adds confidence that the sound delivery chain comprising the diffuse-field equalization of the HRTFs and headphones, can convey a convincing spatial impression.
The lowest static and dynamic localization errors for virtual sounds in recent studies have resulted from experiments conducted with floating headphones, where in-ear headphones are mounted in front of the open ear canal entrance and individual headphone calibration is performed at the beginning of each experiment [9,12], using a procedure originally described in [43]. This procedure is different from the diffuse-field calibration performed in the present experiments and suggested in the previous literature [4,30]. Although yielding accurate localization results in the absence of visuals, the individual calibration procedure is laborious and unfeasible with current headphone technology for the general public. The lack of differences between generic and individual HRTFs in this study in a 6-DoF VR context add evidence to the cross-modal integration and adaptation effects on sound localization, possibly circumventing the need for high fidelity individual spatial cues delivery.
The preference in timbre in favor of the generic set is in agreement with previous research [16], where the effect of higher quality microphones in the generic binaural head was suspected to be one reason for the higher perceived timbre fidelity. In the study at hand, defining the No HRTF condition as the timbre reference may cause potential confusion in the participants since any of the HRTF conditions is spatially different from the reference. Moreover, the HRTFs cause timbre coloration by definition, making the comparison to a non-filtered stimulus unfair. It might be argued that the stronger localization cues the HRTF can convey, the more different the timbre will be from a non-spatial stimulus. In informal discussions with the participants, all of them had understood the goal of comparing the nonspatial timbre with a spatialized version, requiring a certain degree of mental simulation of how a real sound source should sound like if placed in front of the listener. The timbre results should only be viewed together with the localization and overall quality results to attain an accurate picture of the quality of the HRTF sets.
The overall quality results support our findings in localization and timbre: the individual and generic HRTF sets are perceived to be highly similar in quality. Here, the participants were allowed to focus on any aspect of the auditory experience while keeping in mind the boundary conditions set by the visuals and motor actions. The scenes were also more complicated than the previously presented scenes in this experiment, both in the number of audio objects and their content. Still, it may be argued that the employed scenes were far from a scene that might be experienced in the real world. In a real scene or a more realistic virtual scene, there would most likely be multiple competing audio objects active simultaneously and blending into a holistic perception of a soundscape. Then, given a visual scene to match the soundscape, the contribution of any individual audio object's auditory localizability to the overall experience would most likely be small. The experiments in this study were conducted in anechoic conditions to highlight the impact of HRTF fidelity on localizability and overall audio quality, representing the worst-case scenario for virtual audio. Based on previous literature, we may assume that adding early reflections and reverberation would further reduce the prominence of differences between individual and generic HRTF sets since these cues affect the spatial impression and externalization [8,44]. Furthermore, more realistic acoustic conditions would ease the integration of visual and auditory environments by reducing the room divergence effect [45], potentially reinforcing the adaptation to generic localization cues.
In summary, our results showed significant objective differences between three HRTF sets, but minimal perceptual differences in dynamic 6-DoF VR scenes. In future, instead of focusing on individualizing the HRTFs, we see potential in systematically adapting the user to listen with generic HRTFs and to listen to spatial audio in general. Recent evidence shows using audio-visual-motor cross-modal training to further improve the adaptation effectiveness compared to audio-visual training [26]. Such a scenario is applicable in virtual reality, where the hands are tracked in addition to the head.

Conclusions
This study compared measured and modeled individual HRTF sets with a generic HRTF set and found significant objective differences based on auditory model predictions between all three sets. A static localization experiment showed minor differences in favor of the individual HRTFs over the generic HRTFs in quadrant errors percentage, but a localization and overall quality experiment in 6-DoF audio-visual VR revealed no difference between the HRTF sets. The experimental results suggest there is no need to employ individual HRTFs when applied in a multimodal and dynamic environment.  Institutional Review Board Statement: Ethical review was not sought for the present study due to the common research ethics standards of the relevant scientific societies as well as survey research. The experiments do not pose any particular risks, i.e., participation in the study does not produce harm or discomfort beyond everyday experience.
Informed Consent Statement: Informed consent was obtained from all participants prior to the study. Data Availability Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.