Binaural Rendering with Measured Room Responses: First-Order Ambisonic Microphone vs. Dummy Head

To improve the limited degree of immersion of static binaural rendering for headphones, an increased measurement effort to obtain multiple-orientation binaural room impulse responses (MOBRIRs) is reasonable and enables dynamic variable-orientation rendering. We investigate the perceptual characteristics of dynamic rendering from MOBRIRs and test for the required angular resolution. Our first listening experiment shows that a resolution between 15° and 30° is sufficient to accomplish binaural rendering of high quality, regarding timbre, spatial mapping, and continuity. A more versatile alternative considers the separation of the room-dependent (RIR) from the listener-dependent head-related (HRIR) parts, and an efficient implementation thereof involves the measurement of a first-order Ambisonic RIR (ARIR) with a tetrahedral microphone. A resolution-enhanced ARIR can be obtained by an Ambisonic spatial decomposition method (ASDM) utilizing instantaneous direction-of-arrival estimation. ASDM permits dynamic rendering in higher-order Ambisonics, with the flexibility to render using either dummy-head or individualized HRIRs. Our comparative second listening experiment shows that 5th-order ASDM outperforms the MOBRIR rendering with resolutions coarser than 30° for all tested perceptual aspects. Both listening experiments are based on BRIRs and ARIRs measured in a studio environment.


Introduction
Typically, binaural rendering involves a convolution of source signals with measured or modeled head-related impulse responses (HRIRs) or binaural room impulse responses (BRIRs) and playback of the corresponding ear signals via headphones [1]. Both HRIRs and BRIRs implicitly contain the cues accessible to the human auditory system to perceive sound from a certain direction and distance, with a certain source width, envelopment, or spaciousness, cf. [2,3]. To mitigate the poor externalization encountered with both individual and non-individual HRIRs, it is often helpful to involve a natural or simulated acoustic room and thus to render with BRIRs instead. It is shown in [1,4] that BRIRs measured with a loudspeaker and a dummy head can achieve static binaural rendering of high audio quality and of convincing realism. Such a BRIR-based virtualization of loudspeakers in rooms is not only useful to virtualize multi-channel loudspeaker setups in mixing studios [5][6][7], but also to document and preserve acousmatic or electroacoustic music sceneries.
By involving the natural interaction of the ear signals with the head rotation of a listener, i.e., head-tracking [8] and dynamic rendering, immersion is improved as this reduces localization ambiguities and poor externalization [9][10][11]. Dynamic and interactive head-tracked BRIR rendering requires the acquisition of MOBRIRs (multi-orientation BRIRs), which can be tedious for highly resolved orientations. For best individualized results, in particular, one would need to measure MOBRIRs of each individual listener in each room to be auralized. Efficient and versatile alternatives [12][13][14] propose to measure the listener-dependent (HRIRs) and room-dependent (RIRs) parts separately to enable individualization as a second step.
State of the art: Linear interpolation of coarse-orientation MOBRIRs can cause strong comb-filter artifacts. Lindau [15] showed for multiple-orientation binaural recordings that the minimum required binaural grid resolution to avoid artifacts is most sensitive in anechoic conditions, and less sensitive in reverberant cases, in which the particular reverberation did not matter. To ensure continuous and robust interpolation from orientations coarser than 3°, a dual-band interpolation strategy is required, which the literature refers to as motion-tracked binaural (MTB) [16]. At low frequencies, the dual-band approach interpolates the headphone signal from neighboring pairs of recorded ear signals linearly, while comb filters at high frequencies are avoided by combining interpolated spectral magnitudes with suitable phase values, e.g., found by spectrogram inversion [17]. Less challenging approaches yielding a suitable phase are discussed in [18], and perceptual properties are studied in [19]. Perceptually optimal cross-over frequencies and block sizes were investigated in [20] for the static and dynamic case, with which rendering from MOBRIRs resolved finer than 30° was found to be indiscernible from a 1°-resolved reference.
Dynamic rendering based on first-order Ambisonic RIRs (ARIRs) and a pre-measured set of high-resolution HRIRs, e.g., of a dummy head [21], is studied in [22]. Static rendering with ASDM (Ambisonic Spatial Decomposition Method) upscaling was shown to yield perceptually indistinguishable results for the 7th order when compared to a reference dummy-head BRIR. Moreover, the study involved three rooms of different reverberation times (0.3 s, 0.7 s, and 1.4 s) and could show that the performance of the ASDM method did not depend on the particular reverberation time.
Contents: ARIR-based and MOBRIR-based rendering have not been compared yet; our contribution establishes a balanced comparison.
The goal is to identify configurations in which both methods yield perceptually optimal or correspondingly scaled results. Some correspondence is expected, as both the binaural Ambisonic rendering of ARIRs using MagLS [22,23] and the interference-avoiding high-frequency strategy of MOBRIR-based rendering [20] rely on spectral phase simplification at high frequencies, and both relate to an angular resolution. To make the comparison reproducible, a room impulse response data set (dummy head and Ambisonic microphone) is measured in a studio environment and made accessible in this contribution. As the main part of the paper, Section 3 is dedicated to our comparative listening experiment on variable-orientation rendering from MOBRIRs resolved in {30°, 45°, 60°} steps and the corresponding ASDM-based Ambisonic orders {5, 3, 1} rendered using the same dummy-head HRIRs [21].
In both listening experiments, participants are asked to rate the perceptual attributes (i) timbre and (ii) spatial mapping for static rendering, and (iii) continuity for dynamic rendering. Both experiments compare the renderers to a coarse linear MOBRIR interpolation (anchor) and to linear interpolation from 1° MOBRIRs (reference).
Measurements: The RIR measurements used here are available online (https://opendata.iem.at/projects/binauralroomresponses/) and were taken from the IEM production studio (volume 127 m³, base area 42 m², T60 ≈ 0.4 s), in which Neumann (Germany) KH310-A loudspeakers are mounted in various directions. MOBRIRs were measured with a Neumann KU100 dummy head in rotations of 1° steps (turntable) using the exponentially swept sine technique. The available loudspeaker directions are depicted in Figure 1a; Figure 1b shows the dummy head in the center listening position, facing the center loudspeaker (channel 3). The B-format ARIRs were measured after replacing the turntable and dummy head with the Soundfield ST450 array. The room was selected for studying MOBRIR interpolation and binaural Ambisonic rendering as its short reverberation already supports externalization in typical listening environments, and its pronounced direct and early parts are expected to be critical considering both timbral artifacts and spatial mapping deficiencies [15,24].

Experiment I: Dynamic Rendering of Multiple-Orientation Binaural Signals
Experiment I is based on data of a previous study [20], and it evaluates the dual-band strategy of linear interpolation with switched high-frequency phase (LISHPh) using dummy-head MOBRIRs of the resolutions 8°, 15°, 30°, and 60°, and compares it with a reference linear interpolation of MOBRIRs resolved by 1°. The LISHPh method is described in Section 2.1, and the design and implementation of the listening experiment are discussed in Sections 2.2 and 2.3, respectively. Response data (https://opendata.iem.at/projects/listening_experiment_data/) and examples of the audio stimuli (https://phaidra.kug.ac.at/view/o:77319) are made available for download; the examples include renderings of static head orientations and of emulated continuous head rotation. Finally, the results of Experiment I are discussed in Section 2.4.

Linear Interpolation with Switched High-Frequency Phase (LISHPh)
For both the left and right ear, the interpolated ear signal in a horizontal set of orientations is obtained by a combination of the corresponding signals x_q(t) and x_{q+1}(t) belonging to the head orientations closest to the current orientation ϕ(t) of the listener, where t is the discrete-time index. With MOBRIRs measured for Q equi-angular orientations on the horizon (around the Cartesian z-axis) and ∆ϕ as azimuthal resolution, the indices of the two closest BRIRs (or recorded ear signals) are q = ⌊ϕ(t)/∆ϕ⌋ and q + 1 = ⌈ϕ(t)/∆ϕ⌉, where ⌊·⌋ and ⌈·⌉ are the floor and ceiling functions, respectively, and the ear signals x_q(t) and x_{q+1}(t) are obtained by convolution; see Figure 2a.
In a broadband linear interpolation, the resulting ear signal is obtained as

x(t) = (1 − α) x_q(t) + α x_{q+1}(t), (1)

where the interpolation weight is α = ϕ(t)/∆ϕ − ⌊ϕ(t)/∆ϕ⌋. However, the linear combination of delayed signals produces comb filtering, introducing severe colorations in the resulting signal. In particular, the maximum delay τ = (r/c) sin(∆ϕ) between adjacent HRIRs is estimated by a simplistic head model, where r = 8.5 cm is the head radius and c = 343 m/s is the speed of sound. The maximum delay is observed between ear signals of the head orientation 0° and those of the orientations ±∆ϕ, for a frontal source. To avoid destructive interference, artifact-free linear interpolation can only be achieved below

f_max = 1/(2τ) = c / (2 r sin ∆ϕ). (2)

Spectral artifacts of linearly interpolated BRIRs are comparable with those of HRIRs when the direct sound dominates. For interpolation of BRIRs from a diffuse field, the same (worst-case) frequency limit for destructive interference holds. In particular, if the contribution of frontal sounds is pronounced, the interpolated result partly contains the destructive interference at f_max.
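As a concrete sketch, the index, weight, and frequency-limit computations above can be written in a few lines of Python (numpy only; the function names are ours, not taken from the paper's Pure Data implementation):

```python
import numpy as np

def interp_indices(phi_deg, delta_phi_deg):
    """Indices q, q+1 of the two neighbouring MOBRIR orientations and the
    linear interpolation weight alpha. If phi is an exact multiple of the
    resolution, both indices coincide and alpha is 0."""
    q = int(np.floor(phi_deg / delta_phi_deg))
    q1 = int(np.ceil(phi_deg / delta_phi_deg))
    alpha = phi_deg / delta_phi_deg - q
    return q, q1, alpha

def f_max(delta_phi_deg, r=0.085, c=343.0):
    """Highest artifact-free frequency for broadband linear interpolation:
    the worst-case delay tau = (r/c) * sin(delta_phi) between adjacent
    responses reaches half a period at f = 1 / (2 * tau)."""
    tau = (r / c) * np.sin(np.deg2rad(delta_phi_deg))
    return 1.0 / (2.0 * tau)
```

For example, a head orientation of 37° with a 30° grid interpolates between orientations 1 and 2 with alpha ≈ 0.23, and a 30° grid limits artifact-free broadband interpolation to roughly 4 kHz.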
The LISHPh method is employed [16,18,20] to avoid noticeable spectral artifacts around and above f_max, regardless of the acoustic scenario. As depicted in Figure 2b, the signal in the lower band is processed in the time domain by applying the linear weights as in Equation (1), and the signal in the high-frequency band is obtained by the magnitude interpolation

X(k) = [(1 − α) |X_q(k)| + α |X_{q+1}(k)|] e^{i φ(k)}, (3)

where k is the frequency index of a short-time Fourier transform (STFT) frame, i² = −1, and φ(k) is the phase argument, which is switched between φ(k) = φ_{q+1}(k) for α ≥ 0.5 and φ(k) = φ_q(k) otherwise. Whenever the phase argument of a narrow-band signal has to make a transition by π, switching can theoretically become audible and can only be avoided by spectrogram inversion [17,18]. We avoided the additional effort, as the negative influence of the switching noise turned out to be inaudible for speech, music recordings, noise, etc. with suitable block size settings [20].
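The high-band rule above, i.e., linear magnitude interpolation with the phase of the nearer neighbour, can be sketched for a single STFT frame as follows (a minimal stand-in, not the paper's Pure Data code):

```python
import numpy as np

def lishph_highband_frame(Xq, Xq1, alpha):
    """One STFT frame of the LISHPh high band: interpolate the magnitudes
    linearly with weight alpha, and take the phase of the nearer of the two
    neighbouring frames (switched, never interpolated)."""
    mag = (1 - alpha) * np.abs(Xq) + alpha * np.abs(Xq1)
    phase = np.angle(Xq1) if alpha >= 0.5 else np.angle(Xq)
    return mag * np.exp(1j * phase)
```

Because the phase is copied rather than mixed, two frames that are delayed copies of each other cannot interfere destructively, which is exactly what the broadband linear interpolation fails to guarantee above f_max.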

Listening Experiment: Design
We tested the LISHPh method rendering from MOBRIR resolutions of ∆ϕ = {8, 15, 30, 60}° and also included the broadband linearly interpolated ear signals for comparison. During the listening experiment, each listener was asked (i) to rate the spatial mapping (i.e., direction and distance) and timbre compared to a reference condition for static rendering (four different head orientations), and (ii) to rate the continuity or robustness when rendering dynamically, i.e., incorporating head movements of the listener.
(a) Filter switching and convolution to obtain the two neighbouring ear signals which are closest to the current head rotation.
The 10 test participants (all male, aged between 27 and 42) were asked to rate the overall difference between a reference (artifact-free linear interpolation at 1° resolution) and the test signals on a continuous scale from poor to very good. A hidden reference was used for the screening of ratings, and thus the test procedure can be described as MUSHRA-like (multi-stimulus with hidden reference and anchor [25]). The test signals were continuously looped, and participants were allowed to seamlessly switch between the signals in real time as often as desired.
In terms of timbre and spatialization, we tested four different static head orientations ϕ = {12, 21, 37, 78}° (the sign of the orientation was randomly changed across participants) for a frontal source position, with pink noise and music as source signals, respectively. The choice of the source position and orientations was made to render the experiment most sensitive to the expected interpolation artifacts: on the one hand, time delays (phases or ITDs) are most head-orientation-dependent for predominantly frontal source positions; on the other hand, the orientations were selected to enforce interpolation with 0.2 < α < 0.8 for all MOBRIR sets under test.
Testing the continuity involved a pink noise and a music signal played back over a virtual frontal (0°) and lateral (90°) loudspeaker. Here, listeners were asked to rotate their head between ϕ = −45°…45°, and a check-box for automatic rotation at 180°/s was included for fast movements.

Listening Experiment: Implementation and Settings
The real-time implementations of LISHPh and of the broadband linear interpolation were done in Pure Data (https://puredata.info/), an open-source real-time audio software. Appendix A describes the example implementation provided online (https://phaidra.kug.ac.at/o:97087).
Block processing: In the short-time block processing of the high-frequency part with block size N and hop size L = N/2, a sine half-wave window is applied at both the analysis and synthesis stages to reduce cyclic artifacts at the block boundaries. As found in [20], keeping the block size low is crucial to avoid temporal artifacts and to obtain low latency; as the optimum for broadband musical sounds, we suggest setting N = 128 at a sampling rate of 44.1 kHz. For the high-frequency phase selection, we used an update rate of 200 Hz; see the block labeled Selector in the diagram of Figure 2a.
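The sine half-wave window applied twice (analysis and synthesis) yields a squared-sine effective window, which at 50% overlap sums exactly to one, so unmodified blocks are reconstructed perfectly. A small numpy check of this constant-overlap-add property:

```python
import numpy as np

N = 128      # block size at 44.1 kHz, as suggested in the text
L = N // 2   # hop size N/2, i.e., 50% overlap
w = np.sin(np.pi * (np.arange(N) + 0.5) / N)  # sine half-wave window

# The effective window after analysis and synthesis is w**2. At 50% overlap,
# sin^2 + cos^2 = 1 holds sample by sample, so overlap-added squared windows
# are exactly constant and an unmodified signal passes through unchanged.
ola = w[:L]**2 + w[L:]**2
```

This is why the processing chain stays transparent whenever the high-band magnitudes and phases are left untouched.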
Crossover: We estimate the spectral ripple of two interfering signals with phase offset ∆φ by

r(∆φ) = −20 log₁₀ |cos(∆φ/2)| dB. (4)

To keep the spectral ripple below 3 dB, we require a phase difference ∆φ < π/2, or ∆φ < π/4 to keep it below 0.7 dB and hence spectrally inaudible [26]. With Equation (2) or the rule of thumb f_max ≈ 2 kHz · 57.3°/∆ϕ, the phase difference is ∆φ = π f/f_max in our case. Spectrally, good results are achieved with a crossover frequency f_c ≈ f_max/4. However, setting it too low, e.g., f_c < 1.5 kHz, impedes interaural phase cues in a relevant frequency range. Accordingly, this choice offers a reasonable trade-off, cf. [20]. For ∆ϕ = {8, 15, 30, 60}°, we obtain the crossovers at

f_c = {4, 2, 2, 1} kHz. (5)

The crossover is implemented by 4th-order Linkwitz-Riley filters because of their in-phase sub-bands [27,28].
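A short numeric sketch of the ripple estimate and the f_max/4 rule of thumb (note that the crossovers actually used in the experiment are rounded values of this rule, adjusted towards the trade-off discussed above; the helper names are ours):

```python
import numpy as np

def ripple_db(phase_offset):
    """Worst-case level change (dB) when two equal-magnitude signals with the
    given phase offset are summed, relative to the fully coherent sum:
    |1 + exp(i*dphi)| / 2 = |cos(dphi/2)|."""
    return -20.0 * np.log10(np.abs(np.cos(phase_offset / 2)))

def crossover(delta_phi_deg):
    """Crossover candidate f_c ~ f_max / 4, using the rule of thumb
    f_max ~ 2 kHz * 57.3 deg / delta_phi; at f_c the phase offset is pi/4,
    keeping the ripple below about 0.7 dB."""
    fmax = 2000.0 * 57.3 / delta_phi_deg
    return fmax / 4.0
```

For a phase offset of π/2 the ripple is about 3 dB, and for π/4 about 0.7 dB, matching the thresholds cited in the text; for ∆ϕ = 8° the rule yields f_c ≈ 3.6 kHz, rounded to 4 kHz in the experiment.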
For playback, we used AKG K702 headphones equipped with the IEM headtracker [8] and the experiment was conducted in a quiet office room.

Results and Discussion
The statistical analysis below uses pooled data for each attribute; see Figure 3. For timbre and spatial mapping, the ratings of four directions were pooled, and both virtual loudspeaker directions were pooled for continuity. Please note that throughout the article we use a Wilcoxon signed-rank test [29] with a Bonferroni-Holm correction [30] to determine the p-values of pair-wise comparisons between test conditions and define p < 0.05 as significantly different. We employ non-parametric statistics as we do not assume a normal distribution of the ratings due to severe clustering at the limits of the scale.

Timbre: Per MOBRIR resolution, there is a clear advantage of the LISHPh method in the proposed settings compared to broadband linear interpolation. The p-values (significance level) given in the upper triangle of Table 1 indicate that there are four groups, which are significantly different from each other. The LISHPh interpolations with the settings 8°/4k and 15°/2k are not significantly different (p = 0.11) from the 1° reference condition and perform significantly better than all other conditions. For the coarser resolutions, the quality of the LISHPh interpolation decreases significantly with spacing. However, for all orientation resolutions, LISHPh performs significantly better than linear interpolation (p < 0.005). The best linear interpolation conditions, 8° lin and 15° lin, are comparable to the worst LISHPh condition 60°/1k (p > 0.36).

Spatial mapping: The corresponding p-values are given in the lower triangle of Table 1. For coarser resolutions, the quality of spatial mapping decreases significantly. Again, the LISHPh conditions clearly outperform the linear interpolation. However, the best linearly interpolated conditions, 8° lin and 15° lin, are significantly better than the worst LISHPh condition 60°/1k. This is caused by its low crossover frequency of f_c = 1 kHz, which is set to avoid comb filtering (cf. Equation (5)) but already distorts the interaural time difference (ITD) in a sensitive frequency range [31].
Continuity: The results for the perceived continuity are depicted in Figure 3b. While there is a clear absolute difference depending on the source signal (yellow vs. blue lines, i.e., pink noise vs. music), the trend is similar. For both signal types, all of the LISHPh conditions except 60°/1k do not significantly differ from the reference and from each other; see Table 2. Coarser MOBRIR resolutions lead to a decrease in quality, and the LISHPh conditions clearly outperform the linear interpolation of the corresponding resolution. The improved continuity of 60° lin compared to the denser linearly interpolated MOBRIRs can be explained by its reduced timbral variation when rotating the head, which seemed more important to listeners than spatial mapping.

Experiment II: Dummy-Head MOBRIR vs. ARIR
In Experiment II, we evaluate and compare the perceptual aspects of MOBRIR-based (LISHPh) and ARIR-based (ASDM) dynamic rendering. The concept of ARIR-based rendering and the relevant signal processing involved to accomplish upscaling are described in Section 3.1. A description of the listening experiment, the implementation, and the corresponding discussions are presented in Sections 3.2-3.4, respectively. Appendix B shows the MATLAB implementation of the ASDM upscaling. Audio examples (https://phaidra.kug.ac.at/view/o:77319) of the material used in the listening experiment as well as its response data (https://opendata.iem.at/projects/listening_experiment_data/) are available for download. The examples include renderings of static head orientations and of emulated continuous head orientations.

Rendering with Measured Ambisonic RIR and the Ambisonic Spatial Decomposition Method (ASDM)
As depicted in Figure 4, dynamic binaural rendering from room responses measured in Ambisonics (ARIRs) is modular and consists of three blocks: (i) a multi-channel convolution of the source signal with an ARIR upscaled to order N, (ii) an efficient rotation [32,33] corresponding to the head orientation of the listener, and (iii) a multi-channel convolution with an Ambisonic binaural renderer.
For efficient and low-effort measurements of the ARIR, we use a compact first-order tetrahedral spherical microphone array and denote the discrete-time B-format ARIRs h(t), x(t), y(t), z(t) as the responses of a Soundfield ST450 array. Similar to the spatial decomposition method (SDM) [34], the Ambisonic SDM (ASDM) assigns a direction of arrival (DOA) θ(t) to each discrete-time sample t of the omnidirectional RIR component h(t), cf. [22]. For the DOA estimation, we suggest using the pseudo-intensity vector (PIV) [35] for the frequencies between 200 Hz and 3 kHz. Here, the upper frequency limit is chosen below the spatial aliasing frequency f_a = c/(2π r_ST450) ≈ 3.6 kHz for r_ST450 = 1.5 cm, and the low cut minimizes low-frequency disturbance in the DOA estimation. We perform a zero-phase band limitation (e.g., by MATLAB's filtfilt) denoted by F_{200-3k} and a zero-phase temporal smoothing F_L of the resulting PIV using a moving-average Hann window in the interval [−L/2; L/2] for L = 16 to get the DOA estimate as a Cartesian unit vector θ(t).
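The PIV-based DOA estimation described above can be sketched in Python with scipy in place of MATLAB's filtfilt (a simplified stand-in: the band limitation here is a 2nd-order Butterworth band-pass, and the sign and scaling conventions of the pseudo-intensity vector are reduced to pressure times velocity components):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def piv_doa(h, x, y, z, fs, L=16):
    """Sample-wise DOA estimate from B-format ARIRs via the pseudo-intensity
    vector: zero-phase band limitation (200 Hz - 3 kHz), pressure times
    velocity components, moving-average Hann smoothing over L samples,
    normalization to Cartesian unit vectors theta(t) (3 x T array)."""
    b, a = butter(2, [200.0 / (fs / 2), 3000.0 / (fs / 2)], btype="band")
    hf, xf, yf, zf = (filtfilt(b, a, s) for s in (h, x, y, z))
    piv = hf[None, :] * np.stack([xf, yf, zf])   # pseudo-intensity, 3 x T
    win = np.hanning(L)
    win /= win.sum()
    piv = np.stack([np.convolve(p, win, mode="same") for p in piv])
    norm = np.maximum(np.linalg.norm(piv, axis=0), 1e-12)
    return piv / norm
```

For a source exactly on the x-axis (velocity component x identical to the pressure signal, y and z zero), the estimate is the constant unit vector (1, 0, 0).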
In a first step, the ASDM-upscaled ARIR re-encodes every time sample at the detected DOA,

ĥ_nm(t) = h(t) Y_n^m(θ(t)), (6)

where Y_n^m(θ) are the N3D-normalized, real-valued spherical harmonics of order n and degree m, cf. [23], evaluated at the direction θ, and the maximum order n ≤ N can be chosen freely. In the late, diffuse part of the response, the implicit assumption of there being only a single DOA per time sample does not hold. As a result, the fluctuation of the DOA θ(t) causes amplitude modulation and destroys narrow-band spectral content in ĥ_nm(t); typically, the longer low-frequency reverberation tails are hereby mixed towards higher frequencies, causing unnaturally long reverberation there [12,36] at high orders, cf. the solid lines in Figure 5. However, theoretically, the expected temporal energy decay in an ideal (isotropic) diffuse field must be identical for any receiver of random-energy-efficiency-normalized directivity, such as the spherical harmonics, also after decomposition into frequency bands, and hence requires correction.
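The sample-wise re-encoding is a plain product of the omni response with the spherical harmonics evaluated at the estimated DOA. A minimal sketch for order 1 only (the real N3D first-order harmonics in ACN ordering reduce to a constant and the scaled Cartesian components; higher orders follow the same pattern with higher-order harmonics):

```python
import numpy as np

def sh_n3d_order1(theta):
    """Real N3D spherical harmonics up to order 1 for unit vectors theta
    (3 x T array, rows x, y, z), ACN ordering:
    Y_00 = 1, (Y_1-1, Y_10, Y_11) = sqrt(3) * (y, z, x)."""
    x, y, z = theta
    s3 = np.sqrt(3.0)
    return np.stack([np.ones_like(x), s3 * y, s3 * z, s3 * x])

def asdm_upscale(h, theta):
    """Re-encode each sample of the omni RIR h(t) at its estimated DOA:
    h_nm(t) = h(t) * Y_nm(theta(t)). Shown for order 1 for brevity."""
    return h[None, :] * sh_n3d_order1(theta)
```

A useful sanity check is the quadratic-sum property of the N3D harmonics: the squared order-1 channels sum to (2n + 1) h(t)² = 3 h(t)² for unit DOA vectors, for any sound field.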
Despite the mismatch, the formal derivation in [22] showed that the quadratic summation across same-order spherical harmonics, ∑_m ĥ_nm(t)² = (2n + 1) h(t)², recovers the omnidirectional squared response for all spherical harmonic orders n and for any sound field. To enforce consistency with the spectral squares of h(t), third-octave filtering is useful, where the bth sub-band signal F_b{h(t)} with center frequency f_b is obtained from a bank of zero-phase filters F_b that perfectly reconstructs h(t) = ∑_b F_b{h(t)}.
For every sub-band b and order n, an energy decay of the ASDM-upscaled ARIR F_b{ĥ_nm(t)} matching the original one of F_b{h(t)} is enforced by the envelope correction

F_b{ĥ_nm(t)} ← F_b{ĥ_nm(t)} · sqrt[ (2n + 1) F_T{F_b{h(t)}²} / ∑_m F_T{F_b{ĥ_nm(t)}²} ], (7)

where F_T{·} denotes temporal averaging with a time constant T (e.g., 100 ms). The energy decay reliefs for the initial and corrected results of ASDM are shown exemplarily for the third-octave band at f_b = 2 kHz and within the orders n = {1, 3, 5, 7} in Figure 5.
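The envelope correction described above amounts to a time-varying gain per sub-band and order. A minimal sketch for a single sub-band and order n (the temporal averaging F_T is simplified here to a moving rectangular window of T samples; the paper suggests a time constant of about 100 ms):

```python
import numpy as np

def smooth(x, T):
    """Temporal averaging F_T, simplified to a moving rectangular window."""
    win = np.ones(T) / T
    return np.convolve(x, win, mode="same")

def envelope_correct(h_b, hnm_b, n, T=64):
    """Per sub-band envelope correction: scale the (2n+1) order-n ASDM
    channels hnm_b so that their normalized squared sum follows the energy
    envelope of the original omni sub-band signal h_b."""
    target = smooth(h_b ** 2, T)
    actual = smooth(np.sum(hnm_b ** 2, axis=0) / (2 * n + 1), T)
    g = np.sqrt(target / np.maximum(actual, 1e-20))
    return hnm_b * g[None, :]
```

If the upscaled channels carry, say, twice the target envelope, the correction gain settles at 0.5 and the original decay is restored exactly.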
Finally, the ear signals are obtained by a multi-channel convolution of the rotated Ambisonic signals with any state-of-the-art FIR binaural Ambisonic renderer, cf. Figure 4. Our study employs the time-invariant filters of the MagLS (magnitude-least-squares) renderer (the MagLS renderer is part of the IEM plug-in suite, https://plugins.iem.at/), defined in [24,37], to get high-quality Ambisonic rendering already with an order N = 3. The filters were designed using a magnitude-least-squares optimization that disregards the phase match in favor of an improved HRTF magnitude at high frequencies and hereby avoids spectral artifacts. MagLS also includes an interaural covariance correction that offers an optimal compromise to render diffuse fields consistently [23].

Listening Experiment: Design
Similar to Experiment I, listeners were asked to rate the spatial mapping, coloration, and continuity compared to the 1° reference. The test conditions included ARIR rendering with the ASDM target orders N = {1, 3, 5} as well as the corresponding MOBRIR resolutions ∆ϕ = {60, 45, 30}° rendered with LISHPh and broadband linear interpolation. Note that the set of MOBRIR resolutions is derived from the number of loudspeakers used for Ambisonic reproduction [23] in practice, ∆ϕ_N ≈ 180°/(N + 1); for N = 1, we chose 60° instead of 90° to maintain a reasonable MOBRIR resolution.
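The resolution-to-order correspondence used to pick the conditions is a one-line rule of thumb:

```python
# Practical MOBRIR resolution matched to Ambisonic order N,
# delta_phi_N ~ 180 deg / (N + 1).
resolutions = {N: 180.0 / (N + 1) for N in (1, 3, 5)}
# N = 1 maps to 90 deg; the experiment used 60 deg instead (see text).
```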
We asked participants to rate the timbre differences and the consistency of spatial mapping for the five static listener orientations ϕ = {0, −35, 12, −15, 22.5}°, which were switched automatically in 900 ms intervals and restarted at the beginning of every audio loop. The orientations were chosen such that 0.25 < α < 0.83 for all resolutions in order to test for a high interpolation depth; ϕ = 0° is included as the reference orientation, and it marks the start of each loop. As source positions, we used a frontal and a lateral virtual loudspeaker, cf. loudspeakers 3 and 7 in Figure 1a. The continuity test to compare dynamic rendering was similar to that of Experiment I; see Section 2.2.

Listening Experiment: Implementation
While ARIR-based rendering could be implemented in a multichannel DAW (e.g., Reaper) by using freely available convolution (http://www.matthiaskronlachner.com/?p=1910), rotation, and rendering plug-ins (https://plugins.iem.at/), there is no tested and easy-to-use plug-in for MOBRIR rendering yet. To rule out any effects due to different implementations, we used the pure data implementation as described in Section 2.3 to also emulate ARIR-based rendering. To this end, we evaluated the ARIR-based BRIRs according to [22] to get a ∆ϕ = 1° MOBRIR (for each ear),

x_q(t) = ∑_{n,m} b_nm(t) ∗ ∑_{m'} r_n^{mm'}(ϕ_q) h_nm'(t),

where b_nm(t) is the FIR binaural Ambisonic renderer, h_nm(t) is the ARIR of the order N, q is the orientation index, and r_n^{mm'} is the Ambisonic rotator. As the binaural renderer b_nm(t), we employed the one from [37] with KU100 HRIRs measured for 2702 directions [21]. The resulting AMOBRIR was linearly interpolated like the reference condition.
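The rotate-then-render emulation above can be sketched for first order, where the z-rotation is trivial: the omni and z channels are invariant, and the x- and y-dipole channels transform like the coordinates. This is a hedged sketch (hypothetical helper names, one ear only, and the rotation sign convention is an assumption):

```python
import numpy as np

def rotate_z_order1(h_acn, phi):
    """Rotate first-order Ambisonic signals (ACN ordering: W, Y, Z, X) by the
    angle phi around the z-axis. W and Z are invariant; Y and X transform
    like the y and x coordinates of the DOA."""
    W, Y, Z, X = h_acn
    c, s = np.cos(phi), np.sin(phi)
    return np.stack([W, c * Y + s * X, Z, c * X - s * Y])

def amobrir(h_acn, b_acn, phis):
    """Emulate a MOBRIR set from a first-order ARIR: rotate per orientation,
    convolve each channel with the binaural renderer filters b_nm(t)
    (hypothetical 4 x K filter array, one ear), and sum over channels."""
    return [sum(np.convolve(hr, bf)
                for hr, bf in zip(rotate_z_order1(h_acn, p), b_acn))
            for p in phis]
```

Evaluating this for ϕ_q in 1° steps yields the AMOBRIR set that was then interpolated exactly like the reference condition.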
For playback, we again used AKG K702 headphones equipped with the IEM headtracker [8] and the experiment was conducted in a quiet office room.

Results and Discussion
The results of the listening experiment with nine participants (all male, experienced listeners with normal hearing, aged between 27 and 57) are depicted in Figure 6 and are discussed in detail below.
Timbre: While most of the conditions are rated significantly poorer than the reference, the following are not: LISHPh with 30°/2k and ARIR rendering with the ASDM-upscaled orders 3 and 5, cf. the upper triangle in Table 3. ARIR rendering with the 5th order is generally rated best; however, it is not significantly different from the 30°/2k and 3rd-order conditions. The timbral quality decreases with both coarser orientation resolution and lower order, and thus the 60°/1k and 1st-order conditions yield significantly lower quality. The broadband linear interpolation conditions received the poorest ratings and significantly differ from all other conditions, with the exception of 60°/1k and the 1st order.

Spatial mapping: While the general trend is similar to the timbre results, for spatial mapping only the 1st order and all linear interpolations are significantly poorer than the reference, cf. the lower triangle in Table 3. Again, the 5th-order ARIR rendering is rated highest, albeit not significantly better than the 3rd-order ARIR and all LISHPh conditions (p > 0.68). The 30°, 45°, and 60° linear interpolations are significantly outperformed by all other conditions except the 1st-order ARIR rendering. Please note that we tested static directions different from Experiment I, and even though the trend is similar to Figure 3a, the 30°/2k condition is not significantly different from the reference condition here. This can be attributed to participants not always rating the reference highest in Experiment II.

Continuity:
The ratings of the continuity, i.e., the robustness of source position and timbre to head rotations, are depicted in Figure 6b and Table 4, respectively. Quality ratings tend to be higher for music than for pink noise as the source signal. Independent of the source signal, the 5th-order and 3rd-order ARIR conditions as well as all LISHPh conditions do not significantly differ from the reference condition (p > 0.15). Again, all linearly interpolated conditions and the 1st-order condition perform poorly, are significantly different from all other conditions, and are similar to each other.

Conclusions
We evaluated two fundamentally different measurement-based binaural audio rendering strategies in a novel comparative listening experiment: The dummy-head-based strategy employs binaural impulse responses measured in multiple orientations (MOBRIRs) and hereby contains the required set of binaural cues of the dummy head for dynamic (head-tracked) rendering.The Ambisonics-based strategy uses the room impulse response measured by a first-order Ambisonic microphone (ARIR) in a single orientation, which is upscaled from its weak directional resolution to higher orders using the Ambisonic spatial decomposition method (ASDM).Dynamic binaural rendering is then accomplished separately through an Ambisonic rotator and binaural renderer.
Our experiment successfully compared the perceptual performance of both strategies, for static rendering in terms of timbre and spatial mapping, and for dynamic rendering in terms of the continuity of the result. We found that the 5th-order Ambisonics-based rendering strategy (ASDM) outperformed the dummy-head-based rendering for resolutions coarser than 30°. By this, and by its clear separation of room-related from head-related aspects, we consider ASDM binaural rendering to be the versatile high-quality option for dynamic binaural rendering based on measured room responses. We published audio examples, an example implementation, and all experimental response data for reproducible research.
Concerning the dummy-head-based strategy with MOBRIRs, we summarized the analysis and made available all experimental response data from previous experiments [20]. The results indicate that broadband linear interpolation between the different dummy-head orientations is always outperformed by linear interpolation with switched high-frequency phase (LISHPh). This processing strategy achieved a convincing rendering quality with orientation resolutions of 15° and 30°, when compared to a 1° linearly interpolated reference.
The underlying RIR measurements were taken from the IEM production studio with a reverberation time of T60 ≈ 0.4 s. This specific choice of room was found suitable to study the MOBRIR interpolation and ASDM binaural rendering, as its pronounced direct and early parts are expected to be critical considering timbral or spatial mapping deficiencies, and its reverberation is suitable to evoke externalized impressions in typical office environments. Note that neither of the investigated rendering strategies was specifically optimized for the specific room and signals. Although not formally tested, we assume the results to hold for a variety of other acoustic environments. An example patch and a set of more reverberant BRIRs (RT = 2.8 s) are provided online (https://phaidra.kug.ac.at/o:100863).

Appendix B. ASDM MATLAB Source Code
The MATLAB source code of the proposed ASDM method can be found in Listing 1. Please note that it requires the Spherical-Harmonic-Transform library, which can be obtained from https://github.com/polarch/Spherical-Harmonic-Transform.
Figure 1. (a) Loudspeaker layout of the IEM production studio. (b) KU100 dummy head in the center of the loudspeaker setup.
(b) Block diagram of the LISHPh interpolation method. The lower band is interpolated linearly in the time domain, while the high-frequency band is interpolated in the short-time Fourier domain. The high-frequency signal is obtained by a magnitude interpolation, and the phase for reconstruction is chosen depending on the interpolation weight α.

Figure 2. Block diagram for BRIR selection, convolution, and continuous interpolation for one ear in a dynamic binaural rendering scenario.

Figure 3. Median (markers) and 95% confidence intervals (solid lines) of the ratings from all 10 subjects for testing the perceived difference to the reference (linearly interpolated BRIRs of 1° resolution). Settings of the algorithm are indicated by ∆ϕ/f_c, where lin denotes a broadband linear interpolation.

Figure 5. Energy decay relief (EDR) in a third-octave band with a center frequency of 2 kHz. Solid and dashed lines indicate the order-partitioned EDR before and after equalization as defined in Equations (6) and (7), respectively.
(a) Timbre and spatial mapping of the pooled data for all tested directions, for a frontal source and head orientations of ϕ = {0, −35, 12, −15, 22.5}°. (b) Continuity of the pooled data (virtual loudspeaker at front and side, respectively).

Figure 6. Median (markers) and 95% confidence intervals (solid lines) of the ratings from all nine subjects for testing the perceived difference to the reference (linearly interpolated BRIRs of 1° resolution). Settings of the algorithm are indicated by ∆ϕ/f_c, where lin denotes a broadband linear interpolation.

Figure A1. Implementation of the high-frequency patch in Pd.

Table 1. p-values (Wilcoxon signed-rank test with Bonferroni-Holm correction) for the ratings of timbre and spatial mapping in Experiment I. The upper triangle corresponds to timbre, the lower triangle to spatial mapping. Insignificant differences (p-values ≥ 0.05) are indicated by bold numbers.

Table 2. p-values (Wilcoxon signed-rank test with Bonferroni-Holm correction) for the ratings of continuity in Experiment I. The upper triangle corresponds to pink noise, the lower triangle to music as the source signal. Insignificant differences (p-values ≥ 0.05) are indicated by bold numbers.

Table 3. p-values (Wilcoxon signed-rank test with Bonferroni-Holm correction) for the ratings of timbre and spatial mapping in Experiment II. The upper triangle corresponds to timbre, the lower triangle to spatial mapping. Insignificant differences (p-values ≥ 0.05) are indicated by bold numbers.

Table 4. p-values (Wilcoxon signed-rank test with Bonferroni-Holm correction) for the ratings of continuity in Experiment II. The upper triangle corresponds to pink noise, the lower triangle to music as the source signal. Insignificant differences (p-values ≥ 0.05) are indicated by bold numbers.