Auralization of High-Order Directional Sources from First-Order RIR Measurements

Can auralization of a highly directional source in a room succeed if it employs a room impulse response (RIR) measurement or simulation that relies only on a first-order directional source? This contribution presents the model and evaluation of a source-and-receiver-directional Ambisonics RIR capture and processing approach (SRD ARIR) based on a small set of responses from a first-order source to a first-order receiver. To enhance the directional resolution, we extend the Ambisonic spatial decomposition method (ASDM) to upscale the first-order resolution of both source and receiver to higher orders. To evaluate the method, a listening experiment was conducted based on first-order SRD ARIR measurements, into which the higher-order directivity of an icosahedral loudspeaker (IKO) was inserted as a directional source with well-studied perceptual effects. The results show how the proposed method performs and how it compares to alternative rendering methods based on measurements taken in the same acoustic environment, e.g., multiple-orientation binaural room impulse responses (MOBRIRs) from the physical IKO to the KU-100 dummy head, or higher-order SRD ARIRs from the IKO to the em32 Eigenmike. For optimal externalization, our experiments exploit the benefits of virtual reality, using a highly realistic visualization on a head-mounted display and a user interface to report localization by placing interactive visual objects in the virtual space.


Introduction
A modular and interactive measurement-based auralization of an acoustic environment benefits from a separation into its source-dependent, room-dependent, and receiver-dependent parts. Typically, the room-dependent part is characterized by a point-to-point room impulse response (RIR), which often assumes that source and receiver are both omnidirectional [1]. However, employing variable source and receiver directivities during auralization requires a more flexible room description that facilitates interfacing between the three parts.
Why source directivity matters: Otondo and Rindel [2] demonstrated that room acoustic parameters change with source directivity, and results from listening experiments indicate that listeners perceive the resulting changes in loudness, reverberance, and clarity. Vigeant et al. [3] found that including source directivity can increase the realism of auralization results. Laitinen et al. [4] showed that source directivity can be used to alter the direct-to-reverberant ratio, which strongly correlates with the perceived distance of a source. Another study employing a source

Auralization of Arbitrary Source Directivity from Measurements
The block diagram of an auralization scenario employing the SRD ARIR is shown in Figure 1 and is similar to that in [26]. Source and receiver directivities are interfaced with the room through Ambisonic input and output signals. Here, the receiver directivity is represented by HRTFs, and a state-of-the-art binaural renderer, e.g., the MagLS renderer as outlined in [18], is used for obtaining the signals that are fed to headphones.

[Figure 1: Block diagram of the auralization scenario (blocks include the instrument directivity and the SRD ARIR multichannel convolution).] Here, $x(t)$ is the source signal, $h_{n'm'nm}(t)$ is the SRD ARIR, and $\gamma^\mathrm{S}_{nm}(t)$ and $\gamma^\mathrm{R}_{n'm'}(t)$ are the directional impulse responses of the source and receiver (e.g., ear directivity), respectively.
The concept of the SRD ARIR and its use for auralization are described in Section 2.1. The proposed hardware-efficient (low-order) SRD ARIR measurement method and the upscaling to higher orders are discussed in Section 2.2.

Theory behind Source and Receiver Directional (SRD) RIRs
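Based on the image source method [27], or more generally, the geometrical theory of diffraction [28], physically consistent room acoustic models including edge diffraction can be devised from geometric sound propagation paths, see, e.g., [29][30][31]. Accordingly, we may write any source-and-receiver-directional room impulse response (SRD RIR) as a sum of discrete propagation paths with index $i$,
$$h(\boldsymbol{\theta}_\mathrm{R}, t, \boldsymbol{\theta}_\mathrm{S}) = \sum_i a_i\, \delta(t-\tau_i)\, \delta(1-\boldsymbol{\theta}_\mathrm{R}^\mathrm{T}\boldsymbol{\theta}_{\mathrm{R},i})\, \delta(1-\boldsymbol{\theta}_\mathrm{S}^\mathrm{T}\boldsymbol{\theta}_{\mathrm{S},i}), \tag{1}$$
where each path is characterized by a direction of arrival (DOA) $\boldsymbol{\theta}_{\mathrm{R},i}$ and a direction of departure (DOD) $\boldsymbol{\theta}_{\mathrm{S},i}$; an arrival time $\tau_i = r_i/c$ given by its geometric length $r_i$ and the speed of sound $c$; and its attenuation $a_i$ through reflection and diffraction on rigid or sound-soft surfaces. For complex surface impedances, multiplication by $a_i$ theoretically becomes convolution with an impulse response, $a_i \to a_i(t)\,*$, or it can be expanded into additional paths in discrete-time processing, as preferred here. All vectors describing continuous or discrete directions are unit direction vectors $\boldsymbol{\theta} = [\cos(\varphi)\sin(\vartheta), \sin(\varphi)\sin(\vartheta), \cos(\vartheta)]^\mathrm{T}$, with $\varphi$ denoting the azimuth and $\vartheta$ the zenith angle; $\delta(1-\boldsymbol{\theta}^\mathrm{T}\boldsymbol{\theta}_i)$ is the spherical delta function centered at $\boldsymbol{\theta}_i$, and the labels S and R refer to source and receiver.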
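We assume a signal $x(t)$ that is emitted by a source with the directivity $g_\mathrm{S}(\boldsymbol{\theta}_\mathrm{S})$ and picked up with the receiver directivity $g_\mathrm{R}(\boldsymbol{\theta}_\mathrm{R})$. The resulting signal is $y(t) = h(t) * x(t)$, with the impulse response
$$h(t) = \int_{\mathbb{S}^2}\int_{\mathbb{S}^2} g_\mathrm{R}(\boldsymbol{\theta}_\mathrm{R})\, h(\boldsymbol{\theta}_\mathrm{R}, t, \boldsymbol{\theta}_\mathrm{S})\, g_\mathrm{S}(\boldsymbol{\theta}_\mathrm{S})\, \mathrm{d}\boldsymbol{\theta}_\mathrm{S}\, \mathrm{d}\boldsymbol{\theta}_\mathrm{R}. \tag{2}$$
The RIR $h(t)$ is obtained by weighting the SRD RIR $h(\boldsymbol{\theta}_\mathrm{R}, t, \boldsymbol{\theta}_\mathrm{S})$ with both the source and receiver directivities $g_\mathrm{S}(\boldsymbol{\theta}_\mathrm{S})$ and $g_\mathrm{R}(\boldsymbol{\theta}_\mathrm{R})$, assuming they are frequency-independent; for frequency-dependent directivities, multiplication by the directivities is replaced by convolution with the directional impulse responses of source, $g_\mathrm{S}(\boldsymbol{\theta}_\mathrm{S}) \to *\,g_\mathrm{S}(t, \boldsymbol{\theta}_\mathrm{S})$, and receiver, $g_\mathrm{R}(\boldsymbol{\theta}_\mathrm{R}) \to g_\mathrm{R}(t, \boldsymbol{\theta}_\mathrm{R})\,*$, respectively.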
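In the Ambisonic domain: Equation (1) is transformed into the spherical harmonic domain by integrating its dependency on the receiving and sending directions against the spherical harmonics. As a result, the spherical delta functions are replaced by spherical harmonics (SH) evaluated at the DOA $\boldsymbol{\theta}_{\mathrm{R},i}$ or DOD $\boldsymbol{\theta}_{\mathrm{S},i}$ of the respective propagation path $i$. We get
$$h(\boldsymbol{\theta}_\mathrm{R}, t, \boldsymbol{\theta}_\mathrm{S}) = \sum_{n'=0}^{\infty}\sum_{m'=-n'}^{n'}\sum_{n=0}^{\infty}\sum_{m=-n}^{n} Y_{n'}^{m'}(\boldsymbol{\theta}_\mathrm{R})\, h_{n'm'nm}(t)\, Y_n^m(\boldsymbol{\theta}_\mathrm{S}), \qquad h_{n'm'nm}(t) = \sum_i a_i\, \delta(t-\tau_i)\, Y_{n'}^{m'}(\boldsymbol{\theta}_{\mathrm{R},i})\, Y_n^m(\boldsymbol{\theta}_{\mathrm{S},i}), \tag{3}$$
where $Y_n^m(\boldsymbol{\theta})$ are the SH of order $n$ and degree $m$, and $h_{n'm'nm}(t)$ denotes a modeled source-and-receiver-directional room impulse response in Ambisonics (SRD ARIR), which we actually measure later on (see Section 2.2).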
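To make Equation (3) concrete, the following Python sketch builds a discrete-time first-order SRD ARIR from a hypothetical list of propagation paths (amplitude $a_i$, delay $\tau_i$, DOA, DOD). It assumes orthonormal real SH in ACN channel order without Condon-Shortley phase and rounds each delay to the nearest sample; the function names are our own, and this is an illustration rather than the authors' implementation.

```python
import numpy as np

def unit_vector(azi, zen):
    """Unit direction vector [cos(azi) sin(zen), sin(azi) sin(zen), cos(zen)]^T."""
    return np.array([np.cos(azi) * np.sin(zen), np.sin(azi) * np.sin(zen), np.cos(zen)])

def sh_first_order(theta):
    """Orthonormal real SH up to order 1 in ACN order (Y_0^0, Y_1^-1, Y_1^0, Y_1^1)."""
    x, y, z = theta
    c0 = 1.0 / np.sqrt(4.0 * np.pi)
    c1 = np.sqrt(3.0 / (4.0 * np.pi))
    return np.array([c0, c1 * y, c1 * z, c1 * x])

def srd_arir_from_paths(paths, fs, length):
    """Discrete-time Eq. (3): one tap per path, weighted by Y(DOA) Y(DOD)^T.

    paths: iterable of (a_i, tau_i in seconds, DOA unit vector, DOD unit vector)
    returns: array (4, 4, length) = receiver SH channel x source SH channel x time
    """
    h = np.zeros((4, 4, length))
    for a_i, tau_i, doa, dod in paths:
        k = int(round(tau_i * fs))  # nearest-sample arrival time
        if 0 <= k < length:
            h[:, :, k] += a_i * np.outer(sh_first_order(doa), sh_first_order(dod))
    return h

# Hypothetical direct path: attenuation 0.5, 10 ms delay, source in front of the receiver
paths = [(0.5, 0.01, unit_vector(0.0, np.pi / 2), unit_vector(np.pi, np.pi / 2))]
h = srd_arir_from_paths(paths, fs=48000, length=1000)
```

Multiplying the (0, 0) entry by $4\pi$ returns the path amplitude, which is the omnidirectional-source, omnidirectional-receiver case of Equation (3).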
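The directivities $g_A(\boldsymbol{\theta}_A)$, $A \in \{\mathrm{S}, \mathrm{R}\}$, can be represented by SH expansions:
$$g_\mathrm{R}(\boldsymbol{\theta}_\mathrm{R}) = \sum_{n'=0}^{N'}\sum_{m'=-n'}^{n'} \gamma^\mathrm{R}_{n'm'}\, Y_{n'}^{m'}(\boldsymbol{\theta}_\mathrm{R}), \qquad g_\mathrm{S}(\boldsymbol{\theta}_\mathrm{S}) = \sum_{n=0}^{N}\sum_{m=-n}^{n} \gamma^\mathrm{S}_{nm}\, Y_n^m(\boldsymbol{\theta}_\mathrm{S}), \tag{4}$$
where $N'$ and $N$ are the maximum orders used to represent the receiver and source directivity, respectively. By inserting Equations (3) and (4) into Equation (2), the integrals in Equation (2) invoke the orthogonality property $\int_{\mathbb{S}^2} Y_n^m(\boldsymbol{\theta}_A)\, Y_{n''}^{m''}(\boldsymbol{\theta}_A)\, \mathrm{d}\boldsymbol{\theta}_A = \delta_{nn''}\delta_{mm''}$ for both source and receiver, yielding a neat sum for the RIR:
$$h(t) = \sum_{n'=0}^{N'}\sum_{m'=-n'}^{n'}\sum_{n=0}^{N}\sum_{m=-n}^{n} \gamma^\mathrm{R}_{n'm'}\, h_{n'm'nm}(t)\, \gamma^\mathrm{S}_{nm}. \tag{5}$$
For natural, frequency-dependent directivities, multiplication by the SH coefficients of the source and receiver directivity, $\gamma^\mathrm{S}_{nm}$ and $\gamma^\mathrm{R}_{n'm'}$, is replaced by convolution with the coefficients of their directional impulse responses, $\gamma^\mathrm{S}_{nm} \to *\,\gamma^\mathrm{S}_{nm}(t)$ and $\gamma^\mathrm{R}_{n'm'} \to \gamma^\mathrm{R}_{n'm'}(t)\,*$, now in the SH domain.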
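Equation (5) is a double contraction of the SRD ARIR with the two coefficient vectors. A minimal NumPy sketch, assuming the SRD ARIR is stored as an array of shape (receiver SH channels, source SH channels, time); the function names are hypothetical, and the second function covers the frequency-dependent case by replacing each product with a convolution.

```python
import numpy as np
from scipy.signal import fftconvolve

def rir_from_srd_arir(h_srd, gamma_r, gamma_s):
    """Eq. (5) with frequency-independent directivities.

    h_srd: (R, S, T) SRD ARIR with R = (N'+1)^2 receiver and S = (N+1)^2 source channels
    gamma_r: (R,) receiver SH coefficients; gamma_s: (S,) source SH coefficients
    """
    return np.einsum('r,rst,s->t', gamma_r, h_srd, gamma_s)

def rir_from_srd_arir_fd(h_srd, gamma_r_t, gamma_s_t):
    """Frequency-dependent variant: products become convolutions with the
    directional impulse responses gamma_r_t (R, Lr) and gamma_s_t (S, Ls)."""
    R, S, T = h_srd.shape
    out = np.zeros(gamma_r_t.shape[1] + T + gamma_s_t.shape[1] - 2)
    for r in range(R):
        for s in range(S):
            out += fftconvolve(fftconvolve(gamma_r_t[r], h_srd[r, s]), gamma_s_t[s])
    return out
```

As a sanity check, an omnidirectional source and receiver ($g = 1$) have the single nonzero coefficient $\gamma_{00} = \sqrt{4\pi}$, so both functions return $4\pi$ times the (0, 0) channel.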

Measuring the SRD ARIR: Proposed Method
This section presents the proposed efficient SRD ARIR measurement and post-processing in detail. Measuring the MIMO RIRs: Here, the multiple-input multiple-output (MIMO) RIRs are measured between a 6-channel compact spherical loudspeaker array (Cubelet) with a radius of 7.5 cm and the 4-channel B-format microphone array (ST450), see Figure 2. The loudspeaker array is equipped with Fountek FR58EX drivers (2 inch coil diameter with ±3 mm maximum linear excursion). A more detailed description of the arrays used and high-resolution directivity measurements can be found online (https://phaidra.kug.ac.at/o:104374).
Omni to omni RIR: Depending on the array geometries, an approximation of the point-to-point omnidirectional RIR $h_0(t)$ can be obtained by transforming both sides of the MIMO RIRs into the SH domain and extracting the response between the 0th-order components. If the array elements are arranged according to a spherical t-design, an approximation of $h_0(t)$ is obtained by summing over all channels in the array domain. Please note that the direct path in $h_0(t)$ is ideally a single impulse. However, due to the non-ideal responses of the loudspeakers and microphones as well as the array geometries, even the direct path is spread in time. A possible approach for improving the omnidirectional response is outlined in [32], but it is not employed here. Denoising $h_0(t)$ is optional but recommended when unrealistically long reverberation times occur. We suggest a denoising strategy similar to [33], which is derived in Appendix A.
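The t-design shortcut for $h_0(t)$ amounts to averaging over channels. A minimal sketch, assuming the MIMO RIRs are stored as an array of shape (P loudspeakers, Q microphone channels, T samples); the mean instead of a plain sum is our choice so that identical responses on all channel pairs return that response, and it only changes the overall gain. For a B-format receiver such as the ST450 output, the W channel already is the 0th-order receiver component, so only the loudspeaker side is averaged.

```python
import numpy as np

def omni_rir(h_mimo, w_channel=None):
    """Approximate omni-to-omni RIR h0(t) from MIMO RIRs of shape (P, Q, T).

    Source side: mean over the P loudspeakers (assumed to form a t-design).
    Receiver side: mean over the Q capsules (A-format t-design), or, if the
    receiver channels are B-format, pick the W channel given by w_channel.
    """
    h_mimo = np.asarray(h_mimo, dtype=float)
    if w_channel is not None:
        return h_mimo[:, w_channel, :].mean(axis=0)
    return h_mimo.mean(axis=(0, 1))
```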
DOA and DOD estimation: Due to the assumption of a temporally and spatially sparse RIR, we assign a direction of arrival (DOA) $\boldsymbol{\theta}_\mathrm{R}(t)$ and a direction of departure (DOD) $\boldsymbol{\theta}_\mathrm{S}(t)$ to each discrete time instant $t$ of $h_0(t)$. While, due to reciprocity, any DOA estimation method (e.g., those summarized in [34]) can be employed for both DOA and DOD estimation, we use the pseudo-intensity vector (PIV) approach presented by Jarrett et al. [35] for the DOA, and an $\boldsymbol{r}_\mathrm{E}$-vector measure [36] related to the magnitude sensor response (MSR) by Politis et al. [37] for the DOD.
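The DOAs are calculated for frequencies between 100 Hz and 2.5 kHz. Here, the upper frequency limit is chosen below the spatial aliasing frequency $f_\mathrm{a} = \frac{c}{2\pi r_\mathrm{ST450}} \approx 3.6$ kHz for $r_\mathrm{ST450} = 1.5$ cm (defined by $k r_\mathrm{ST450} = 1$). For the estimation of the DODs, a less restrictive rule is assumed, as it is less affected by linear interference. Here, the upper frequency limit is $f_\mathrm{a} = \frac{c}{\pi r_\mathrm{cubl.}} \approx 1.4$ kHz for $r_\mathrm{cubl.} = 7.5$ cm (inter-transducer arc length roughly below half a wavelength, $\frac{\pi}{2} r_\mathrm{cubl.} \le \frac{c}{2f}$). The low cut at 100 Hz minimizes low-frequency disturbances in both the DOA and DOD estimation. The DODs and DOAs become
$$\boldsymbol{\theta}_\mathrm{S}(t) = \frac{\boldsymbol{v}_\mathrm{S}(t)}{\|\boldsymbol{v}_\mathrm{S}(t)\|}, \qquad \boldsymbol{v}_\mathrm{S}(t) = \mathcal{F}_L\Big\{\sum_{p=1}^{P} \boldsymbol{\theta}_p\, \big[\mathcal{F}_{f_l-f_u}\{h_{p,0}(t)\}\big]^2\Big\}, \tag{6}$$
$$\boldsymbol{\theta}_\mathrm{R}(t) = \frac{\boldsymbol{v}_\mathrm{R}(t)}{\|\boldsymbol{v}_\mathrm{R}(t)\|}, \qquad \boldsymbol{v}_\mathrm{R}(t) = \mathcal{F}_L\Big\{\sum_{p=1}^{P} \mathcal{F}_{f_l-f_u}\{h_{p,0}(t)\}\, \mathcal{F}_{f_l-f_u}\{\boldsymbol{h}_{p,XYZ}(t)\}\Big\}, \tag{7}$$
where $\boldsymbol{\theta}_p$ indicates the direction of the $p$-th loudspeaker, $P = 6$ is the number of array loudspeakers, $h_{p,0}$ is the RIR between the $p$-th loudspeaker and the W channel of the ST450 array, $\|\cdot\|$ is the norm operator, and $\boldsymbol{h}_{p,XYZ}$ are the first-order channels of the ST450 microphone array. Both the DOA and DOD are computed using a zero-phase band limitation (e.g., by MATLAB's filtfilt with a 4th-order Butterworth band pass), denoted by $\mathcal{F}_{f_l-f_u}$, and a zero-phase temporal smoothing $\mathcal{F}_L$ of the resulting estimates using a moving-average Hann window in the interval $[-L/2, L/2]$ for $L = 32$.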
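The band limitation, smoothing, and the two direction estimators ($\boldsymbol{r}_\mathrm{E}$-vector DOD and pseudo-intensity-vector DOA) can be sketched in Python with SciPy. This is a sketch under stated assumptions: a speed of sound of 343 m/s, second-order sections with `sosfiltfilt` in place of MATLAB's filtfilt (same forward-backward zero-phase idea, numerically safer at 100 Hz), a Hann window of $L + 1$ taps for $\mathcal{F}_L$, B-format X, Y, Z channels that are positive for sound arriving from the positive axis, and array layouts and function names of our own choosing.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt
from scipy.signal.windows import hann

C = 343.0                              # speed of sound in m/s (assumed)
F_UP_DOA = C / (2 * np.pi * 0.015)     # k r_ST450 = 1, about 3.6 kHz
F_UP_DOD = C / (np.pi * 0.075)         # arc-length rule for r_cubl. = 7.5 cm, about 1.4-1.5 kHz

def bandlimit(x, fs, f_lo, f_hi):
    """Zero-phase band limitation: 4th-order Butterworth band pass, forward-backward."""
    sos = butter(4, [f_lo, f_hi], btype='bandpass', fs=fs, output='sos')
    return sosfiltfilt(sos, x, axis=-1)

def smooth(v, L=32):
    """Zero-phase smoothing: Hann-weighted moving average over [-L/2, L/2] samples."""
    w = hann(L + 1)
    w /= w.sum()
    return np.apply_along_axis(lambda s: np.convolve(s, w, mode='same'), -1, v)

def normalize(v):
    """Scale each column (time sample) of a (3, T) array to unit length."""
    return v / np.maximum(np.linalg.norm(v, axis=0, keepdims=True), 1e-12)

def estimate_dod(h_w, spk_dirs, fs, f_lo=100.0, f_hi=1400.0):
    """r_E-vector DOD. h_w: (P, T) loudspeaker-to-W RIRs, spk_dirs: (P, 3) unit vectors."""
    energy = bandlimit(h_w, fs, f_lo, f_hi) ** 2       # (P, T) band-limited energies
    return normalize(smooth(spk_dirs.T @ energy))      # (3, T)

def estimate_doa(h_w, h_xyz, fs, f_lo=100.0, f_hi=2500.0):
    """Pseudo-intensity-vector DOA. h_xyz: (P, 3, T) first-order (X, Y, Z) channels."""
    w = bandlimit(h_w, fs, f_lo, f_hi)                 # (P, T)
    xyz = bandlimit(h_xyz, fs, f_lo, f_hi)             # (P, 3, T)
    piv = np.einsum('pt,pkt->kt', w, xyz)              # summed over loudspeakers
    return normalize(smooth(piv))                      # (3, T)
```

The DOD uses squared magnitudes (energies) weighted by the loudspeaker directions, whereas the DOA relies on the product of the W channel with the dipole channels; both are normalized per sample after smoothing.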
SRD ARIR: From Equation (3), and assuming a single propagation path at a time (i.e., assuming temporal disjointness), a first version of the upscaled SRD ARIR becomes
$$\tilde{h}_{n'm'nm}(t) = Y_{n'}^{m'}(\boldsymbol{\theta}_\mathrm{R}(t))\, h_0(t)\, Y_n^m(\boldsymbol{\theta}_\mathrm{S}(t)), \tag{8}$$
where the maximum orders $n' \le N'$ and $n \le N$ can be chosen freely. Multiplying the omnidirectional RIR $h_0(t)$ by the SH evaluated at the estimated DOA $\boldsymbol{\theta}_\mathrm{R}(t)$ and DOD $\boldsymbol{\theta}_\mathrm{S}(t)$ accordingly sharpens the directional resolution of the measured SRD ARIR. However, the implicit assumption of disjointness (a single DOA and DOD per time sample) does not necessarily hold in the late diffuse part of the response. As a result, the temporal fluctuations of $\boldsymbol{\theta}_\mathrm{R}(t)$ and $\boldsymbol{\theta}_\mathrm{S}(t)$ cause amplitude modulation that potentially corrupts narrow-band spectral properties in $\tilde{h}_{n'm'nm}(t)$. A typical result is the mixing of the longer low-frequency reverberation tails into higher frequencies, causing unnaturally long reverberation there [21,38], especially as the orders $n'$, $n$ increase. We propose a scheme for spectral correction that is similar to the one in [38] but adapted to SRD ARIR processing.
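Equation (8) multiplies $h_0(t)$ by the SH evaluated at the time-varying DOA and DOD. The sketch below assumes orthonormal real SH in ACN order, built from `scipy.special.lpmv` with the Condon-Shortley phase removed, and direction estimates given as unit vectors of shape (3, T); the helper names are hypothetical.

```python
import numpy as np
from math import factorial
from scipy.special import lpmv

def real_sh(order, azi, zen):
    """Orthonormal real SH in ACN order; returns ((order+1)**2, number of directions)."""
    azi, zen = np.atleast_1d(azi), np.atleast_1d(zen)
    Y = np.zeros(((order + 1) ** 2, azi.size))
    for n in range(order + 1):
        for m in range(-n, n + 1):
            am = abs(m)
            norm = np.sqrt((2 * n + 1) / (4 * np.pi) * factorial(n - am) / factorial(n + am))
            leg = (-1) ** am * lpmv(am, n, np.cos(zen))  # remove Condon-Shortley phase
            if m > 0:
                trig = np.sqrt(2.0) * np.cos(m * azi)
            elif m < 0:
                trig = np.sqrt(2.0) * np.sin(am * azi)
            else:
                trig = np.ones_like(azi)
            Y[n * n + n + m] = norm * leg * trig
    return Y

def to_angles(theta):
    """Azimuth and zenith of unit vectors stored as a (3, T) array."""
    x, y, z = theta
    return np.arctan2(y, x), np.arccos(np.clip(z, -1.0, 1.0))

def upscale_srd_arir(h0, doa, dod, order_r, order_s):
    """Eq. (8): h~[n'm', nm, t] = Y_{n'}^{m'}(DOA(t)) h0(t) Y_n^m(DOD(t))."""
    Yr = real_sh(order_r, *to_angles(doa))   # ((N'+1)^2, T)
    Ys = real_sh(order_s, *to_angles(dod))   # ((N+1)^2, T)
    return Yr[:, None, :] * h0[None, None, :] * Ys[None, :, :]
```

Summing the squared output over all degrees of one order pair $(n', n)$ yields $\frac{(2n'+1)(2n+1)}{(4\pi)^2} h_0^2(t)$ by Unsöld's theorem, which is the power consistency the subsequent spectral correction builds on.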
In theory, the expected temporal energy decay in an ideal (isotropic) diffuse field should be identical for any source and receiver of random-energy-efficiency-normalized directivity, such as the spherical harmonics; this must also hold after decomposition into frequency bands. However, less restrictively, even in non-isotropic diffuse fields, the expected energy decay is identical for subsets of source and receiver directivities that are (pseudo-)omnidirectional: a formal derivation in [38] showed that quadratic summation across same-order spherical harmonics is omnidirectional. Thus, from Equation (8) and with Unsöld's theorem [39], $\sum_m |Y_n^m(\boldsymbol{\theta})|^2 = \frac{2n+1}{4\pi}$ for $\boldsymbol{\theta} \in \mathbb{S}^2$, we obtain consistent powers of processed and original RIRs,
$$\sum_{m'=-n'}^{n'}\sum_{m=-n}^{n} \big[\tilde{h}_{n'm'nm}(t)\big]^2 = \frac{(2n'+1)(2n+1)}{(4\pi)^2}\, h_0^2(t). \tag{9}$$
To additionally enforce that the short-term energies in $[\tilde{h}_{n'm'nm}(t)]^2$ become spectrally consistent with those of $h_0^2(t)$, third-octave filtering is useful, where the $b$-th sub-band signal $\mathcal{F}_b\{h_0(t)\}$ with center frequency $f_b$ is obtained from a bank of zero-phase filters $\mathcal{F}_b$ that perfectly reconstructs $h_0(t) = \sum_b \mathcal{F}_b\{h_0(t)\}$. For every sub-band $b$ and the orders $n'$, $n$, an energy decay of the upscaled SRD ARIR $\mathcal{F}_b\{\tilde{h}_{n'm'nm}(t)\}$ consistent with the original one of $\mathcal{F}_b\{h_0(t)\}$ is enforced by the envelope correction
$$h_{n'm'nm}(t) = \sum_b \mathcal{F}_b\{\tilde{h}_{n'm'nm}(t)\} \sqrt{\frac{(2n'+1)(2n+1)}{(4\pi)^2}\, \frac{\mathcal{F}_T\big\{[\mathcal{F}_b\{h_0(t)\}]^2\big\}}{\mathcal{F}_T\big\{\sum_{m'}\sum_{m} [\mathcal{F}_b\{\tilde{h}_{n'm'nm}(t)\}]^2\big\}}}, \tag{10}$$
where $\mathcal{F}_T\{\cdot\}$ denotes temporal averaging with a time constant $T$ (e.g., 46 ms). The simplified MATLAB source code of the proposed SRD ARIR method can be found in Appendix B.
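A sketch of the per-band envelope correction for one order pair $(n', n)$, written for illustration only. Two substitutions are assumptions made for brevity: the zero-phase, perfectly reconstructing filter bank is realized by FFT brick-wall masks between user-supplied band edges rather than by third-octave filters, and $\mathcal{F}_T$ is a centered rectangular moving average of about $T$ seconds (the signal must be longer than that window). The function names are hypothetical.

```python
import numpy as np

def band_split(x, fs, edges):
    """Zero-phase, perfectly reconstructing band split via FFT brick-wall masks
    (stand-in for a third-octave filter bank). Returns shape (B,) + x.shape."""
    n = x.shape[-1]
    X = np.fft.rfft(x, axis=-1)
    f = np.fft.rfftfreq(n, 1.0 / fs)
    bounds = np.concatenate(([0.0], edges, [np.inf]))
    return np.stack([np.fft.irfft(X * ((f >= lo) & (f < hi)), n=n, axis=-1)
                     for lo, hi in zip(bounds[:-1], bounds[1:])])

def moving_average(x, fs, T_avg):
    """Temporal averaging F_T: centered rectangular window of about T_avg seconds."""
    L = 2 * int(round(T_avg * fs / 2)) + 1      # odd length keeps it zero-phase
    w = np.ones(L) / L
    return np.apply_along_axis(lambda s: np.convolve(s, w, mode='same'), -1, x)

def envelope_correct(h_tilde, h0, n_r, n_s, fs, edges, T_avg=0.046, eps=1e-30):
    """Per-band envelope correction for one order pair; h_tilde: (2n'+1, 2n+1, T)."""
    k = (2 * n_r + 1) * (2 * n_s + 1) / (4 * np.pi) ** 2
    Hb = band_split(h_tilde, fs, edges)                                # (B, 2n'+1, 2n+1, T)
    H0b = band_split(h0, fs, edges)                                    # (B, T)
    target = k * moving_average(H0b ** 2, fs, T_avg)                   # desired band energy
    actual = moving_average(np.sum(Hb ** 2, axis=(1, 2)), fs, T_avg)   # current band energy
    gain = np.sqrt(target / (actual + eps))
    return np.sum(Hb * gain[:, None, None, :], axis=0)                 # recombine the bands
```

If the directions were constant over time, the upscaled response would already satisfy the power consistency in every band, the gains would all be one, and the correction would return its input unchanged; only fluctuating DOA and DOD estimates produce gains that differ from one.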

Listening Experiment: Comparative Study
Due to its well-studied perceptual effects [5,7], its well-defined third-order beamforming [24], and its readily available high-resolution directivity measurements (see, e.g., [13]), the icosahedral loudspeaker (IKO) is employed as the source with controllable directivity throughout the listening experiment. The experiment aimed at evaluating the authenticity and perceived externalized localization achieved with the proposed SRD ARIR method and at comparing it with other auralization techniques. The five measurement-based auralization techniques tested (virtualizations of the IKO) are described in Section 3.1. An overview of the design and implementation of the listening experiment is presented in Section 3.2, and the statistical analysis of the ratings and the corresponding results are presented in Section 3.3.

[Figure 3: Block diagram of the auralization techniques (blocks include the IKO Filters and the Ambisonics renderer). Gray-shaded boxes indicate functional blocks that are shared between different techniques. The boxes delimited by bold lines mark techniques in which the actual Room Transfer Function is not directly measured but obtained by processing (e.g., upscaling by an ASDM method) of the underlying measurements as proposed in Section 2.2. A detailed description of the techniques can be found throughout Section 3.1.]
All underlying measurement data, a short description of the measurement set-up, directivity measurements, and response data as well as the evaluation of the listening experiment are made available online (https://phaidra.kug.ac.at/o:104417).

Auralization Techniques: Virtualizations of the IKO
As acoustic virtualizations of the IKO, we compared five different auralization techniques in the listening experiment. The block diagram in Figure 3 depicts these techniques, and their details are given below. The ear signals $y_\mathrm{l}(t)$ and $y_\mathrm{r}(t)$ are obtained by running the source signal through several processing stages: (i) beam encoding, (ii) IKO control, (iii) directivity, (iv) room transfer function, (v) Ambisonics encoding, and finally (vi) dynamic binaural rendering.
With the source signal $x(t)$ and the desired beam direction $\varphi_\mathrm{S}$, the frequency-independent encoder outputs the order-$N_\mathrm{B}$ Ambisonics representation of the beam. The processing in the Beam Encoding stage is independent of the auralization technique.
In the IKO Control stage, the $(N_\mathrm{B} + 1)^2$ channels are mapped to the 20 loudspeaker signals of the IKO using the frequency-dependent IKO Filters. A measurement-based approach for designing the multiple-input multiple-output (MIMO) IKO control filters is presented in [40]. It is based on laser Doppler vibrometry measurements and allows for control of side-lobe suppression and excursion-limiting filter design. The designed beam patterns were verified by far-field extrapolated measurements from a surrounding microphone array. The IKO's beamforming can be analyzed with the open-source tool balloon_holo, which is part of IEM's Open Data Project (https://opendata.iem.at/projects/dirpat/). All underlying measurements (laser Doppler vibrometry and pressure of a surrounding microphone array) as well as the corresponding IKO Filters can be found online (https://phaidra.kug.ac.at/o:67609), and a summary is presented in [13]; here, we used the IEM IKO3 (https://phaidra.kug.ac.at/o:75316).
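The IKO Control stage is a MIMO FIR convolution: every loudspeaker signal is the sum of all Ambisonic beam channels, each convolved with its own control filter. A minimal sketch, assuming the filters are stored as an array of shape (20 loudspeakers, $(N_\mathrm{B}+1)^2$ inputs, filter taps); the layout and function name are hypothetical, and real-time plug-ins such as mcfx would use partitioned convolution instead.

```python
import numpy as np
from scipy.signal import fftconvolve

def mimo_convolve(signals, filters):
    """y_l(t) = sum_k f_{l,k}(t) * s_k(t).

    signals: (K, T) Ambisonic beam signals, e.g. K = (N_B + 1)**2 = 16
    filters: (L_out, K, F) FIR filter matrix, e.g. L_out = 20 IKO drivers
    returns: (L_out, T + F - 1) loudspeaker signals
    """
    # Convolve every input channel with its filter row (broadcast over outputs),
    # then sum over the input channels.
    y = fftconvolve(filters, signals[None, :, :], axes=-1)
    return y.sum(axis=1)
```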
In the last stage, Dynamic Binaural Rendering of the Ambisonic scene is obtained by convolving the rotated Ambisonic signals with any state-of-the-art FIR binaural Ambisonic renderer. Here, we employ the time-invariant filters of the MagLS (magnitude-least-squares) renderer defined in [18,41] (part of the IEM plugin suite, https://plugins.iem.at/) to obtain high-quality Ambisonic rendering already at an order as low as N = 3. The perceptual quality improvement of these filters is achieved by a magnitude-least-squares optimization that disregards phase match in favor of an improved HRTF magnitude at high frequencies.
MagLS as outlined in [18,36] also includes an interaural covariance correction that offers an optimal compromise for consistently rendering diffuse fields.
All other processing stages are rather specific per auralization technique and are therefore described separately below.
Dummy head BRIR-based (Dy): The Directivity and the Room Transfer Function are inherent in the directly measured multiple-orientation BRIRs (MOBRIRs) between each loudspeaker of the IKO and the KU100 (https://en-de.neumann.com/ku-100) dummy head. Here, we used an orientation resolution of $\Delta\varphi = 15^\circ$ in the interval $\varphi = [-45^\circ, \dots, 45^\circ]$ to obtain the MOBRIRs; the data are available online (https://phaidra.kug.ac.at/o:104386). For the Dy technique, Dynamic Binaural Rendering is achieved by the linear interpolation with switched high-frequency phase (LISHPh) method described in [42]. In accordance with the findings in [22], setting the crossover to $f_\mathrm{c} = 2$ kHz, $\Delta\varphi = 15^\circ$, and L = 16 allows for high-quality BRIR-based binaural rendering; thus, this condition is used as the perceptual target in the study. The processing steps of this reference auralization are shown in the top row of the block diagram in Figure 3. Although the auralization quality (audio quality and spatial mapping) of the Dy technique is expected to be high, the measurement effort for the multiple orientations is considerable, and the specific dummy head HRIRs cannot be exchanged unless the multi-orientation measurements are repeated separately with other receivers, dummy heads, or individual subjects.
IKO to em32 MIMO RIR (Em): Here, the Directivity and the Room Transfer Function are represented by the measured array-domain MIMO RIRs between the 20 IKO loudspeakers and the 32 microphones of the em32 (https://mhacoustics.com/products). The resulting em32 signals are transformed into the Ambisonics domain using the state-of-the-art encoder presented in [36,40] and are finally rendered binaurally. An evaluation of this specific auralization technique can be found in [25], and the inherent processing stages are depicted in the second row of Figure 3. The underlying MIMO RIRs are accessible online (https://phaidra.kug.ac.at/o:104385). Measuring with the em32 or other higher-order compact spherical microphone arrays increases the hardware effort in terms of channel count, but it permits modular exchange of the receiver directivities or HRIRs and achieves a native higher-order resolution at the receiver side. Further resolution enhancement by HOSIRR [43] is conceivable but was not used here.
Multi ASDM RIRs (As): This approach employs the first-order tetrahedral ST450 microphone array at the receiver side for measuring the 20 × 4 (IKO to ST450) MIMO RIRs, which are available online (https://phaidra.kug.ac.at/o:104384). However, the MIMO RIRs are not used directly as the representation of the Directivity and Room Transfer Function. Instead, in a processing stage, the Ambisonic Spatial Decomposition Method (ASDM) [22] is applied to the responses of every transducer of the source array (here, the IKO), and the resulting upscaled ASDM RIRs are used for auralization. This permits a modular exchange of the receiver directivity or HRIRs while keeping the hardware effort at the receiver side minimal. Note that the multi ASDM method is a special case of the SRD ARIR approach, obtained by assuming a fixed directivity at the source (the individual loudspeaker) and setting N = 0 in Equation (5).
SRD ARIR and real IKO (Sr): The SRD ARIR method as proposed in Section 2 only requires first-order loudspeaker and microphone arrays for measuring the Room Transfer Function on the source and receiver sides, respectively. Thus, the SRD method is rather hardware-efficient, with a theoretical minimum of 4 channels for each of the source and the receiver. Here, we used our 6-channel Cubelet and the tetrahedral ST450 as source and receiver arrays, respectively. Note that the first-order RIR measurements (https://phaidra.kug.ac.at/o:104376) as well as high-resolution directivity measurements of the Cubelet are available online (https://phaidra.kug.ac.at/o:104374).
In a subsequent processing step, the resolution is upscaled from first order to any higher order, see Equation (8) and the detailed description in Section 2.2. As a result, both the source and receiver sides are modular and permit exchange with any directivity pattern. In the experiment, we inserted KU100 HRIRs with the 5th-order resolution of a MagLS decoder [18] at the receiver side, and the true measured directivity of the IKO at the source side. The Directivity of the 20 loudspeakers is represented using an order-N representation of the directional IRs from every loudspeaker to every microphone of a surrounding microphone array. We use IRs measured on an equiangular grid of 18 × 36 zenith and azimuth angles, respectively. With 648 sampling points on the sphere, we set N ≤ 17. The high-resolution directional IRs of the IKO are available online (https://phaidra.kug.ac.at/o:75316).
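One common way to obtain such an order-N representation from sampled directional IRs is a least-squares SH fit. The sketch below assumes ring positions of 5°, 15°, ..., 175° in zenith and 0°, 10°, ..., 350° in azimuth (the exact grid placement is our assumption), orthonormal real SH in ACN order, and hypothetical function names; it is not necessarily the transform used by the authors.

```python
import numpy as np
from math import factorial
from scipy.special import lpmv

def real_sh(order, azi, zen):
    """Orthonormal real SH in ACN order; returns ((order+1)**2, number of directions)."""
    azi, zen = np.atleast_1d(azi), np.atleast_1d(zen)
    Y = np.zeros(((order + 1) ** 2, azi.size))
    for n in range(order + 1):
        for m in range(-n, n + 1):
            am = abs(m)
            norm = np.sqrt((2 * n + 1) / (4 * np.pi) * factorial(n - am) / factorial(n + am))
            leg = (-1) ** am * lpmv(am, n, np.cos(zen))  # remove Condon-Shortley phase
            trig = (np.sqrt(2.0) * np.cos(m * azi) if m > 0 else
                    np.sqrt(2.0) * np.sin(am * azi) if m < 0 else np.ones_like(azi))
            Y[n * n + n + m] = norm * leg * trig
    return Y

def sh_fit(values, azi, zen, order):
    """Least-squares SH coefficients of data sampled at Q directions.
    values: (Q, ...), e.g. (Q, loudspeakers, taps); returns ((order+1)**2, ...)."""
    Y = real_sh(order, azi, zen).T                           # (Q, (N+1)^2)
    flat = values.reshape(values.shape[0], -1)
    coeffs, *_ = np.linalg.lstsq(Y, flat, rcond=None)
    return coeffs.reshape((Y.shape[1],) + values.shape[1:])

# Hypothetical equiangular grid: 18 zenith rings x 36 azimuths = 648 directions
zen_g, azi_g = np.meshgrid(np.deg2rad(5.0 + 10.0 * np.arange(18)),
                           np.deg2rad(10.0 * np.arange(36)), indexing='ij')
AZI, ZEN = azi_g.ravel(), zen_g.ravel()
```

With 36 azimuths, degrees up to $|m| = 17$ remain distinguishable, and 18 zenith rings support orders up to 17, which is consistent with the choice N ≤ 17 above.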
SRD ARIR and ideal 3rd-order directivity (Si): While the Room Transfer Function is also represented by an SRD ARIR (as for Sr), the source Directivity is here assumed to be an ideal 3rd-order directivity instead of the real IKO. Thus, the directivity is synthesized by multiplying the encoded signals with a frequency-independent diagonal matrix containing the max-$\boldsymbol{r}_\mathrm{E}$ weights [44,45] up to order $N_\mathrm{B}$.
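The diagonal max-$\boldsymbol{r}_\mathrm{E}$ weighting can be sketched with the widely used closed-form approximation $a_n = P_n\big(\cos\frac{137.9^\circ}{N + 1.51}\big)$ for 3D; whether [44,45] use exactly this approximation or numerically optimized weights is an assumption here. The sketch assumes ACN channel ordering, and the function names are hypothetical.

```python
import numpy as np
from scipy.special import eval_legendre

def max_re_weights(order):
    """Approximate 3D max-r_E order weights a_n = P_n(cos(137.9 deg / (N + 1.51)))."""
    x = np.cos(np.deg2rad(137.9 / (order + 1.51)))
    return np.array([eval_legendre(n, x) for n in range(order + 1)])

def apply_max_re(ambi, order):
    """Multiply ACN-ordered Ambisonic signals ((N+1)^2, T) by the diagonal weight matrix."""
    a = max_re_weights(order)
    per_channel = np.repeat(a, 2 * np.arange(order + 1) + 1)   # a_n repeated 2n+1 times
    return per_channel[:, None] * ambi

weights = max_re_weights(3)   # approx. [1.0, 0.861, 0.612, 0.304]
```

Because all $2n + 1$ channels of an order share the same weight, the weighting tapers the beam's side lobes while preserving its rotational symmetry about the steering direction.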

Design and Implementation
Measurements: The underlying measurements were made in the György Ligeti Saal ($V = 5630$ m³, $T_{60} = 1.4$ s) in Graz, Austria. Figure 4 shows a panoramic photo of the measurement setup, and Figure 5 the layout of source, receiver, and the four positions of the reflecting baffles (0.9 × 1.8 m). Source and receiver were aligned quasi-parallel to the shorter side walls of the room, faced each other, and were 4.2 m apart. The source-receiver distance approximately corresponds to the critical distance ($r_\mathrm{H} = 3.6$ m) when assuming an omnidirectional source, i.e., the distance at which direct and reverberant energies are about equal, which makes it a generally interesting test case. As test signals, we used interleaved exponential sine sweeps with a length of 4 s. The measured source and receiver configurations included (i) MOBRIRs (https://phaidra.kug.ac.at/o:104386) between the IKO and multiple dummy head orientations (measurements for Dy), (ii) MIMO RIRs (https://phaidra.kug.ac.at/o:104385) between the IKO and the em32 (measurements for Em), (iii) MIMO RIRs (https://phaidra.kug.ac.at/o:104384) between the IKO and the ST450 (measurements for As), and (iv) MIMO RIRs (https://phaidra.kug.ac.at/o:104376) between the Cubelet and the ST450 (measurements for Sr and Si).
[Figure 5: Layout of source, receiver, and baffle positions; axes x and y in meters.]
Implementation: For the sake of reproducibility and to circumvent room divergence [46][47][48], i.e., the violation of acoustic expectations arising from the environment in which one listens via headphones, the entire scenery was modeled in virtual reality. To deliver graphics as realistically as possible, the room was modeled based on building plans and photogrammetry. Control buttons and labels were added to the virtual environment to give the participants control over the progression of the experimental trials and means to comparatively rate their auditory localization under the various conditions. A screenshot of the user interface is depicted in Figure 6.
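As a plausibility check of the stated critical distance, the classical Sabine-based estimate $r_\mathrm{H} \approx 0.057\sqrt{QV/T_{60}}$ (directivity factor $Q = 1$ for an omnidirectional source) can be evaluated with the room data above. This is the textbook approximation, not necessarily the formula used by the authors.

```python
import numpy as np

def critical_distance(volume_m3, t60_s, q=1.0):
    """Sabine-based critical distance r_H ~= 0.057 * sqrt(Q V / T60) in meters."""
    return 0.057 * np.sqrt(q * volume_m3 / t60_s)

r_h = critical_distance(5630.0, 1.4)   # omnidirectional source, Q = 1
print(f"r_H = {r_h:.2f} m")            # prints "r_H = 3.61 m"
```

A directional source pointing at the listener raises $Q$ and thus pushes the critical distance outward by $\sqrt{Q}$, which is one way to see why beam steering alters the direct-to-reverberant ratio at the fixed 4.2 m listening position.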
In addition to the typical playback and save/proceed (upward-facing arrow) buttons, we used five colored squares and the corresponding spheres for controlling the experiment. Depending on the tested multistimulus set, the colored squares correspond either to all auralization techniques for a fixed beam direction or to all beam directions for a fixed auralization technique. As the VR game engine we used Unity (https://unity.com/), and the experimental game was played using the HTC VIVE, i.e., a system comprising a head-mounted display (HMD), controllers, and tracking. The tracking data, i.e., head rotations, from the HMD were sent via OSC [49] to Reaper (https://www.reaper.fm), where the audio processing was implemented.
While Rakerd and Hartmann [50] stated that short onsets and transient signals generally simplify localization, Wendt et al. [7] found that such signals are localized significantly closer to the IKO. To create a large scenery of perceivable auditory objects distributed across various locations away from the IKO, Wendt et al. [7] recommend using signals with slow onsets. For conditions with clear effects, we therefore used a 1.5 s long pink noise burst with fade-in and fade-out times of 500 ms (linear fades) and 500 ms of silence at the end.
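A stimulus with these properties can be sketched as follows. The pink spectrum is obtained here by shaping white noise with $1/\sqrt{f}$ in the FFT domain, peak normalization is our choice, and the sampling rate of 48 kHz is an assumption; the text does not state how the original burst was generated.

```python
import numpy as np

def pink_noise_burst(fs=48000, dur=1.5, fade=0.5, silence=0.5, seed=0):
    """Pink-noise burst with linear fade-in/out and trailing silence (durations in s)."""
    rng = np.random.default_rng(seed)
    n = int(round(dur * fs))
    X = np.fft.rfft(rng.standard_normal(n))
    f = np.fft.rfftfreq(n, 1.0 / fs)
    X[1:] /= np.sqrt(f[1:])            # 1/f power spectrum, i.e. -3 dB per octave
    X[0] = 0.0                         # remove DC
    x = np.fft.irfft(X, n=n)
    x /= np.max(np.abs(x))             # peak-normalize
    nf = int(round(fade * fs))
    ramp = np.linspace(0.0, 1.0, nf)
    x[:nf] *= ramp                     # linear fade-in
    x[-nf:] *= ramp[::-1]              # linear fade-out
    return np.concatenate([x, np.zeros(int(round(silence * fs)))])
```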
For encoding and multichannel convolution with IKO control filters, directivities, and RIRs, we used the mcfx (http://www.matthiaskronlachner.com/?p=1910) plug-ins, and as the binaural renderer of the Ambisonic signals we used the BinauralDecoder (https://plugins.iem.at/) [18,41]. The ear signals were played back via headphones (AKG 702) connected to an external audio interface (RME MadiFace & RME FireFace UCX). Note that an orientation mismatch of < 5° between the different arrays used for measuring the RIRs (cf. the IKO vs. the Cubelet, and the ST450 vs. the em32) can hardly be avoided. Thus, the authors perceptually aligned the auralization techniques for $\varphi_\mathrm{S} = 180^\circ$.
During informal listening (by four participants), we found that all auralization techniques under test exhibit a high overall sound quality (no artifacts or temporal smearing). However, the overall timbre varies slightly across the techniques, as we employed technique-specific measurement hardware (e.g., Cubelet vs. IKO). While a global and steering-direction-independent equalization was not feasible, the techniques were perceptually equalized using a parametric multi-band EQ for a fixed steering direction $\varphi_\mathrm{S} = 180^\circ$ (pointing to the listener).
Input Method: During the experiment, participants were asked to indicate the position (i.e., direction and distance relative to the listener) of the perceived sound by following a certain procedure: (i) point to a colored square to select a stimulus for looped playback, (ii) pick up the correspondingly colored ball by pointing towards it and pressing the trigger, (iii) with the trigger pressed, point to the perceived direction and adjust the distance by moving the thumb on the controller track pad, (iv) release the trigger to drop the ball at the intended position, (v) proceed until all balls are positioned, then save the responses and proceed to the next multistimulus set. Participants were allowed to reposition any ball as often as desired until the responses of the entire multistimulus set were logged.
Design: The experiment consisted of 12 multistimulus sets, of which the first one was part of a training and familiarization phase. In the following 11 sets, participants were asked to rate 5 stimuli per set. These 5 stimuli consisted either of all five auralization techniques for a fixed beam direction (6 sets), or of all beam directions except −45° for a fixed auralization technique (5 sets). (To keep only 5 stimuli per set, sets with a fixed auralization technique did not contain the −45° beam direction; for this reason, results obtained for the −45° beam direction were not used in the statistical analysis.) Both the order of the sets and the assignment of stimuli to colored squares within a set were randomized. The 13 participants (normal hearing, all male, aged between 24 and 52) were asked to repeat the experiment in order to provide a second response per set. Correspondingly, 12 of the 13 participants (one did not repeat the experiment) evaluated 11 · 5 · 2 = 110 stimuli.

Results and Discussion
The positions of the perceived sound objects are given in the Cartesian coordinate system with the listener at the origin. As results show little to no variation in height (z coordinate), we focus on an evaluation of the x, y coordinates.
Overall inspection in two dimensions: In a first processing step, outliers are defined as responses lying outside a Mahalanobis distance (in estimated standard deviations) of three within a preliminary, non-robust analysis. After removing the outliers, we use bivariate statistical analysis to estimate the means, their standard deviations, and their 95% confidence regions according to Hotelling's $T^2$ distribution (see [51], Ch. 3). The results of this analysis are depicted in Figure 7, where data points, outliers, standard deviation ellipses, and confidence region ellipses are indicated as dots, crosses, unfilled ellipses, and filled ellipses, respectively. If the statistical spreads are of similar size, statistically significant differences may be inspected by observing whether the mean value of one condition lies outside the 95% ($p < 0.05$) confidence ellipses of the other conditions.
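The two steps described above, Mahalanobis-based outlier removal and the Hotelling $T^2$ confidence region for the mean, can be sketched as follows. This is a minimal illustration assuming the responses of one condition are given as an (n, 2) array of x, y positions; the function names are hypothetical, and the authors' MATLAB analysis may differ in detail.

```python
import numpy as np
from scipy.stats import f as f_dist

def remove_outliers(xy, max_dist=3.0):
    """Drop responses whose Mahalanobis distance to the non-robust mean exceeds max_dist.
    xy: (n, 2) positions; returns kept points and a boolean outlier mask."""
    d = xy - xy.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(xy, rowvar=False))
    md = np.sqrt(np.einsum('ij,jk,ik->i', d, S_inv, d))
    outlier = md > max_dist
    return xy[~outlier], outlier

def mean_confidence_region(xy, alpha=0.05):
    """Hotelling T^2 region for the mean: (m - mean)^T S^-1 (m - mean) <= c2.
    Returns the sample mean, covariance S, and the squared radius c2."""
    n, p = xy.shape
    S = np.cov(xy, rowvar=False)
    c2 = p * (n - 1) / (n * (n - p)) * f_dist.ppf(1.0 - alpha, p, n - p)
    return xy.mean(axis=0), S, c2
```

The filled confidence ellipse then has semi-axes $\sqrt{c_2\lambda_i}$ along the eigenvectors of $S$ with eigenvalues $\lambda_i$, while the unfilled standard deviation ellipse uses $\sqrt{\lambda_i}$.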
While each row of Figure 7 depicts the results of all auralization techniques for a certain beam direction $\varphi_\mathrm{S}$, each column shows the perceived positions of the auditory events per technique for all beam directions $\varphi_\mathrm{S} = [0^\circ, -36^\circ, -82^\circ, 180^\circ, 90^\circ]$. Thus, comparison within the rows is used to identify differences across auralization techniques, and comparison within each column indicates whether each auralization technique is able to reproduce the well-described perceptual effects of the IKO [7] or similar devices. These effects are explained by exciting pronounced propagation paths and dimming the direct path, which is known to evoke auditory events whose positions need not coincide with the physical source. Moreover, Wendt et al. [5] showed that the IKO's directivity allows for altering the DRR and thus for controlling the perceived distance, e.g., by steering the beam towards or away from the listener.
Overall, we observe that all techniques are qualitatively able to reproduce the perceptual effects known from studies involving the physical IKO [5,7], cf. the columns in Figure 7. The ratings show clear agreement with the expected positions of the auditory events, i.e., steering a beam towards a reflector, $\varphi_\mathrm{S} = [-36^\circ, -82^\circ, 90^\circ]$, places the auditory event near the respective reflector baffle. Moreover, steering the beam away from the listener ($\varphi_\mathrm{S} = 0^\circ$) or towards the listener ($\varphi_\mathrm{S} = -180^\circ$) evokes an auditory event behind the physical IKO or a very close one, respectively. We found significant differences between the ratings per beam direction $\varphi_\mathrm{S}$ for all auralization techniques.
A detailed analysis of the differences related to the auralization techniques (cf. the rows in Figure 7) is done for independent univariate attributes. These univariate attributes were not asked for separately in the listening experiment; instead, they are obtained for the subsequent analysis by mapping the responses to the following independent attributes: (i) localizability, (ii) direction, and (iii) distance. This analysis is based on the following considerations.
As defined by Lindau et al. [52], localizability is related to the ability to assess the spatial extent and location of a sound source. If this task is difficult, the localizability is low; if localizability is high, a sound source is clearly delimited. Moreover, localizability is often associated with the perceived extent of a sound source, and thus we assume that the area of the standard deviation ellipse can be used as an indication of localizability.
The two-dimensional source position indications yield a clear bivariate distribution, cf. Figure 7. With a mean angular offset of only 3.52° between the main axis of the standard deviation ellipse and the mean perceived direction ϕ_p (defined by the listener and the mean position of the perceived sound), we may assume the variations to be independent along the perceptual axes of distance and direction.
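The angular offset between the ellipse's main axis and ϕ_p can be sketched as follows, assuming the listener sits at the origin of the x/y coordinate system. Since an ellipse axis has no sign, the difference is folded into the range 0° to 90°; the function name `axis_direction_offset_deg` and this folding convention are assumptions for illustration.

```python
import numpy as np

def axis_direction_offset_deg(points, listener=(0.0, 0.0)):
    """Angle (deg, 0..90) between the major axis of the standard deviation
    ellipse and the mean perceived direction phi_p seen from the listener."""
    pts = np.asarray(points, dtype=float)
    mean = pts.mean(axis=0)
    d = mean - np.asarray(listener, dtype=float)
    phi_p = np.arctan2(d[1], d[0])                 # mean perceived direction
    _, vecs = np.linalg.eigh(np.cov(pts, rowvar=False))
    major = vecs[:, 1]                             # major-axis direction
    theta = np.arctan2(major[1], major[0])
    # An axis is undirected, so compare modulo 180 deg and take the acute angle.
    diff = np.mod(theta - phi_p, np.pi)
    return float(np.degrees(min(diff, np.pi - diff)))
```

Averaging this value over all conditions would give a single mean offset comparable to the 3.52° reported above; small values support treating distance and direction as independent axes.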
As this visual evaluation may be difficult, we use a Wilcoxon signed-rank test [53] with a Bonferroni-Holm correction [54] to determine p-values of pairwise comparisons between test conditions, and define p < 0.05 as significantly different throughout this article. We employ nonparametric statistics as we do not assume a normal distribution of ratings. Due to the correction (David Groppe, 2020, Bonferroni-Holm Correction for Multiple Comparisons, https://www.mathworks.com/matlabcentral/fileexchange/28303-bonferroni-holmcorrection-for-multiple-comparisons), adjusted p-values can exceed the usual range [0, 1]; thus, p > 1 is valid. The Matlab script of the statistical analysis and the raw listener ratings are available online in the accompanying project (https://phaidra.kug.ac.at/o:104416). A detailed discussion of the results in terms of localizability, direction, and distance is given below.

Localizability: The median values and 95% confidence intervals of the area of the standard deviation ellipse, pooled for all beam directions ϕ_S, are depicted per auralization technique in Figure 8. While the median values indicate the highest and lowest localizability for the Dy and Em techniques, respectively, the differences among all techniques are not significant (p > 1.7).

Direction: All ratings are transformed into a polar coordinate system with the listener at the center and are analyzed separately for azimuth and radius, i.e., direction and distance. Based on the findings in [22], we assume that the Dy auralization can serve as the reference condition. Thus, p-values are given for pairwise tests between Dy and each of the other conditions. The median values and 95% confidence intervals for all beam directions and auralization techniques are shown in Figure 9, and the p-values between the reference condition (Dy) and the other techniques are presented in Table 1. Despite an almost symmetrical set-up (cf. Figure 5) and only a slight change in beam direction, the direction results for the beams steered towards ϕ_S = −82° and ϕ_S = 90° show some deviation. Ratings for ϕ_S = 90° are less consistent (larger confidence interval), and the auditory event is localized closer to the IKO for all techniques when compared to ϕ_S = −82°. This can be explained by a closer look at the (ideal) 3rd-order max-r_E-weighted beam pattern depicted in Figure 5: while for ϕ_S = 90° a side lobe points towards the listener, this side lobe is almost avoided (5 dB lower) for ϕ_S = −82°.
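The side-lobe argument can be checked numerically for an ideal, axisymmetric 3rd-order max-r_E beam. The sketch below assumes the widely used approximation of the 3D max-r_E weights, a_n = P_n(cos(137.9°/(N + 1.51))), and evaluates the pattern level relative to the main-lobe maximum at an angle θ from the beam axis. It describes the idealized pattern only, not the measured IKO directivity, and the function names are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre

def max_re_weights(order):
    """Approximate 3D max-rE weights a_n = P_n(cos(137.9 deg / (N + 1.51)))."""
    x = np.cos(np.radians(137.9 / (order + 1.51)))
    # legval with a unit coefficient vector evaluates a single Legendre polynomial P_n(x)
    return np.array([legendre.legval(x, np.eye(order + 1)[n]) for n in range(order + 1)])

def beam_level_db(theta_deg, order=3):
    """Level (dB re. main-lobe maximum) of an ideal axisymmetric max-rE beam
    at angle theta_deg from the beam axis."""
    n = np.arange(order + 1)
    coeffs = (2 * n + 1) * max_re_weights(order)   # sum_n (2n+1) a_n P_n(cos theta)
    g = legendre.legval(np.cos(np.radians(theta_deg)), coeffs)
    g0 = legendre.legval(1.0, coeffs)              # on-axis value (maximum)
    return 20.0 * np.log10(np.abs(g) / g0)
```

Scanning `beam_level_db` over 0° to 180° shows where side lobes of the ideal pattern lie; comparing the levels at the two angles between beam axis and listener direction for ϕ_S = 90° and ϕ_S = −82° would reproduce the kind of difference discussed above, given the actual geometry of Figure 5.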
Overall, there is no significant difference between the directional mapping of the Dy and Sr techniques. For all other techniques, we found significant differences for some beam directions. The Em-based auralization was the least consistent and produced the smallest lateralization for ϕ_S = −82° and ϕ_S = 90° compared to the other techniques. This is particularly pronounced for ϕ_S = 90°, where there is more direct sound.

Distance: The median values and 95% confidence intervals for all beam directions and auralization techniques are shown in Figure 10, and the p-values between the reference condition (Dy) and the other techniques are presented in Table 2. We found that the congruence of the distance mapping is high for all tested auralization techniques, almost independently of the specific beam direction ϕ_S. The only exception is the Em-based auralization, for which the source is perceived significantly further away at ϕ_S = 180°.

Discussion: The five different auralization techniques used to virtualize the IKO all involve measurements of RIRs. The results of the listening experiment verify that all tested techniques are able to qualitatively reproduce the perceptual effects known from studies involving the physical IKO [5,7], cf. ellipses in Figure 7. Nevertheless, a detailed analysis of the perceptual attributes direction and distance indicates some significant differences from the reference condition (dummy-head-based rendering, Dy) for specific combinations of technique and beam direction, with some noticeable trends.
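The p-values in Tables 1 and 2 come from Wilcoxon signed-rank tests (available, e.g., as `scipy.stats.wilcoxon(x, y)` in SciPy) followed by the Bonferroni-Holm correction. The sketch below shows a Holm step-down adjustment in plain Python. Leaving the adjusted values uncapped is one way in which p > 1 can arise, as noted in the text; the MATLAB script used in the article may differ in detail, and the function name `holm_adjust` is hypothetical.

```python
def holm_adjust(pvals, cap=False):
    """Holm step-down adjusted p-values.

    With cap=False, values are not clipped at 1, so adjusted p > 1 can occur.
    Significance at alpha is then declared where adjusted p < alpha.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        value = (m - rank) * pvals[idx]                # multiply k-th smallest by m - k + 1
        running_max = max(running_max, value)          # enforce monotone step-down order
        adjusted[idx] = min(running_max, 1.0) if cap else running_max
    return adjusted
```

For example, two comparisons with raw p-values 0.6 and 0.7 both receive an uncapped adjusted value of 1.2, which is why values above 1 simply indicate clearly non-significant differences.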
In order to indicate the directional mapping quality of all tested auralization techniques, the mean direction offset to the Dy technique is listed in Table 3. Overall, the Em technique yields the largest incongruence in directional mapping (5.72°), while the errors of As, Sr, and Si stay roughly below 2°. The mean distance offset between Dy and all other techniques, pooled for all beam directions ϕ_S, is given in Table 4. With a mean distance error of 0.28 m, the As and Sr auralizations clearly outperform the Em and Si techniques. Although there is no clear perceptual winner, the As and Sr approaches still appear to match the Dy reference best. We assume that using the first-order source (Cubelet) enhances flexibility, as its frequency range for directivity synthesis is larger than that of the IKO because of its smaller size. Its upscaling permits a modular exchange of the directivity for arbitrary artificial or measured higher-order directivity patterns. On the receiver side, measurements with the dummy head (Dy) are not modular, as they do not permit exchanging the receiver directivity for other HRIRs, whereas measurements with the em32 and ST450 microphone arrays do. The em32 reaches a native resolution up to the 4th order, whereas the 1st-order ST450 is less demanding in terms of channel count; it has no satisfactory native resolution but can be upscaled to higher orders.
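One plausible way to obtain such pooled offsets is sketched below: position indications are converted to azimuth and distance relative to a listener at the origin, medians are taken per technique and beam direction, and absolute differences to Dy are averaged over beam directions. The exact definition behind Tables 3 and 4 is not given here, so the use of medians, the listener at (0, 0), the assumption that azimuths do not straddle ±180°, and all function names are assumptions.

```python
import numpy as np

def to_polar(xy, listener=(0.0, 0.0)):
    """Azimuth (deg) and distance (m) of (K, 2) x/y indications re. the listener."""
    d = np.asarray(xy, dtype=float) - np.asarray(listener, dtype=float)
    return np.degrees(np.arctan2(d[:, 1], d[:, 0])), np.hypot(d[:, 0], d[:, 1])

def wrap_deg(angle):
    """Wrap angle(s) in degrees to [-180, 180)."""
    return (np.asarray(angle, dtype=float) + 180.0) % 360.0 - 180.0

def mean_offsets(ratings, reference="Dy"):
    """ratings[technique][beam] -> (K, 2) array of indications.

    Returns {technique: (mean direction offset in deg, mean distance offset in m)}
    relative to the reference technique, based on per-beam medians.
    """
    medians = {}
    for tech, beams in ratings.items():
        medians[tech] = {}
        for beam, xy in beams.items():
            az, dist = to_polar(xy)
            medians[tech][beam] = (np.median(az), np.median(dist))
    ref = medians[reference]
    result = {}
    for tech, per_beam in medians.items():
        if tech == reference:
            continue
        dir_off = [abs(wrap_deg(per_beam[b][0] - ref[b][0])) for b in ref]
        dist_off = [abs(per_beam[b][1] - ref[b][1]) for b in ref]
        result[tech] = (float(np.mean(dir_off)), float(np.mean(dist_off)))
    return result
```

Wrapping the azimuth difference avoids spurious offsets of nearly 360° when two medians lie on opposite sides of the ±180° boundary.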
Our impression is that auralization involving a first-order receiver and either the highly directional source or its first-order replacement source tends to work most reliably. Essentially, the results allow us to recommend the SRD ARIR (Sr) model and processing method for its high degree of modularity and its reduction of measurement hardware and effort. It reaches perceptual qualities comparable to rendering based on dummy-head measurements (Dy), while higher-order directional sources can be interchangeably interfaced with the processed and upscaled SRD ARIRs. Note that the particular prototype employed as first-order measurement source (Cubelet) in our study is not necessarily powerful enough for every application, for instance, when the signal-to-noise ratio is low because of background noise. In such cases, stronger alternatives could be considered [55].

Conclusions
In this contribution, we presented the concept and a comparative perceptual evaluation of a source-and-receiver-directional ARIR capture and processing approach (SRD ARIR) against a variety of technical alternatives. Although the proposed SRD ARIR rendering method only employs a small set of first-order directivities (omnidirectional and figure-of-eight patterns aligned with x, y, and z) in the measurement, the approach produced auralizations of higher-order directional source and receiver configurations that performed well in the comparison. Its directional resolution enhancement involves the Ambisonic spatial decomposition method (ASDM), which we extended to both sides, source and receiver, of the measured ARIRs.
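The core idea of ASDM-style upscaling can be illustrated on one side only: a direction is estimated per sample from the first-order response, and the omnidirectional response is re-encoded at that direction up to a higher order. The sketch below is a deliberately simplified, horizontal-only (circular-harmonic) version for the receiver side. It assumes figure-of-eight signals scaled so that a plane wave from azimuth φ gives x = w·cos φ and y = w·sin φ, uses a moving-average-smoothed pseudo-intensity vector for the direction estimate, and omits the spectral equalization and the source-side extension of the actual method; `asdm_upscale_2d` and its parameters are hypothetical.

```python
import numpy as np

def asdm_upscale_2d(w, x, y, order=3, smooth=16):
    """Horizontal-only ASDM-style upscaling of a first-order RIR.

    Returns (circular-harmonic RIR of shape (2*order+1, T), per-sample azimuth in rad).
    """
    w, x, y = (np.asarray(s, dtype=float) for s in (w, x, y))
    kernel = np.ones(smooth) / smooth
    # Smoothed pseudo-intensity vector (omni times figure-of-eight signals).
    ix = np.convolve(w * x, kernel, mode="same")
    iy = np.convolve(w * y, kernel, mode="same")
    phi = np.arctan2(iy, ix)                  # per-sample direction estimate
    channels = [w]                            # order 0: omnidirectional part
    for m in range(1, order + 1):
        # Re-encode the omni signal at the estimated direction (orders 1..N).
        channels.append(w * np.cos(m * phi))
        channels.append(w * np.sin(m * phi))
    return np.vstack(channels), phi
```

For a single plane wave, the estimated direction is exact and the first-order channels of the output reproduce the input dipole signals; in real rooms, the per-sample direction estimate is what lends the upscaled response its higher-order sharpness.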
In the dynamic headphone-rendering-based evaluation, we employed the highly directional icosahedral loudspeaker (IKO) as a virtual source because of its well-described measured directivity and the well-studied perceptual effects it causes. For the sake of reproducibility and to support optimal externalization, the auralization listening experiment was conducted within a head-mounted-display visualization of the virtual environment. Interactive visual objects were used to indicate auditory event locations in space.
The proposed SRD ARIR method performed as accurately as the reference auralization based on multiple-orientation binaural room impulse responses (MOBRIRs): we found no significant difference for the perceptual attributes of localizability, direction, and distance. Although most of the alternative techniques performed comparably to the reference auralization, the SRD ARIR technique has benefits in terms of modularity and efficiency: it only requires a small number of hardware channels, and SRD ARIRs offer a generic interface between RIRs and source and receiver directivities. Any application requiring a flexible exchange of directivities can potentially benefit from the small number of responses needed to characterize the room (be it measured or simulated).
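The "small number of responses" can be made concrete by counting spherical-harmonic channels, assuming a full 3D representation with (N + 1)² channels per side. The sketch compares a first-order source to a first-order receiver with, for example, a 3rd-order source to a 4th-order receiver; raw transducer counts of the physical devices may differ, and the function names are hypothetical.

```python
def sh_channel_count(order):
    """Number of 3D spherical-harmonic channels up to the given order."""
    return (order + 1) ** 2

def srd_response_count(source_order, receiver_order):
    """SH-domain responses of a source-and-receiver-directional RIR."""
    return sh_channel_count(source_order) * sh_channel_count(receiver_order)

print(srd_response_count(1, 1))  # → 16 (first-order source to first-order receiver)
print(srd_response_count(3, 4))  # → 400 (3rd-order source to 4th-order receiver)
```

The 25-fold difference illustrates why a first-order characterization of the room, upscaled afterwards, is attractive for both measurement and simulation.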
A collection of room responses measured for the study, the responses of the listening experiment, the statistical analysis, and high-resolution directivities of the arrays are made available online (https://phaidra.kug.ac.at/o:104417).

Funding: The major part of the work was carried out within the OSIL project AR 328-G21, funded by the Austrian Science Fund (FWF). Open Access Funding by the Austrian Science Fund (FWF).