Diffuse-Field Equalisation of Binaural Ambisonic Rendering

: Ambisonics has enjoyed a recent resurgence in popularity due to virtual reality applications. Low order Ambisonic reproduction is inherently inaccurate at high frequencies, which causes poor timbre and height localisation. Diffuse-Field Equalisation (DFE), the theory of removing direction-independent frequency response, is applied to binaural (over headphones) Ambisonic rendering to address high-frequency reproduction. DFE of Ambisonics is evaluated by comparing binaural Ambisonic rendering to direct convolution via head-related impulse responses (HRIRs) in three ways: spectral difference, predicted sagittal plane localisation and perceptual listening tests on timbre. Results show DFE successfully improves frequency reproduction of binaural Ambisonic rendering for the majority of sound source locations, as well as the limitations of the technique, and set the basis for further research in the ﬁeld.


Introduction
Ambisonics is a three-dimensional spatial audio approach based on the spatial sampling and reconstruction of a sound field using spherical harmonics, first introduced by Michael Gerzon in the 1970s [1,2]. Ambisonic sound fields can be recorded using multiple capsule microphone arrays or encoded using point sources, and reproduced over multiple loudspeaker arrays or over headphones. In Ambisonic playback, assuming free-field reproduction conditions and plane wave propagation, the reproduced sound field is accurate only for the region of a head in the centre of the loudspeaker array, for a frequency range up to the spatial aliasing frequency f alias . This is due to the inherent limited spatial accuracy of recording and reproduction of a physical sound field with a finite number of transducers.
Above f alias , timbral inconsistencies exist due to comb filtering from the summation of coherent loudspeaker signals with multiple delay paths to the ears, which are not situated in the exact centre of the loudspeaker array. In practice this reduces the accuracy of reproduced timbre and height cues. By increasing the order of Ambisonics, which requires more microphones and encoded channels in production and storage and more loudspeakers in reproduction, f alias rises improving both localisation and timbre, though for all practical Ambisonic rendering systems at present f alias is still very much within the human hearing range.
Horizontal localisation accuracy can be improved above f alias by employing different decoding methods for high frequencies [3], though this does not improve timbre. As accurate timbral reproduction has been shown to be more important than localisation when measuring the authenticity of a spatial audio experience [4,5], it is essential to address timbre above f alias in Ambisonic reproduction.
The traditional method for rendering Ambisonic sound fields over headphones is to decode the sound field to a specified loudspeaker configuration (as in loudspeaker rendering), and then take the sum of a convolution of the loudspeaker signals with head-related impulse responses (HRIRs) corresponding to each loudspeaker's position [6,7] (often referred to as the virtual loudspeaker approach). Recent methods for binaural rendering of Ambisonics have moved away from the virtual loudspeaker approach and instead focused on order truncation of a spatially continuous spherical harmonic (SH) represented head-related transfer functions (HRTF) data set [8][9][10], which through pre-processing techniques such as equalisation and time-alignment can produce improvements in spectral response at high frequencies [11][12][13]. However, this requires a highly dense dataset of HRTFs capable of artefact free Ambisonic rendering (>2700 to render 35th-order Ambisonics [10]) and is therefore considered impractical and infeasible for individualisation at present, despite techniques such as reciprocity [14] and multiple swept sine [15] offering faster measurement times. Therefore, this paper focuses on virtual loudspeaker rendering of Ambisonic signals, though with pre-encoding of the virtual loudspeaker HRTFs into the SH domain, which allows for dual-band decoding and loudspeaker configurations with more loudspeakers than SH channels both at no added computational cost.
This paper presents a novel method for improving high-frequency reproduction in binaural Ambisonic rendering, without increasing the order of Ambisonics or real-time computational cost. Diffuse-Field Equalisation (DFE) is the removal of the direction-independent aspect of a set of frequency responses measured at all directions on a sphere. By applying the technique of DFE to binaural Ambisonic rendering, high-frequency reproduction can be improved and therefore a more realistic spatial audio experience can be conveyed to the listener.
In a preliminary study, DFE was shown to improve timbre between binaural Ambisonic rendering and standard binaural rendering (HRIR convolution) for 1st-order Ambisonics [16]. This paper expands the investigation to higher order Ambisonics. The methodology of diffuse-field equalisation is explored to determine which quadrature method to use and the optimal number of measurements necessary to produce a sufficient approximation. DFE of Ambisonics is evaluated in three ways; spectral difference over the sphere, predicted sagittal plane localisation and perceptual listening tests. Three orders of Ambisonics: M = 1, 3 and 5 are investigated.

Encoding
A sound source can be encoded into Ambisonic format for a given location of azimuth θ and elevation φ with Ambisonic order M through multiplication with three-dimensional full normalised (N3D) SH functions Y, defined as where σ = ±1, P mn (sin φ) are the associated Legendre functions of order m and degree n, and δ n,0 is the Kronecker delta function, The total number of SH channels in an Ambisonic encoded sound field is calculated as K = (M + 1) 2 .

Decoding
The amount of loudspeakers L required to reproduce the sound field is determined by L ≥ K, though in the case of L = K, which produces the least errors in spatial reproduction, audible timbral artefacts occur when a sound is panned to the exact location of a loudspeaker [17,18]. All loudspeaker configurations used in this paper therefore conformed to the rule L > K. The three loudspeaker configurations used in this study for orders M = 1, 3 and 5 were Lebedev grid [19] arrangements with L = 6, 26 and 50 loudspeakers respectively. Lebedev grids are quadratures well suited for practical Ambisonic playback due to their near-exact orthonormal capabilities [20] and nesting of the L = 6 and L = 26 in the L = 50 configuration. To test the regularity of the loudspeaker configurations for SH sampling, orthonormality error E O was calculated as where I K denotes the K × K identity matrix, C is the re-encoding matrix and transposition is notated by a superscript T [18,21]. The layout of loudspeakers and E O matrix plots of the three Lebedev configurations are presented in Figure 1 for Ambisonic orders up to M + 1, which show the regularity for the 6 pt. and near-regularity of the 26 pt. and 50 pt. configurations. Separate decode matrices were calculated for low and high frequencies for optimal horizontal localisation. Pseudo-inverse mode-matching decoding produces the closest approximation of the original sound field for near-regular loudspeaker arrangements with a non-square re-encoding matrix [22]. For frequencies up to f alias therefore, a pseudo-inverse decoding method with basic channel weighting was used.
Above f alias , where the sound field is inadequately reconstructed, pseudo-inverse decoding with Max r E channel weighting was used, which aims to reproduce the energy vector r E = 1 for all directions [18,22,23]. The gains g m to be applied to k such to maximise r E are found from differentiation of r E with respect to g m ( [18], p. 312), such that This recurrence equation can then be rewritten using Legendre polynomials to match Bonnets' Recursion Formula [24], if the rules g −1 = 0 and g M+1 = 0 and therefore g 0 = 1 and g 1 = r E are followed, such that η = r E and g m = P m (r E ), for orders m = 0, 1, ..., M: where r E is the largest root of P M+1 : Applying these weights to the Ambisonic channels in the decode process attempts to produce a constant value of the energy vector at all locations on the sphere, which improves high-frequency localisation and interaural level difference (ILD) reproduction [3]. However, in practice Max r E weighting reduces the overall amplitude of the decode; a trend that becomes more pronounced as the Ambisonic order increases. This reduction in amplitude is not uniform across the frequency spectrum but rather focused on high frequencies (see Figure 2). The high-frequency attenuation caused by Max r E weights was hence negated through amplitude compensation [22,23] prior to the dual-band crossover. The values for each order were calculated from the root-mean-square (RMS) of the Max r E channel weights g m ( [18], p. 183). Table 1 presents the RMS Max r E channel weights g mRMS for Ambisonic orders 1 to 5.   Binaural decoders were computed by encoding the HRTFs of each loudspeaker configuration into a continuous SH representation by a summation of the multiplication of each channel of the decoding matrix with the corresponding HRTF such that where H is the HRTF and D is the decode matrix. This was repeated for both Basic and Max r E weighted decoders. HRIRs in this study were from the Neumann KU 100 dummy head [25].
Crossover filters between the low-and-high frequency decoders were phase matched to avoid unwanted destructive interference around the crossover frequency [22,26]. Whereas in loudspeaker reproduction, the transition between low and high frequencies should be implemented gradually [26,27], when rendering Ambisonics binaurally the crossover can be implemented more abruptly and at a higher frequency due to the head's fixed position inside the virtual loudspeaker array. Crossover frequencies were calculated from the integrated D-error, which is the error between a wave field constructed from a single plane wave and an Ambisonically reconstructed plane wave [18,28,29]. For Ambisonic orders 1 to 5, dual-band decoder crossover frequencies are presented in Table 2. Table 2. Dual-band decode crossover frequencies producing 20% integrated D-error for Ambisonic orders 1 to 5, according to [23]. The two binaural decoders were high and low-passed using the above filters and combined. An advantage of using an SH HRTF format binaural decoder is that the decoding matrix takes length K instead of L, which removes the increased computation of additional convolutions in the case L > K.
Binaural rendering A is then achieved by convolution of the encoded Ambisonic format signals with the binaural decoder: where B K is the encoded Ambisonic signal, D SH K is the dual-band binaural decoder and * denotes convolution.

Diffuse-Field Response Calculation
The diffuse-field response of the SH binaural Ambisonic decoder D SH k can be calculated from a sum of the RMS of the SH channels, a process referred to as numerical integration. However, in this study an alternative approach was taken through spatial sampling of the sphere (which was found to give equivalent results), so that further developments could be explored, such as directional bias in the diffuse-field response for localised spectral improvements [30].
The Ambisonic diffuse-field response A diff , calculated from the RMS of Q measurements, was approximated (separately for each ear) by: where A q is a binaural Ambisonic render and Ω q is the relative solid angle weight of the virtual sound source position, which is calculated as the area subtended by its position on the sphere. In order to determine the optimal quadrature method and number of points necessary to produce an adequate approximate diffuse-field response, four quadrature methods were investigated with varying number of points by rendering approximate diffuse-field responses.
Four quadrature methods were investigated for the distribution of points on a sphere: the Lebedev grid [19], Icosahedron division [31], Fibonacci spiral [32] and spherical T-design [33]. Voronoi sphere plots of the quadrature methods with a similar number of points are shown in Figure 3 to compare the regularity of the quadrature methods. The plots show T-design quadrature produces the highest regularity of the four methods. Simulated diffuse-field responses using the four quadrature methods differed by up to ± 1.5 dB at 1 kHz, however implementation of solid angle weighting brought the variation to below ± 0.1 dB. Therefore, providing solid angle weighting is implemented, the quadrature method need not be highly regular.  The minimum number of measurements necessary to calculate a sufficient approximation of the diffuse-field was investigated by rendering diffuse-field responses with a varying number of measurements. The number of measurements ranged from Q = L to Q = 1 × 10 5 using Fibonacci quadrature, for the three tested orders of Ambisonics, and validated through comparison to the numerical integration method. With solid angle weighting implementation, variation in calculated diffuse-field response was as little as ±1 × 10 −4 dB at 20 kHz when Q > 4L, which will be perceptually negligible. Therefore 240 pt. T-design quadrature was used for the remainder of the study to ensure minimal error between the numerical integration method.

Inverse Filter Calculation
For the three tested orders of Ambisonics, 240 points around the sphere, distributed using T-design quadrature (21-design), were encoded into binaural Ambisonics. Diffuse-field responses were simulated from the RMS of each binaural Ambisonic render with solid-angle weighting separately for each ear.
Linear-phase inverse filters were then calculated from the diffuse-field responses using Kirkeby and Nelson's least-mean-square regularisation method [34], which produces perceptually preferred inversions to other currently available methods [35]. 1/4 octave smoothing was implemented using the complex smoothing approach of [36], and the range of inversion was 2 Hz-20 kHz, with in-band and out-band regularisation of 25 dB and 5 dB, respectively. The target response of the inverse filter was the diffuse-field response of the original HRTF dataset.
The diffuse-field responses, inverse filters and resulting equalised frequency responses of the 1st, 3rd and 5th-order binaural Ambisonic loudspeaker configurations are presented in Figure 4. The plots show how the diffuse-field responses vary significantly for all 3 orders above f alias , with deviations as large as 9 dB for 1st order.  DFE was implemented through offline convolution of the calculated inverse filters with the HRIRs used in binaural Ambisonic rendering for each loudspeaker configuration. With truncation of the processed HRIRs, implementation therefore comes at no additional real-time computational cost.
To assess whether the need for DFE applies to other HRIR datasets, the diffuse-field response calculations were repeated for 5th-order using the 18 human datasets of individualised HRIRs from the SADIE II (Spatial Audio for Domestic Interactive Entertainment: https://www.york.ac.uk/sadieproject/.) database [37]. The calculated diffuse-field responses are illustrated in Figure 5, which shows the necessity for diffuse-field equalisation for all HRIR datasets.

Evaluation
DFE of Ambisonics was evaluated in three ways by comparing binaural Ambisonic renders, with and without DFE, to a reference HRIR data set [25] of 16020 HRIRs (89 elevations at 180 azimuths in 2 • increments). The objective change in spectral difference between binaural Ambisonic rendering and standard binaural rendering (HRIR convolution) when implementing DFE was investigated, to measure the influence of DFE over all directions on the sphere. Sagittal plane localisation was assessed through a binaural model, which allows evaluation of many more directions than would be feasible in a listening test. Finally, two listening tests were conducted to measure the perceptual effect of DFE on timbral similarity between binaural Ambisonic rendering and standard binaural rendering.
For each angle of the reference data set, binaural Ambisonic renders were generated and diffuse-field equalised through convolution of each render with the inverse filters of that order of Ambisonics. The reference, Ambisonic and DFE Ambisonic HRIRs were truncated to 1024 taps with 50 sample in/out half-Hanning windows applied. Linear phase was preserved.

Spectral Difference
Basic spectral difference, calculated from the difference between the magnitude value of each frequency bin of the fast Fourier transforms (FFTs) of two audio files, is not an accurate metric for human auditory perception on its own, as the human auditory system's sensitivity differs greatly depending on relative amplitude, frequency and temporal aspects [38]. Therefore, the objective spectral difference calculation aimed to more accurately represent the perceptual effect of absolute spectral differences, and is here on in referred to as perceptual spectral difference (PSD) [39].
PSD was calculated between the Ambisonic HRIRs and reference HRIRs as follows. The HRIR data sets were converted to HRTFs, the frequency domain equivalent of HRIRs, using an FFT with a window size of 4096 samples. Each frequency bin value was amplitude weighted according to ISO Standard 226 equal loudness contours [40] to account for human frequency-dependent amplitude sensitivity. The value of each frequency bin was then weighted to reflect human perception of loudness using the sone scale, at a ratio of +10 phons per doubling of perceived loudness [38,41,42], which also accounts for human auditory features such as spectral peaks being more perceptually significant than notches [43]. Equivalent rectangular bandwidth (ERB) filters were utilised to compensate for the linear frequency interval sampling of the FFT. Rendered data sets were normalised to the mean perceptual loudness level of the reference data set, and PSD was calculated as the absolute difference between each frequency bin of the two data sets.
PSD between reference HRIRs and Ambisonically generated HRIRs, with and without DFE, was calculated for 16020 directions for each tested order. Figure 6 displays the mean absolute values of PSD between Ambisonic HRIRs and reference HRIRs with and without DFE over the sphere (mean of left and right ear PSD calculations). The figures show DFE implementation improves spectral reproduction for a large amount of the sphere, particularly for 1st and 5th-orders, but makes it worse at lateral directions.  The mean absolute values of PSD between reference HRIRs and Ambisonic HRIRs with and without DFE across all angles on the sphere are presented in Table 3. Values were solid-angle weighted to account for the clustering of points at the poles in Gaussian quadrature. The table shows that DFE reduces the overall PSD between Ambisonic HRTFs and reference HRTFs for all three tested configurations.

Sagittal Plane Localisation
The effect of DFE on elevation localisation in the sagittal plane was assessed using a perceptual model [44], which compares two data sets of HRIRs (in this case Ambisonic and reference), and simulates human auditory processing to predict height localisation. It produces two psychoacoustic performance metrics: quadrant error (QE), a prediction of localisation confusion (presented as a percentage), and polar RMS error (PE), a prediction of precision and accuracy in degrees. The frequency range of the model's filterbank was set to 1.5-18 kHz due to the KU 100's lack of torso and therefore no elevation cues below 1.5 kHz [45], and the upper limit of human hearing. Table 4 shows the predicted QE and PE values for Ambisonic orders 1, 3 and 5 and Figure 7 illustrates predicted sagittal plane localisation. Results indicate improved localisation performance with the implementation of DFE for all three tested configurations.

Perceptual Listening Tests
To evaluate the perceptual effect of DFE on timbre in binaural Ambisonic rendering, two listening tests were conducted on 20 participants aged between 20 to 38 (17 male, 2 non-binary, 1 female) with normal hearing (ISO Standard 389 [46]) and prior critical listening experience (such as education or employment in audio or music engineering).
Tests were conducted in a quiet room using a single set of Sennheiser HD 650 circum-aural headphones and an Apple MacBook Pro with a Fireface UCX audio interface, which has software controlled input and output levels. The headphones were equalised using a Neumann KU 100 dummy head from 11 measurements using the swept sine impulse response technique [47] with re-fitting of the headphones between measurements and 1 octave band smoothing in the inverse filter.
Prior to the tests, participants were given the ANSI S1.1-1994 definition of timbre [48] and taken through a training exercise to familiarise themselves with the graphical user interface and task. The base stimulus was one second of monophonic pink noise at a sample rate of 48 kHz, windowed by onset and offset half-Hanning ramps of 5 ms. Each test sound was generated by convolving the pink noise with a HRIR; either Ambisonic or not. The test sound locations (ψ) corresponded to the central points of the faces of a dodecahedron. To reduce the total number of trials, symmetry was assumed and thus only locations in the left hemisphere were used, amounting to 8 locations (see Table 5). Test sounds were normalised to a corresponding A-weighted level of between 65 and 70 dB SPL which is in line with conversational speech listening levels.  The first listening test followed the multiple stimulus test with hidden reference and anchor (MUSHRA) paradigm, ITU-R BS.1534-3 [49]. The reference was a direct HRIR convolution, and medium and low anchors were low-pass filtered versions of the reference stimulus with an f c of 7 kHz and 3.5 kHz, respectively. The other 6 stimuli were the binaural Ambisonic renders for three Ambisonic orders, with and without DFE. For each trial, the listener was asked to rate the 9 stimuli in terms of timbral similarity to the reference. The 8 test sound locations were repeated giving a total of 16 trials. The presentation of stimuli and trials was randomised and double blind.
The second listening test was an AB comparison. Participants were presented with two sets of three consecutive stimuli (corresponding to Ambisonic renders of 1st, 3rd and 5th-orders), one set of which was diffuse-field equalised, and were asked to rate them in terms of timbral consistency. The 8 test sound locations were repeated with a different arrangement of the Ambisonic orders (the first was 1, 3, 5 and the second was 1, 5, 3), giving a total of 16 trials. The presentation of trials was randomised and double blind.

Results
No participants results were excluded, based on the criteria of rating the hidden reference less than 90% for more than 15% of trials or rating the mid-anchor higher than 90% for more than 15% of trials. The results from both listening tests were tested for normality using the Kolmogorov-Smirnov test, which showed all data as non-normally distributed. As a result, all statistical analysis was conducted using non-parametric methods.
The median results of the MUSHRA test, conducted to determine whether DFE reduces the differences in timbre between binaural Ambisonic rendering and HRIR convolution, are shown in Figure 8 for each condition across all test sound locations, with non-parametric 95% confidence intervals (CI) [50]. Friedman's Analysis of Variance (ANOVA) tests showed a statistically significant difference (χ 2 (5) = 247.6, p < 0.05) between standard and DFE binaural Ambisonic rendering for all tested orders and sound locations. 1st-order Ambisonics showed the most improvement, followed by 5th and 3rd.
The perceptual effect of DFE was found to vary with test sound location, with a Friedman's ANOVA showing this variation to be statistically significant (χ 2 (7) = 127.8, p < 0.05). Figure 9 shows the median results with non-parametric 95% CI for each test sound location ψ.
Post-hoc Wilcoxon signed-rank tests determined which test conditions with DFE produced a significant improvement in timbre; the results of which are shown in Table 6. For both 1st and 5th-order Ambisonics, DFE was shown to bring the timbre of binaural Ambisonic rendering closer to HRIR convolution with statistical significance for 5 of the 8 test sound locations. Results for 3rd-order were much less clear and only significant for one test sound location.
The median results of the second listening test, the AB comparison conducted to determine whether DFE improved the consistency of timbre between Ambisonic orders, are shown in Figure 10 for both conditions across all test sound locations, with non-parametric 95% CI. Overall across all test sound locations, DFE produced higher timbral consistency between different Ambisonic orders, and a Friedman's ANOVA test showed that this was statistically significant (χ 2 (1) = 8.45, p < 0.05). To assess how perceived timbral consistency varied with test sound location, a second Friedman's ANOVA was conducted and showed significance (χ 2 (7) = 37.5, p < 0.05). Figure 11 shows the median AB results with non-parametric 95% CI for each test sound location ψ. Post-hoc Wilcoxon signed-rank tests to determine which test sound locations produced statistically significant results were conducted; the results of which are displayed in Table 7.

Discussion
The results showed that binaural Ambisonic rendering with DFE produced an overall improvement in high-frequency reproduction over standard binaural Ambisonic rendering when compared to direct HRIR rendering, as well as improved timbral consistency between different orders of Ambisonics. However, results were not hugely significant, indicating that even with DFE, binaural Ambisonic reproduction still varies considerably in timbre between orders and between direct HRIR rendering.
Results for the 1st and 5th-order configurations were more substantial than 3rd-order, which is likely due to the greater variation in diffuse-field responses for the 1st and 5th-order configurations, and at frequencies with more perceptual importance (see again Figure 4). Thus for these two configurations, equalisation produced a more significant improvement.
Results were shown to vary with sound source location, something that was evident in both the perceptual spectral difference calculations across the sphere and the results of the listening tests. However, no trends between the influence of DFE and specific sound source location attributes (such as increasing elevation or increasing lateralisation) were found. A possible general explanation for the directional variation in influence of DFE is that different sound source locations produce HRTFs that emphasise or cut different frequencies, and a boost or cut to a specific frequency band (i.e. from the diffuse-field equalisation) will therefore affect certain locations more than others.

Conclusions
The inaccuracies of high-frequency reproduction in Ambisonics, caused by comb filtering from the summation of multiple analogous signals at the ears, have been addressed in this paper through the application of the diffuse-field equalisation technique. By implementing diffuse-field equalisation in binaural Ambisonic rendering as a low-computation additional stage to the binaural rendering process, the diffuse-field response of the binaural Ambisonic loudspeaker configuration is flattened out, which changes the frequency response of renders at individual sound source locations. This has been shown to, on average over all directions on the sphere, bring the spectral reproduction of binaural Ambisonic rendering closer to direct HRIR rendering.
For the three tested loudspeaker configurations, corresponding to 1st, 3rd and 5th-order Ambisonics, DFE produced small improvements in spectral difference over the sphere and sagittal plane localisation predictions. Listening tests on timbre corroborated this, though not all test conditions were statistically significant. Listening tests also showed that DFE produces a small improvement in timbral consistency between different orders of Ambisonics. This paper has shown that a low-computation equalisation pre-processing stage can produce an incremental improvement in the high-frequency reproduction of binaural Ambisonic rendering, however there still exists a significant difference between binaural Ambisonic rendering using the virtual loudspeaker approach and direct HRIR rendering. Therefore, diffuse-field equalisation alone is not enough to eradicate the timbral issues posed by Ambisonic rendering, and should be used as a basis for future research in improving Ambisonic rendering at high frequencies.

Conflicts of Interest:
The authors declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.