Comparison of Impulse Response Generation Methods for a Simple Shoebox-Shaped Room

May, Lloyd; Farzaneh, Nima; Das, Orchisama; Abel, Jonathan S.

doi:10.3390/acoustics7030056

¹ Center for Computer Research in Music and Acoustics (CCRMA), Stanford University, Stanford, CA 94305-8180, USA

² Department of Engineering, King’s College, London WC2R 2LS, UK

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Acoustics 2025, 7(3), 56; https://doi.org/10.3390/acoustics7030056

Submission received: 25 June 2025 / Revised: 14 August 2025 / Accepted: 19 August 2025 / Published: 6 September 2025

Abstract

Simulated room impulse responses (RIRs) are important tools for studying architectural acoustics. Many methods exist to generate RIRs, each with unique properties that need to be considered when choosing an RIR synthesis technique. Despite the variation in synthesis techniques, there is a dearth of comparisons between these techniques. To address this, a comprehensive comparison of four major categories of RIR synthesis techniques was conducted: wave-based methods (hybrid FEM and modal analysis), geometrical acoustics methods (the image source method and ray tracing), delay-network reverberators (SDNs), and statistical methods (Sabine-NED). To compare these techniques, RIRs were recorded in a simple shoebox-shaped racquetball court, and we compared the synthesized RIRs against these recordings. We conducted both objective analyses, such as energy decay curves, normalized echo density, and frequency-dependent decay times, and a perceptual assessment of synthesized RIRs, which consisted of a listening assessment with 29 participants that utilized a MUSHRA comparison methodology. Our results reveal distinct advantages and limitations across synthesis categories. For example, the Sabine-NED technique was indistinguishable from the recorded IR, but it does not scale well with increasing geometric complexity. These findings provide valuable insights for selecting appropriate synthesis techniques for applications in architectural acoustics, immersive audio rendering, and virtual reality environments.

Keywords: virtual acoustics; room impulse response synthesis; perceptual evaluation

1. Introduction

Virtual acoustics has become an essential field for recreating the acoustics of physical spaces in digital environments, supporting applications in immersive audio rendering, architectural simulations, and extended reality (XR) [1,2]. By accurately modeling how sound propagates and interacts with an environment, virtual acoustics enhances realism in games and AR/VR experiences [3,4]. Furthermore, it enables historical reconstructions of vanished spaces, allowing researchers to study the acoustics of past environments and their impact on site-specific music and cultural heritage [5,6]. Neuroscientific research also suggests that architectural acoustics influences cognition and perception, underscoring the importance of realistic simulations for research in psychoacoustics and spatial cognition [7].

A critical component of virtual acoustics is room impulse response (RIR) synthesis, which simulates how sound propagates in a given space. An ideal RIR synthesis method should balance accuracy, computational efficiency, and adaptability. Accuracy refers to how closely a synthesized RIR replicates a real-world impulse response, often evaluated using perceptually motivated metrics such as normalized echo density [8], reverberation time ($T_{60}$) [9], and spatial reflection patterns [10]. Computational efficiency is crucial for real-time applications such as gaming and interactive AR/VR, where rendering must occur with minimal latency [11]. The complexity of an RIR synthesis method also depends on whether it must generate a single static response or compute a spatially dynamic IR map for a moving listener [12].

Several methods exist for synthesizing RIRs, each with trade-offs in accuracy and efficiency [11,13]. Wave-based simulations, such as the Finite-Difference Time-Domain (FDTD) method [14], the Boundary Element Method (BEM) [15], and the Finite Element Method (FEM) [16] solve the 3D wave equation given the appropriate geometric details of the space and its boundary conditions. They discretize finite volume or surface patches and solve for the pressure and normal velocity for each patch. Hence, they can capture low-frequency behavior and diffraction effects. For high accuracy, the discretization mesh must have adequate spatial resolution, and a high sampling frequency is required to avoid dispersion errors [17]. Thus, their high computational cost limits real-time applicability. Modal analysis is another technique that is physically motivated. Modes, or standing waves, are solutions to the Helmholtz equation [18]. Modal parameters can be estimated from the room impulse response using matrix-pencil methods [19], or simple peak-picking from the magnitude spectrum [20]. The reverberated output can be generated efficiently in real time by passing the input signal through a network of parallel biquad filters, each synthesizing a single mode [20].

Geometrical acoustics (GA) methods, on the other hand, treat acoustic waves as rays [12]. This assumption is valid in high-frequency applications, where these methods are most accurate. The image source method (ISM) [21] belongs to this category. The ISM places multiple mirror sources behind each reflecting surface and calculates the acoustic propagation path from the source to each mirror source and from the mirror source to the receiver. It provides an exact solution to the wave equation for a parallelepipedical room with rigid walls. All reflections are assumed to be specular, and there is no diffraction due to perpendicular angles between walls. These assumptions do not hold in real rooms. The image method was extended to arbitrary polyhedra in [22]. A more efficient ISM for RIR synthesis in shoebox rooms was proposed in [23].

Ray tracing is another popular GA method [24,25]. Millions of rays are cast from a sound source, reflect off each surface, and propagate towards the receiver. The rays are emitted from the source according to a pre-defined distribution or in random directions, and the paths that reach the receiver are found using a Monte Carlo search. A widely used GA-based software package for room acoustic simulations is ODEON, which combines ISM and ray-tracing techniques to model sound propagation efficiently. ODEON incorporates a scattering algorithm that accounts for both specular and diffuse reflections, improving the accuracy of room acoustic predictions compared to pure ISM or ray-tracing approaches [26]. Furthermore, ODEON integrates diffraction modeling through secondary-source edge diffraction algorithms, enabling more realistic early reflection modeling, particularly in non-shoebox spaces [27]. This approach allows ODEON to achieve better alignment with measured RIRs, making it a preferred tool for architectural acoustics, concert hall design, and auralization studies [28]. Unlike path-based methods such as ray tracing and ISM, surface-based GA methods, such as acoustic radiance transfer [10], store the intermediate energy collected by all surfaces as the source propagates to the receiver.

Hybrid approaches attempt to combine the strengths of GA and wave-based methods, improving low-frequency accuracy while maintaining efficiency, and have gained popularity due to their trade-off between computational complexity and accuracy [29,30]. Recently, cloud based simulation suites, such as Treble (https://www.treble.tech/software-development-kit, accessed on 1 September 2024), have used hybrid approaches for simulating acoustics in complex scenes with high precision [31].

Another widely used class of RIR synthesis methods is delay-network reverberators, which are optimized for real-time rendering rather than physical accuracy. While most synthesis techniques yield RIRs that need to be convolved with the dry input signal, delay networks process the input signal directly and give a reverberated output. Feedback Delay Networks (FDNs) use a system of delay lines with recursive filtering to synthesize reverberations [32].

The FDN produces echoes whose density increases polynomially with time, until it resembles white noise. This is purely a perceptual model that tries to model the switch from an early response to a diffuse late reverb. Similarly, statistical methods, such as white-noise shaping in critical bands and velvet-noise reverberators [33], try to create a perceptually accurate diffuse field by modeling the decay time of dense noise sequences in frequency bands. For this study, a statistical method that generates a pattern of echoes based on an empty shoebox room was implemented. The proposed method filters the generated pattern of echoes to have a desired decay time based on Sabine’s formula. This method will be referred to as the Sabine-NED method.

Other physically accurate delay networks include the digital waveguide network (DWN) [34]. DWNs have a network of bidirectional delay lines interconnected at lossless junctions. They represent a network of interconnected acoustic tubes. Inspired by the DWN, the scattering delay network (SDN) was proposed in [35]. The SDN is a more efficient architecture that models the first-order reflections in a shoebox room exactly and approximates the higher-order reflections. Due to its efficiency, it can model moving sources and listeners in real time. More recent improvements, such as higher-order SDNs, have further enhanced their realism while keeping computational costs low [36,37]. SDNs extend FDNs by explicitly modeling early reflections, offering a balance between computational efficiency and spatial accuracy.

While previous studies have explored individual RIR synthesis methods, to our knowledge, no research has conducted a systematic objective and perceptual comparison of the major categories of RIR synthesis techniques—wave-based, geometrical acoustics, delay-network-based, and statistical methods—for a simple shoebox-shaped room. Prior work, such as that by Brinkmann et al. [28], evaluated the perceptual realism of auralizations from five commercially available software suites (ODEON, RAVEN, EASE, RAZR, and BRASS). However, their analysis focused primarily on geometrical acoustics methods, with RAZR incorporating an FDN for late reverberation synthesis. Additionally, studies such as those by Djordjevic et al. [38] and Mi et al. [39] examined the perceptual realism of artificial reverberation algorithms, yet they did not compare them directly to measured RIRs in controlled environments. Despite advances, no current method achieves perceptual indistinguishability [28,40].

This study aims to bridge this gap by conducting a comprehensive comparison of RIR synthesis techniques, evaluating objective metrics such as energy decay curves, normalized echo density, reverberation times, and spatial characteristics, as well as perceptual quality through listening tests conducted using the MUSHRA methodology [41]. Unlike previous work, the synthesized RIRs were validated against measured impulse responses from a shoebox-shaped racquetball court, allowing for a controlled and repeatable evaluation. The goal of this work is to provide insights into the strengths and weaknesses of each method, guiding their appropriate use in applications such as architectural acoustics, immersive audio rendering, and virtual reality environments.

This paper presents a comparative evaluation, against measurements, of the following four major categories of RIR synthesis techniques for a simple shoebox room:

Wave-based methods (hybrid FEM and modal analysis);
Geometrical acoustics methods (the image method and ray tracing);
Delay-network reverberators (SDNs);
Statistical methods (Sabine-NED).

The rest of this paper is organized as follows: In Section 2, the configuration of the measurement recordings is described, and additional details of the measured space are provided. In Section 3, the synthesis methods compared in this study are described. Section 4 presents objective metrics and the results of the listening test comparing various synthesis methods against the measured reference. Finally, the advantages and shortcomings of each method are discussed in Section 5, and conclusions are drawn in Section 6.

2. Ground-Truth IR Recordings

A racquetball court measuring 20′ × 40′ × 20′ (6.1 m × 12.2 m × 6.1 m), as shown in Figure 1, was chosen as a simple shoebox-shaped room. The walls and ceiling of the court were made of medium-density fibreboard (MDF). The door section of the back wall was made of 0.5″ (0.0127 m) thick glass, and the floor was varnished wood slats. Impulses were generated by either popping a near-spherical balloon or playing a 10 s sine sweep through a Bose SoundLink Revolve Plus ii speaker. Recordings were made at 48 kHz with a 4-channel Core Sound TetraMic connected to a Zoom F8n field recorder. All equipment was calibrated before recordings.

The recordings were made on a court with racquetball courts on either side at the Arrillaga Family Racquetball & Gymnastics Center on Stanford University’s campus. The recordings were taken during a single recording session in March 2021, during which very little activity was present on campus. As a result, the recordings have no audible external sounds or interruptions. All sources (balloon or speaker) and receivers (TetraMic) were placed with their acoustic centers 5 ft (1.52 m) above the ground. The height of the sources and receivers remained constant during the recordings. The room was at 22 °C at the time the measurements were made. The source and the receiver were moved to 10 predetermined locations in the room, as shown in Figure 2, to adequately capture the effect of positional variation.

3. IR Synthesis Techniques

We explore and compare the image method, hybrid ray-tracing simulations, modal synthesis, hybrid FEM, the SDN, and the Sabine-NED method as synthesis techniques in this paper. These methods are used to re-synthesize the measured RIRs at various source–receiver locations in the racquetball court.

3.1. Image Method

The image method, introduced in [21] and extended to arbitrary polyhedra by Jeffrey Borish [22], approximates the room impulse response between a source and a listener by generating a set of virtual sources or “images” representing the effect of reflecting surfaces on the actual source. The approximation is valid when the wavelength of sound considered is significantly less than the dimensions of the reflecting surfaces comprising the modeled space. It gives an exact solution for the Green’s function in shoebox-shaped rooms with rigid walls [21], and it is used to compute early reflections from large, flat surfaces in a space.

The method may be described by considering that the sound field created by a source in the presence of a single, infinitely large reflecting surface is equivalent to that created by two sources: the original source and an “image” source that is positioned and oriented by reflecting the original source through the surface. In this way, the source and the reflecting surface have been replaced by two sources. Similarly, a source in an enclosed space defined by a set of large, flat reflecting surfaces may be replaced by the original source and a set of image sources that are appropriately positioned and oriented. The impulse response of the space is then modeled as the set of pulses radiated from the image sources, filtered according to the associated surface interactions.

A space is modeled by first reflecting the source through planes containing each of the surfaces to create a set of image sources. In turn, each of these first-order image sources is reflected through each reflecting surface that has its interior facing the image source to generate second-order source images. This process is repeated to generate a cloud of image sources that is sufficiently large to produce the desired room impulse response duration. Here, RIRs were generated using an image method with a reflection order of 100.

The room impulse response is then found by summing the responses of the source and virtual sources individually. This is done by first verifying whether each of the image sources is visible to the listener, and then delaying and scaling an impulse emitted by each visible image source according to its distance to the listener, applying any source radiation and listener antenna patterns, and filtering the processed pulse according to the material properties of the reflecting surfaces encountered between the virtual source and the listener. Finally, a time-varying filter modeling air absorption as a function of propagation distance (as described in [42]) is applied.
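As a rough illustration of the steps above, the following Python sketch enumerates image sources for a shoebox room on the standard mirror lattice and accumulates delayed, scaled impulses. It is a simplified sketch, not the implementation used in this study: the function name, the single broadband reflection coefficient `r`, the receiver position, the 1/distance scaling, and the nearest-sample placement are all assumptions, and source/listener patterns, reflection filters, and air absorption are omitted. In a shoebox room every image is visible to the listener, so no visibility test is needed.

```python
import itertools
import numpy as np

def shoebox_image_rir(room, src, rcv, r=0.9, max_order=3, fs=48000, c=343.0):
    """Toy image-method RIR for a shoebox room (hypothetical helper).

    room, src, rcv: (x, y, z) room dimensions and positions in metres.
    r: one broadband amplitude reflection coefficient shared by all walls (assumption).
    """
    room, src, rcv = (np.asarray(v, dtype=float) for v in (room, src, rcv))
    taps = []  # (delay in seconds, amplitude) for each image source
    idx = range(-max_order, max_order + 1)
    for n in itertools.product(idx, repeat=3):
        for p in itertools.product((0, 1), repeat=3):
            n_arr, p_arr = np.array(n), np.array(p)
            # Number of wall reflections undergone by this image, summed over the axes.
            order = int(np.sum(np.abs(n_arr - p_arr) + np.abs(n_arr)))
            if order > max_order:
                continue
            # Mirror the source about the walls and shift by whole room periods.
            image = (1 - 2 * p_arr) * src + 2 * n_arr * room
            dist = np.linalg.norm(image - rcv)
            taps.append((dist / c, r ** order / dist))
    length = int(np.ceil(max(t for t, _ in taps) * fs)) + 1
    h = np.zeros(length)
    for t, a in taps:
        h[int(round(t * fs))] += a  # nearest-sample placement (no fractional delay)
    return h, taps

# Court dimensions and source position from the text; the receiver position is hypothetical.
h, taps = shoebox_image_rir((6.1, 12.2, 6.1), (5.3, 6.7, 1.5), (3.0, 9.0, 1.5), max_order=1)
```

With `max_order=1` the sketch yields the direct path plus the six first-order wall reflections; raising `max_order` grows the image cloud roughly with the cube of the order, which is why high reflection orders are costly.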

The racquetball court is a case for which the image method is expected to produce a good approximation to the measured impulse response. In the interest of computational efficiency, rather than modeling the racquetball court as being made of a set of surfaces with separate materials for the walls, floor, ceiling, recessed lightboxes, and plexiglass rear wall portion, all of the surfaces were assumed to have the same reflection properties, with the properties designed to reproduce the reverberation time computed using the Sabine formula (without air absorption) [9]. The Sabine formula relates the reverberation time, $T_{60}$, to the volume of the space, V, and the absorbing surface area, $\sum_{i} α_{i} S_{i}$, where $α_{i}$ is the absorption coefficient and $S_{i}$ is the area of the ith surface.

\begin{matrix} T_{60_{b}} & = 0.161 \frac{V}{\sum_{i} α_{i_{b}} S_{i}} \approx 0.161 \frac{V}{α_{b} \sum_{i} S_{i}}, \\ α_{b} & \approx 0.161 \frac{V}{T_{60_{b}} \sum_{i} S_{i}}, \\ r_{b} & = \sqrt{1 - α_{b}} . \end{matrix}

(1)

Here, the subscript b denotes a frequency band. The coefficients specified in Table 1 are used to calculate $T_{60_{b}}$, and the overall absorption coefficient $α_{b}$ is calculated from $T_{60_{b}}$. Then, the reflection coefficient $r_{b}$ is calculated from $α_{b}$. (Note that while the absorption coefficient $α_{b}$ describes the energy absorbed, the reflection coefficient $r_{b}$ relates to the amplitude of the reflected wave. Accordingly, the reflection coefficient is the square root of the portion of the energy reflected, that is, one minus the portion of the energy absorbed.) IIR filters are fitted to the band-wise reflection coefficients $r_{b}$ using frequency-warped Prony’s method [43] to obtain frequency-dependent reflection filters. By doing this, there is no need to apply a specific set of filters to each of the myriad image sources, as all image source responses of a given order may be summed and then filtered once according to the generic material’s filter.
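The chain in Equation (1), from a band reverberation time to an equivalent absorption coefficient and then to an amplitude reflection coefficient, can be sketched in a few lines of Python. The function name is hypothetical, and the 5.0 s reverberation time is a placeholder chosen only for illustration; the band values actually used in the study come from Table 1.

```python
import math

def sabine_reflection(volume, surface_area, t60):
    """Return (alpha_b, r_b) for one frequency band using Sabine's formula (Eq. 1)."""
    alpha = 0.161 * volume / (t60 * surface_area)
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("Sabine absorption coefficient outside [0, 1]")
    # Energy reflected is 1 - alpha; the amplitude reflection coefficient is its square root.
    return alpha, math.sqrt(1.0 - alpha)

# Court dimensions from the text (metres).
Lx, Ly, Lz = 6.1, 12.2, 6.1
V = Lx * Ly * Lz
S = 2.0 * (Lx * Ly + Lx * Lz + Ly * Lz)

# Assumed band reverberation time (placeholder, not a value from Table 1).
alpha_b, r_b = sabine_reflection(V, S, t60=5.0)
```

For a highly reverberant space such as this one, the equivalent absorption is only a few percent, so the reflection coefficient sits close to one.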

We should point out that certain source or listener positions in the racquetball court, such as halfway or one-third of the way between a pair of walls, will produce prominent sweeping echoes or chirps in the impulse response, as explained by De Sena et al. [44]. These chirps are observed in the impulse response measurements, though they are more subtle than in the image method simulations.

The complex image method [45,46] deals with the sweeping echoes by using the spherical-wave reflection function instead of the plane-wave reflection coefficient in the image method. However, this method is computationally more expensive. Therefore, the sweeping echoes were mitigated by adding a small random displacement to the positions of the sources and the listeners as proposed in [44].

3.2. Ray-Tracing Hybrid

ODEON is a widely used software package for room acoustic modeling and is particularly suitable for geometric models with high modal overlap [26]. The software combines multiple geometrical acoustic (GA) algorithms to achieve accurate impulse response predictions while optimizing computational efficiency. It integrates ray tracing, the image source method (ISM), the early scattering method (ESM), and the ray radiosity method (RRM) to model both early and late reflections effectively.

For early reflections, ODEON employs a combination of ray tracing and ISM, ensuring that specular reflections and material-dependent absorption are accurately captured [21,22]. In this phase, multiple rays are generated and emitted from the sound source, with their paths traced until they reach the receiver. The early scattering method further refines this process by simulating diffuse reflections and introducing secondary scattered rays to account for surface roughness and non-specular behavior [26].

Once the simulation reaches a transition order, where reflections become more randomized, ODEON switches to the ray radiosity method for late reflections. Ray radiosity modifies traditional ray tracing by incorporating a detection sphere around the receiver, ensuring that late reverberant energy is captured uniformly [24]. Unlike real-world measurements that operate in the pressure domain, ODEON’s calculations occur in the energy domain, simplifying statistical modeling of late reverberations.

This hybrid approach allows ODEON to achieve more realistic room impulse responses than traditional GA-only models, making it highly effective for architectural acoustics, auralization, and concert hall simulations [27]. Despite its strengths, ODEON does not fully account for wave-based effects like diffraction and phase interference, limiting its accuracy in low-frequency modeling and edge diffraction scenarios [28]. Future improvements could involve hybridizing ODEON’s GA approach with wave-based solvers to enhance low-frequency response accuracy while maintaining computational efficiency.

3.2.1. Three-Dimensional Model and Simulation Parameters

Table 2 outlines the key settings and configurations used in the ODEON simulation.

3.2.2. Material Absorption Coefficients

Table 1 presents the absorption coefficients assigned to various materials in the modeled environment. The absorption coefficients were retrieved from the ODEON material properties library, which compiles coefficients from the relevant literature [47,48,49].

3.3. Modal Analysis Method

The response of an acoustic space may be found by considering the modes of the space, as well as the frequencies and spatial patterns associated with standing waves in the space. These standing waves are solutions to the frequency-domain counterpart of the wave equation, i.e., the Helmholtz equation. The impulse response $h (t)$ between a source and a listener may be written as the following linear combination of characteristic resonances:

h (t) = ℜ [\sum_{m = 1}^{\infty} γ_{m} exp {j ω_{m} t - α_{m} t}],

(2)

where $ω_{m}$ represents the mode frequency, $α_{m}$ represents the mode damping, and $γ_{m}$ represents the mode gain, which is given by the product of the mode spatial pattern evaluated at the source and listener positions, $ψ_{m} (ξ_{s}) ψ_{m} (ξ_{l})$, i.e., the strength of sound source and listener coupling to the modes at their respective positions ($ξ_{s}$ for the source and $ξ_{l}$ for the listener) and for their respective polar patterns.

In a shoebox-shaped room with rigid surfaces and dimensions of $(L_{x}, L_{y}, L_{z})$, the modal shapes are sinusoidal along the axes of the shoebox, with each side length equal to an integer multiple of the half-wavelength along that axis [7],

ψ (x, y, z) = cos (\frac{m_{x} π x}{L_{x}}) cos (\frac{m_{y} π y}{L_{y}}) cos (\frac{m_{z} π z}{L_{z}}),

(3)

and oscillate at their respective mode frequencies,

ω_{m_{x} m_{y} m_{z}} = 2 π c {[\frac{m_{x}^{2}}{4 L_{x}^{2}} + \frac{m_{y}^{2}}{4 L_{y}^{2}} + \frac{m_{z}^{2}}{4 L_{z}^{2}}]}^{\frac{1}{2}},

(4)

where c is the speed of sound, $x, y, z$ are 3D Cartesian coordinates, and $m_{x}, m_{y}, m_{z} \in Z^{+}$. Here, the rigid surfaces lead to mode responses which do not decay, with $α_{m_{x} m_{y} m_{z}} = 0$. When the shoebox surfaces are not perfectly reflecting, the mode responses will decay over time. In addition, while the mode spatial patterns still oscillate over space, their amplitudes will vary according to the surface reflection coefficients [50].
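To make Equation (4) concrete, the short Python sketch below lists the lowest rigid-wall mode frequencies of the court. The function name and the sound speed of 343 m/s are assumptions, and the indices are allowed to be zero (but not all zero at once) so that axial and tangential modes are included, which is the usual convention for rigid-walled rooms.

```python
import itertools
import math

def mode_frequency_hz(m, dims, c=343.0):
    """Rigid-wall shoebox mode frequency in Hz, i.e. Eq. (4) divided by 2*pi."""
    return 0.5 * c * math.sqrt(sum((mi / Li) ** 2 for mi, Li in zip(m, dims)))

dims = (6.1, 12.2, 6.1)  # court dimensions in metres, from the text

# Enumerate low-order index triples, skipping the trivial (0, 0, 0) case.
modes = sorted(
    (mode_frequency_hz(m, dims), m)
    for m in itertools.product(range(4), repeat=3)
    if any(m)
)

for freq, m in modes[:5]:
    print(f"{m}: {freq:.2f} Hz")
```

The lowest mode is the axial mode along the 12.2 m dimension, at c / (2 · 12.2 m), and several modes coincide in frequency because one side is exactly twice the length of the other two.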

The number of modes needed to simulate a given room’s impulse response increases with the cube of the sampling rate, $f_{s}$,

\sum_{m_{x}, m_{y}, m_{z}} 1 \{\frac{ω_{m_{x} m_{y} m_{z}}}{2 π} < \frac{f_{s}}{2}\} .

(5)

For a sampling rate of 48 kHz (a 24 kHz Nyquist limit), a simulation of wavelengths down to roughly 1.5 cm is needed, requiring contributions from around 155 million modes. In the interest of computational savings, the impulse response was simulated out to 8 kHz, requiring around 5.7 million modes. Additionally, as the racquetball court’s surfaces are highly reflective, the mode dampings $α_{m_{x} m_{y} m_{z}}$ would be close to zero. Accordingly, the simulation was simplified by setting the dampings to zero, summing the mode responses for mode frequencies up to the desired bandwidth to form an undamped response, and then applying a time-varying filter representing the effect of the overall decay rate, including contributions from the surface materials and air absorption.
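The simplification described above (sum undamped modes, then impose the decay afterwards) can be sketched as follows. This is a heavily reduced illustration under stated assumptions: the function name, the 150 Hz bandwidth, the 2 kHz sampling rate, the receiver position, and the 5.0 s reverberation time are all placeholders, and a single broadband exponential envelope stands in for the frequency-dependent, time-varying filter used in the study.

```python
import itertools
import numpy as np

def modal_rir(dims, src, rcv, f_max=150.0, t60=5.0, fs=2000, dur=1.0, c=343.0):
    """Sum undamped rigid-wall modes below f_max, then apply one global decay envelope."""
    dims, src, rcv = (np.asarray(v, dtype=float) for v in (dims, src, rcv))
    t = np.arange(int(dur * fs)) / fs
    # Largest index per axis whose axial mode can still fall below f_max.
    m_max = np.ceil(2.0 * dims * f_max / c).astype(int)
    h = np.zeros_like(t)
    n_modes = 0
    for m in itertools.product(*(range(k + 1) for k in m_max)):
        m = np.array(m)
        f = 0.5 * c * np.sqrt(np.sum((m / dims) ** 2))  # Eq. (4) in Hz
        if f == 0.0 or f >= f_max:
            continue
        # Mode gain: cosine mode shape (Eq. 3) evaluated at source and listener.
        gain = np.prod(np.cos(m * np.pi * src / dims)) * np.prod(np.cos(m * np.pi * rcv / dims))
        h += gain * np.cos(2.0 * np.pi * f * t)
        n_modes += 1
    # Broadband stand-in for the decay filter: amplitude falls by 60 dB over t60 seconds.
    h *= 10.0 ** (-3.0 * t / t60)
    return h, n_modes

# Source position from the text; receiver position is hypothetical.
h, n_modes = modal_rir((6.1, 12.2, 6.1), (5.3, 6.7, 1.5), (3.0, 9.0, 1.5))
```

Even at this very low bandwidth the loop visits a few hundred modes, and the count grows with the cube of the bandwidth, which is what motivates truncating the simulation below the Nyquist limit.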

3.4. Finite Element Analysis Hybrid Method

The Finite Element Method [51] is used to obtain the pressure field in the racquetball court. In the FEM, a weak formulation is solved by integrating the Helmholtz equation with a weighting function. The integral is then discretized with shape functions, and it is solved for specific wave numbers on a fine spatial grid. The ‘Robin-type’ boundary conditions impose wall absorption filters. Since the FEM is a wave solver, it solves for the pressure field in the court directly, up to the discretization error of the mesh.

The Helmholtz equation with an impedance boundary condition is given by

\begin{matrix} \nabla^{2} p + k^{2} p & = - Q (k, x) \\ \nabla p . \hat{n} & = j k β (k) p, \end{matrix}

(6)

where p is the pressure field, $\hat{n}$ is the normal vector at the boundary, k is the wave number, $Q (k, x)$ is the position- and frequency-dependent source, $β (k)$ is the frequency-dependent boundary admittance, and ∇ is the gradient operator. The weak form of the Helmholtz equation is obtained by multiplying by a weighting function, q, and integrating over the entire volume, $Ω$. The boundary condition is imposed by a surface integral of the admittance over all the boundaries.

\begin{matrix} \int_{Ω} (k^{2} q p - \nabla q \nabla p + q Q (k, x)) d Ω + \sum_{i = 1}^{N} \int_{S_{i}} j k β_{i} (k) p q d s = 0 \end{matrix}

(7)

In the simulations, a sampling frequency of 44.1 kHz was used. The pressure field was calculated within a cuboid mesh of the same dimensions as the racquetball court. Monopole sources of the form $Q (k, x) = \frac{1}{\| x \|} exp (- j k \| x \|)$ were placed at $x = (5.3, 6.7, 1.5)$ m, $(3.048, 9.14, 1.5)$ m, and $(3.048, 6.1, 1.5)$ m.

To calculate the admittance at the boundary as a function of frequency, the same reflection filters are used as in the image method (Section 3.1),

\begin{matrix} β (e^{- j ω}) & = \frac{1}{ρ_{0} c} \frac{1 - r (e^{- j ω})}{1 + r (e^{- j ω})} \end{matrix}

(8)

where $r (e^{- j ω})$ is the reflection filter’s frequency response at frequency $ω$, $β (e^{- j ω})$ is the wall admittance at frequency $ω$, and $ρ_{0} c$ is the characteristic impedance of air (c is the speed of sound in air). From the reflection filters, the wall admittance is calculated for a specific wave number using $k = ω / c$. The weak form is solved for $N = 2 f_{s} max (T_{60})$ frequencies with FreeFEM++ [52] utilizing a linear shape function.
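Equation (8) is a one-line mapping from a (possibly complex) reflection coefficient to a wall admittance. A minimal Python sketch follows; the function name and the air density of 1.2 kg/m³ are assumptions made for illustration.

```python
def wall_admittance(r, rho0=1.2, c=343.0):
    """Wall admittance from a reflection coefficient, following Eq. (8).

    r may be a real or complex scalar (or a NumPy array of filter responses).
    rho0 and c are assumed values for the density of air and the speed of sound.
    """
    return (1.0 - r) / (rho0 * c * (1.0 + r))

beta_rigid = wall_admittance(1.0)          # perfectly reflecting wall
beta_matched = wall_admittance(0.0)        # fully absorbing wall
beta_complex = wall_admittance(0.98 - 0.01j)  # hypothetical filter response at one frequency
```

The two limiting cases are a useful sanity check: a rigid wall (r = 1) gives zero admittance, and a fully absorbing wall (r = 0) gives the reciprocal of the characteristic impedance of air.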

The FEM runs in the frequency domain and is used for obtaining the room transfer function up to the Schroeder frequency (since other geometric methods are least accurate in the low-frequency region, and FEM computation over the entire frequency range is extremely time-consuming and expensive). The Schroeder frequency is calculated from the average reverberation time ($T_{60}$) as [7]

f_{Schr} = 2000 \sqrt{\frac{max (T_{60})}{l \times w \times h}} .

(9)

Here, $l$, $w$, and $h$ stand for the length, width, and height of the shoebox room. For the room in consideration, the Schroeder frequency is 210 Hz. A mesh spacing of $δ < \frac{c}{2 f_{Schr}}$ must be chosen to obtain accurate results up to the Schroeder frequency. $δ = 0.2$ m was chosen.
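The two design quantities in this paragraph, the Schroeder frequency of Equation (9) and the resulting upper bound on the mesh spacing, can be checked with a short Python sketch. The function names and the sound speed are assumptions, and the reverberation time of 5.0 s is a placeholder chosen because it reproduces a value close to the 210 Hz quoted above; the measured value is not stated in this section.

```python
import math

def schroeder_frequency(t60, volume):
    """Schroeder frequency in Hz from Eq. (9), with t60 in seconds and volume in m^3."""
    return 2000.0 * math.sqrt(t60 / volume)

def max_mesh_spacing(f_max, c=343.0):
    """Upper bound on the mesh spacing in metres: delta < c / (2 * f_max)."""
    return c / (2.0 * f_max)

volume = 6.1 * 12.2 * 6.1                     # court volume in m^3, from the text
f_schr = schroeder_frequency(5.0, volume)     # assumed T60 of 5.0 s (placeholder)
delta_max = max_mesh_spacing(f_schr)

print(f"Schroeder frequency: {f_schr:.1f} Hz, mesh spacing bound: {delta_max:.2f} m")
```

Under these assumptions the bound on the spacing is several times larger than the 0.2 m that was chosen, so the chosen mesh satisfies the criterion with a comfortable margin.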

The pressure field is obtained from the FEM simulation for the entire 3D mesh. Then, RIRs at mesh positions nearest to the desired receiver locations are selected. A two-pole high-pass filter with the cutoff at 20 Hz is used to smooth the spurious DC response. Beyond the Schroeder frequency, the image method is used to calculate the RIR. At the Schroeder frequency, the frequency responses of the FEM and ISM are combined after compensating for the difference in their magnitude response’s energy. An inverse FFT yields the final hybrid RIR.

3.5. Scattering Delay Networks

Scattering delay networks [35] (SDNs) consist of delay lines connecting wall junctions that are placed at the location of first-order reflections. The delay line lengths are chosen according to the source–listener positions and room geometry. All the wall junctions are connected to each other via a scattering matrix, which introduces recursion in the network. SDNs can render first-order reflections accurately and approximate higher-order reflections coarsely. The design is ideal for real-time implementation within dynamic scenes, and it generally provides higher perceived naturalness than Feedback Delay Networks since early reflections are modeled correctly [38].

In the simulations, the RIR with the first-order SDN at a sampling frequency of 44.1 kHz was calculated. Bidirectional delay lines connect the nodes at the first-order reflection positions. The lengths of the delay lines are determined by the time taken for sound propagation from one node to another. An isotropic scattering matrix distributes the energy among the nodes, with

S = \frac{2}{N - 1} 11^{T} - I

(10)

where $N = 6$ is the number of walls in the room, $1 \in R^{(N - 1) \times 1}$ is a vector of ones, and $I \in R^{(N - 1) \times (N - 1)}$ is an identity matrix.
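The isotropic scattering matrix of Equation (10) is easy to construct and verify numerically. The Python sketch below builds it for a six-walled room and checks that it is orthogonal, which is the property that makes the scattering operation lossless; the function name is hypothetical.

```python
import numpy as np

def isotropic_scattering_matrix(n_walls=6):
    """Isotropic scattering matrix of Eq. (10): S = 2/(N-1) * 1 1^T - I."""
    k = n_walls - 1  # each wall node connects to the other N - 1 nodes
    ones = np.ones((k, 1))
    return (2.0 / k) * (ones @ ones.T) - np.eye(k)

S = isotropic_scattering_matrix()

# A lossless junction preserves energy, so S must be orthogonal.
is_lossless = np.allclose(S @ S.T, np.eye(S.shape[0]))
```

Every incoming wave is spread equally over the outgoing lines, with the diagonal entry (2/5 − 1 = −0.6 for N = 6) accounting for the portion sent back along the incoming line.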

Unidirectional delay lines connect the source to the receiver, the source to the nodes, and the nodes to the receiver. The cuboid room used in the simulation has the same dimensions as the racquetball court. The source is placed at $(5.3, 6.7, 1.5)$ m, and the pressure output at multiple receiver positions is calculated. The reflection filters are calculated from the RIRs using the same method as in Section 3.1. An air absorption FIR filter is added in each delay line at a temperature of 20 °C and an atmospheric pressure of 101.325 kPa.

3.6. Sabine-NED Method

The idea behind the Sabine-NED method is that the reverberation time as a function of frequency, together with the wet–dry mix and normalized echo density profile, goes a long way toward describing the psychoacoustic impression of an acoustic space. In the Sabine-NED method, an echo pattern is synthesized according to an estimated echo density as a function of time $η (t)$, and then processed to express the desired decay time as a function of frequency f, $T_{60} (f)$.

The design controls are similar to those of Feedback Delay Network (FDN) reverberators [32]. The FDN delay line lengths and off-diagonal energy of the orthonormal feedback matrix will determine the time needed for the impulse response to evolve from a sparse, sputtery-sounding response to a smooth, noise-like one as the echo density builds, while the feedback filters determine the decay time as a function of frequency.

Factors such as temperature [53] and room occupancy [54] affect the acoustic characteristics of a room. In the case of the empty shoebox-shaped racquetball court, which is without furniture or other diffusing objects, roughly one echo per racquetball court volume enclosed by a sphere radiating at the speed of sound from the source is expected. The total number of echoes in the space at time t is then roughly $\frac{4 π}{3} \cdot c^{3} t^{3} / V$, with c being the sound speed and V being the room volume [7]. Therefore, at time t, the echo density as a function of time, $ρ (t)$, which is the time derivative of the total number of echoes in the space at time t, is roughly

ρ (t) = 4 π c^{3} t^{2} / V

(11)

Here, the echo density is simulated by generating pulse times according to a Poisson process whose mean rate (the reciprocal of the mean time interval between pulses) is given by

ρ (t)

. Pulses having Gaussian-distributed amplitudes (proportional to the inverse root echo density so that the sequence energy over time is roughly constant) were placed at the statistically generated times via sinc interpolation. To save computation time, the echo density was limited to 500,000 echoes per second, which produces the perceptual equivalent of Gaussian noise.
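The echo-pattern step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sabine_ned_echo_pattern` and its parameters are hypothetical, pulses are placed at the nearest sample rather than by sinc interpolation, and the subsequent imprinting of the decay time is omitted. Pulse times are drawn from an inhomogeneous Poisson process by inverting the cumulative echo count.

```python
import numpy as np

def sabine_ned_echo_pattern(volume, duration, fs=48000, c=343.0,
                            max_density=500_000.0, seed=0):
    """Sparse-to-dense echo pattern with echo density 4*pi*c^3*t^2/V (Eq. 11).

    Hypothetical helper: nearest-sample pulse placement is a simplification
    of the sinc interpolation described in the text.
    """
    rng = np.random.default_rng(seed)
    a = 4.0 * np.pi * c**3 / (3.0 * volume)      # cumulative count N(t) = a t^3
    t_cap = np.sqrt(max_density / (3.0 * a))     # time at which density hits the cap
    n_cap = a * t_cap**3                         # echoes accumulated before the cap
    if duration <= t_cap:
        total = a * duration**3
    else:
        total = n_cap + max_density * (duration - t_cap)

    # Unit-rate Poisson arrivals, mapped through the inverse cumulative count.
    n_draw = int(total + 10.0 * np.sqrt(total) + 10)
    u = np.cumsum(rng.exponential(1.0, n_draw))
    u = u[u < total]
    t = np.where(u <= n_cap, np.cbrt(u / a), t_cap + (u - n_cap) / max_density)

    # Gaussian amplitudes scaled by the inverse root echo density so that
    # the expected energy per unit time stays roughly constant.
    rho = np.minimum(3.0 * a * t**2, max_density)
    amps = rng.standard_normal(t.size) / np.sqrt(rho)

    h = np.zeros(int(np.ceil(duration * fs)) + 1)
    np.add.at(h, np.round(t * fs).astype(int), amps)
    return h, t
```

Capping the density makes the cumulative count linear after the cap time, which keeps the inverse mapping in closed form.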

The echo pattern will have a spectrogram that is roughly uniform over frequency and time and may be processed to match a desired spectrogram, in this case, by imprinting the reverberation time

T_{60} (f)

.

Note that while this Sabine-NED approach is expected to produce impulse responses having the overall character of the modeled space, it does not model any particular source or listener position and does not yield accurate early reflections.

4. Evaluation and Results

To compare the characteristics of the synthesized IRs to the measured IRs, an objective evaluation of the spectral and time-domain characteristics of the IRs was conducted in addition to a subjective listening evaluation.

4.1. Objective Evaluation

As mentioned in Section 2, two methods to measure the room’s acoustics were used: a 10-second sine sweep and a balloon pop, both conducted at multiple source and receiver positions. After a detailed comparison, the sine sweep method was chosen as the reference for evaluating the accuracy of synthesized impulse responses.

Objective Measures

For the comparative study, six key objective measures were employed to compare the synthesized and measured IRs:

Spectrogram: To illustrate the frequency characteristics of the IRs, the spectrograms were calculated as follows: First, the responses were re-centered by aligning to the onset of the IR and then normalized. The short-time Fourier Transform (STFT) of each time signal was then calculated with 50 percent overlap, and the magnitude was converted to dB using $20 \cdot {log}_{10} (\frac{| X |}{max (| X |)} + ε)$ . Both axes of time and frequency are logarithmically scaled. On a logarithmic scale, early reflections can be more easily seen, and it is close to the exponential shape of the reverb decay. Measured spectrograms are shown in Figure 3.
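A minimal sketch of the spectrogram computation described above, assuming SciPy is available. The function name, the window length `nperseg`, the value of ε, and the use of the largest absolute sample as the onset are assumptions made for illustration; the logarithmic axis scaling is left to the plotting stage.

```python
import numpy as np
from scipy import signal

def normalized_spectrogram_db(ir, fs, nperseg=1024, eps=1e-6):
    """Onset-aligned, normalized STFT magnitude in dB with 50 percent overlap."""
    ir = np.asarray(ir, dtype=float)
    onset = int(np.argmax(np.abs(ir)))       # assumption: onset = largest peak
    x = ir[onset:]
    x = x / np.max(np.abs(x))                # normalize the time signal
    f, t, X = signal.stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)
    mag = np.abs(X)
    # 20*log10(|X| / max|X| + eps), as in the expression in the text
    return f, t, 20.0 * np.log10(mag / mag.max() + eps)
```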
Arrival Times: The timing of early reflections was validated using an image source model of a rectangular room. The model generates virtual sources by mirroring the speaker position across room boundaries and calculates their distances and directions relative to the microphone. Comparing the modeled arrival times to the measured IRs helped assess how well the synthesis methods preserved early reflection patterns. For a given reflection index $r_{i} \in [- N, N]$ , where N is the maximum reflection order along axis i, the position of the image source in dimension i is computed as follows:

$s_{i}^{'} (r_{i}) = {(- 1)}^{| r_{i} |} \cdot s_{i} + r_{i} \cdot d_{i}$

(12)

where
- $s_{i}$ is the original speaker coordinate along axis i;
- $d_{i}$ is the length of the room along axis i;
- $r_{i}$ is the integer reflection count along that axis (positive or negative).
For each image source $s^{'}$ and the microphone position m, the vector pointing from the microphone to the image source is

$v = s^{'} - m$

(13)

Equation (12) accounts for the alternating nature of image positions based on the parity of the reflection index. The complete set of image source positions is then generated by evaluating all combinations of $r_{i}$ across each spatial dimension (2D or 3D), forming a grid of mirrored sources.
For each image source, the distance (range) is the Euclidean norm of the vector v, where

$Range = ∥ s^{'} - m ∥ = \sqrt{{(x_{s}^{'} - x_{m})}^{2} + {(y_{s}^{'} - y_{m})}^{2} + {(z_{s}^{'} - z_{m})}^{2}}$

(14)

The direction of the image source as seen from the microphone is computed as a unit vector, where

$\hat{d} = \frac{s^{'} - m}{∥ s^{'} - m ∥}$

(15)

This lays the groundwork for computing arrival times that are shown with light gray colors in figures such as Figure 4b.
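Equations (12) to (15) can be sketched as below. This is an illustrative reading rather than the script used in the study: the function name is hypothetical, and the code assumes that coordinates are measured from the center of the room, which is the convention under which Equation (12) places the first-order images correctly. Arrival times follow by dividing each range by the speed of sound.

```python
import itertools
import numpy as np

def image_source_arrivals(src, mic, dims, max_index=1, c=343.0):
    """Arrival time, range, reflection indices and direction for each image.

    Assumes the coordinate origin is at the room center (walls at +/- d_i/2).
    """
    src, mic, dims = (np.asarray(x, dtype=float) for x in (src, mic, dims))
    arrivals = []
    indices = range(-max_index, max_index + 1)
    for r in itertools.product(indices, repeat=len(dims)):
        r_arr = np.array(r)
        image = (-1.0) ** np.abs(r_arr) * src + r_arr * dims    # Eq. (12)
        v = image - mic                                         # Eq. (13)
        rng = float(np.linalg.norm(v))                          # Eq. (14)
        direction = v / rng if rng > 0 else np.zeros_like(v)    # Eq. (15)
        arrivals.append((rng / c, rng, tuple(int(k) for k in r), direction))
    arrivals.sort(key=lambda item: item[0])
    return arrivals
```

The index tuple (0, 0, 0) corresponds to the direct path, so the first entry of the sorted list gives the direct-sound arrival time.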
Band Decay Times: Decay times in multiple frequency bands $T_{60_{b}}$ were measured to evaluate how energy dissipates across the spectrum. This involved filtering the IR into frequency bands, smoothing the energy envelope, estimating the noise floor, and fitting a decay slope. The center frequencies are defined as $f_{b} = 125 \cdot 2^{[- 0.5 : 0.1 : 7]}$ . This yields fractional octave bands spanning approximately 88 Hz to 16 kHz. The result was a decay curve showing how long it takes for energy to drop by 60 dB in each band.
The impulse response is first filtered with a fourth-order Butterworth bandpass filter,

${IR}_{b} (t) = h_{b} (t) * IR (t),$

(16)

where $h_{b} (t)$ is the bandpass filter for center frequency $f_{b}$ and ∗ denotes the convolution operation. Then the smoothed energy envelope, $E_{b} (t)$ , is computed for each band-filtered signal by a Hann window of duration $β = 100 ms$

$E_{b} (t) = \sqrt{({IR}_{b} {(t)}^{2}) * w (t)}$

(17)

where $w (t)$ is a normalized Hann window and ∗ denotes convolution. This is known as the response energy profile (REP), and it is the root-mean-square energy of the impulse response over a sliding window.
The last $η = 200 ms$ of the response was used to estimate the noise floor using a linear fit to the REP

$20 {log}_{10} E_{b} (t) \approx θ_{0} + θ_{1} t$

(18)

The estimated noise floor level $ν_{b}$ is then given by

$ν_{b} = θ_{0} + θ_{1} \cdot (\frac{η}{1000})$

(19)

The algorithm selects a fitting interval where the signal decays from $ν_{b} + Δ_{0}$ to $ν_{b} + Δ_{1}$ , with $Δ_{0} = - 5 dB$ and $Δ_{1} = + 5 dB$ . Within this range, a linear fit is computed as follows:

$20 {log}_{10} E_{b} (t) = α + β t$

(20)

The reverberation time for band b is then calculated from the slope $β$ as

$T_{60_{b}} (f_{b}) = - \frac{60}{β}$

(21)

The output curve demonstrates the required time for the energy to decay 60 dB in each frequency band. An example of a decay time curve is shown in Figure 4d.
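The band decay time procedure can be sketched roughly as follows, assuming SciPy. Several details are assumptions rather than the authors' choices: the band width (`bw_oct`), the Butterworth order argument, taking the noise floor as the mean REP level over the last 200 ms instead of a linear fit, and reading the fitting interval as running from 5 dB below the REP peak down to 5 dB above the noise floor. All function names are hypothetical.

```python
import numpy as np
from scipy import signal

def band_centers():
    """Center frequencies 125 * 2^[-0.5:0.1:7], about 88 Hz to 16 kHz."""
    return 125.0 * 2.0 ** (np.arange(-5, 71) / 10.0)

def band_decay_time(ir, fs, fc, bw_oct=1.0 / 3.0, win_ms=100.0,
                    tail_ms=200.0, d0=-5.0, d1=5.0):
    """Estimate T60 in one band from the response energy profile (REP)."""
    lo, hi = fc * 2.0 ** (-bw_oct / 2.0), fc * 2.0 ** (bw_oct / 2.0)
    # Assumption: order argument 4 (SciPy doubles it for band-pass designs).
    sos = signal.butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    band = signal.sosfilt(sos, np.asarray(ir, dtype=float))    # Eq. (16)

    w = signal.windows.hann(int(round(win_ms * 1e-3 * fs)))
    w /= w.sum()                                               # normalized window
    rep = np.sqrt(np.convolve(band**2, w, mode="same"))        # Eq. (17)
    rep_db = 20.0 * np.log10(rep + 1e-12)

    n_tail = int(round(tail_ms * 1e-3 * fs))
    noise_db = np.mean(rep_db[-n_tail:])       # simplified noise-floor estimate

    peak = int(np.argmax(rep_db))
    after_peak = rep_db[peak:]
    i0 = peak + int(np.argmax(after_peak <= rep_db[peak] + d0))
    i1 = peak + int(np.argmax(after_peak <= noise_db + d1))
    if i1 <= i0 + 1:
        raise ValueError("not enough dynamic range for a decay fit")

    t = np.arange(len(rep_db)) / fs
    slope, _ = np.polyfit(t[i0:i1], rep_db[i0:i1], 1)          # Eq. (20)
    return -60.0 / slope                                       # Eq. (21)
```

Calling `band_decay_time` once per entry of `band_centers()` would give a decay time curve over frequency, subject to the Nyquist limit of the sampling rate used.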
Power Spectra: To examine whether the synthesized IRs reproduced the spectral shape of the measured IRs, their power spectra were compared. The spectra were smoothed using a method based on Equivalent Rectangular Bandwidth (ERB) bands to reduce narrowband fluctuations and emphasize overall spectral balance. Smoothing makes it easier to compare wideband energy content and spectral characteristics. The smoothing method performs critical band smoothing on the magnitude response: linearly spaced frequency bins are first grouped into ERB bands, and the magnitude spectrum is averaged over all the bins in each band to obtain a smoothed magnitude spectrum.
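One possible form of the ERB-band smoothing is sketched below. The text does not specify the exact band layout, so this sketch assumes the Glasberg and Moore ERB-number scale, 21.4 log10(0.00437 f + 1), and groups FFT bins by the integer part of their ERB number; the function name and the `bands_per_erb` parameter are hypothetical.

```python
import numpy as np

def erb_smoothed_magnitude(ir, fs, bands_per_erb=1.0):
    """Average the magnitude spectrum over ERB-spaced groups of FFT bins."""
    mag = np.abs(np.fft.rfft(ir))
    freqs = np.fft.rfftfreq(len(ir), d=1.0 / fs)
    erb_number = 21.4 * np.log10(4.37e-3 * freqs + 1.0)   # assumed ERB scale
    band = np.floor(erb_number * bands_per_erb).astype(int)

    n_bands = band.max() + 1
    counts = np.bincount(band, minlength=n_bands)
    keep = counts > 0                                     # skip empty bands
    mag_sum = np.bincount(band, weights=mag, minlength=n_bands)
    freq_sum = np.bincount(band, weights=freqs, minlength=n_bands)
    centers = freq_sum[keep] / counts[keep]               # mean frequency per band
    smoothed = mag_sum[keep] / counts[keep]               # mean magnitude per band
    return centers, smoothed
```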
Echogram and Normalized Echo Density: The normalized echo density (NED) was analyzed to evaluate how reflections accumulate over time. A theoretical model predicts how echo density evolves in a room, starting from a sparse distribution and growing toward a fully dense reverberant field. This model was compared to the measured sweep response using a moving window analysis, revealing how closely the measurement matched theoretical expectations. The theoretical model of a shoebox-shaped room produces a smooth curve that begins at zero (indicating no reflections) and asymptotically approaches one (a fully diffuse echo field). This behavior is governed by the room’s volume and the temporal resolution of the analysis.
The normalized echo density is modeled by the following equation [8]:

$NED (t) = \frac{t^{2}}{t^{2} + β}$

(22)

where

$β = \frac{1}{ρ}, ρ = \frac{4 π}{V} \cdot c^{3} \cdot d t$

(23)

Here, t represents time in seconds; $β$ is a scaling constant that controls how quickly the echo density increases; c is the speed of sound in air (343 m/s); V is the volume of the room (computed from its dimensions); and $d t$ is the temporal smoothing or resolution parameter in seconds.
The function $NED (t)$ captures the progression from early sparse reflections to a statistically dense reverberant field. As shown in Figure 4c, the echo density derived from the measured sweep response aligns closely with the modeled curve, supporting the accuracy and reliability of the sweep-based measurement. The normalized echo density profile of the initial four-channel sine sweep IR shows the number and distribution of reflection arrivals at the receiver using a 20 ms sliding time window. The following equations are used to compute the NED of an RIR over a moving window of time, using either the windowed energy or a count of samples above a threshold:

$E_{k} = \sum_{n = k}^{k + W - 1} {|IR [n]|}^{2}$

(24)

$NED (t_{k}) = \frac{E_{k}}{max_{k} E_{k}}$

(25)

$NED (t_{k}) = \frac{1}{N_{\max}} \sum_{n = k}^{k + W - 1} [|IR [n]| > θ]$

(26)

Here, $IR [n]$ represents the discrete-time impulse response, and W is the analysis window length in samples. The term $E_{k}$ denotes the energy computed over a running window starting at sample index k. The threshold $θ$ is used in the count-based method to identify significant reflections, and $N_{\max}$ is the maximum number of samples exceeding the threshold in any window; it is used to normalize the count of echoes. The computed decay times shown in Figure 4d will be used in the comparative study as a baseline.
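Equations (22) to (26) can be written out directly, as in the sketch below. The function names are hypothetical, and the choice of threshold θ and window length W is left to the caller because the text does not fix them.

```python
import numpy as np

def ned_model(t, volume, dt, c=343.0):
    """Theoretical NED curve, Eqs. (22) and (23)."""
    rho = 4.0 * np.pi / volume * c**3 * dt
    beta = 1.0 / rho
    t = np.asarray(t, dtype=float)
    return t**2 / (t**2 + beta)

def ned_energy(ir, win):
    """Energy-based profile, Eqs. (24) and (25); index k is the window start."""
    e = np.convolve(np.abs(ir) ** 2, np.ones(win), mode="valid")
    return e / e.max()

def ned_count(ir, win, theta):
    """Count-based profile, Eq. (26): samples above theta per window."""
    above = (np.abs(ir) > theta).astype(float)
    counts = np.convolve(above, np.ones(win), mode="valid")
    n_max = counts.max()
    return counts / n_max if n_max > 0 else counts
```

With a 20 ms window, `win` would be `int(0.02 * fs)` samples.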

The sine sweep method derives the impulse response via deconvolution of an exponential sine sweep [55]. After computing the frequency responses of both the recorded signal

R (f)

and the original sweep

S (f)

, the inverse FFT is used to obtain the impulse response, where

IR (t) = F^{- 1} \{\frac{R (f)}{S (f)}\}

(27)
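Equation (27) can be sketched as a frequency-domain division. The small regularization term below is an assumption added to keep the division stable where the sweep has little energy; it is not described in the text, and the function name and parameters are hypothetical.

```python
import numpy as np

def deconvolve_sweep(recorded, sweep, n_ir, rel_eps=1e-8):
    """Estimate IR(t) = F^-1{R(f) / S(f)} with light regularization."""
    n_fft = max(len(recorded), len(sweep))
    R = np.fft.rfft(recorded, n_fft)
    S = np.fft.rfft(sweep, n_fft)
    power = np.abs(S) ** 2
    # R / S rewritten as R * conj(S) / |S|^2, with a small floor on |S|^2.
    H = R * np.conj(S) / (power + rel_eps * power.max())
    return np.fft.irfft(H, n_fft)[:n_ir]
```

As a usage example under these assumptions, convolving a synthetic exponential sweep (for instance from `scipy.signal.chirp` with `method="logarithmic"`) with a known two-tap response and passing the result to `deconvolve_sweep` should return the two taps at their original delays.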

Since most of the synthesized impulse responses were monophonic, the mean across the four-channel A-format sine sweeps was calculated to generate a usable approximation of a nondirectional or omnidirectional IR. Figure 3 shows the four-channel response of the tetrahedral microphone and the averaged omnidirectional signal at source position I and receiver position H.

To validate the arrival times of early reflections, the recorded sine sweep response was compared with the arrival times computed using a script based on the Image Method for a shoebox-shaped room. The overlay revealed that the early reflection pattern of the sweep IR aligned well with the predictions from the image method. The light gray vertical lines in Figure 4b indicate the arrival times modeled. The spectrogram in Figure 4a shows a high signal-to-noise ratio and a full-bandwidth energy distribution.

Three important acoustic metrics—clarity (C80), definition (D50), and temporal center of mass (ts)—for impulse responses for different simulation methods are shown in Table 3, and they are compared to a reference measurement.

Clarity (C80) is the ratio of early (0–80 ms) to late-arriving sound energy. Larger values reflect better perception of transient events. The SDN stands out with a high positive C80 value (2.59 dB), demonstrating a very high early-to-late energy ratio and thus very good clarity. This is because the SDN has a drop in energy after the first-order reflections. In contrast, the modal method and the ray-tracing approach result in lower clarity, which indicates longer or smeared reflections.

Definition (D50) quantifies the percentage of total energy that arrives within the first 50 ms of the impulse response. The SDN method yields the highest value at 56.09%, indicating a highly intelligible response. In contrast, the modal method produces the lowest value at 5.71%, suggesting a highly reverberant or “muddy” impulse response. This is consistent with its late center of mass (ts).

The center of mass (ts) is the energy-weighted mean arrival time of the impulse response. The SDN has the earliest center (82.06 ms), which indicates that its energy is concentrated early and is consistent with its high C80 and D50 values. In comparison, the modal and balloon methods have much later energy centroids (> 370 ms), which is more representative of a diffuse or sustained sound field.
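The three metrics can be computed from an impulse response as in the sketch below. The function name is hypothetical, and taking the largest absolute sample as the start of the response is an assumption; the windows of 80 ms and 50 ms follow the definitions above.

```python
import numpy as np

def energy_metrics(ir, fs):
    """Return (C80 in dB, D50 in percent, center time ts in seconds)."""
    ir = np.asarray(ir, dtype=float)
    onset = int(np.argmax(np.abs(ir)))     # assumption: onset = largest peak
    e = ir[onset:] ** 2                    # energy from the onset onward
    t = np.arange(e.size) / fs
    n50 = int(round(0.050 * fs))
    n80 = int(round(0.080 * fs))
    total = e.sum()
    c80 = 10.0 * np.log10(e[:n80].sum() / e[n80:].sum())   # early / late energy
    d50 = 100.0 * e[:n50].sum() / total                    # share in first 50 ms
    ts = (t * e).sum() / total                             # energy-weighted mean time
    return c80, d50, ts
```

For an ideal exponential energy decay with time constant τ, these reduce to C80 = 10 log10(e^(0.08/τ) − 1), D50 = 100 (1 − e^(−0.05/τ)), and ts ≈ τ, which gives a quick sanity check.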

The similarity of FEM-IM and the image method to the measured IRs in all three metrics suggests that both reproduce the measured temporal energy distribution well. Sabine-NED is also quite close, with only small differences in timing and clarity. The balloon response diverges significantly from the measured response, especially in terms of the definition and center of mass, which also matches observations of the non-uniform response spectrum and response function.

Table 3 demonstrates the qualitative and perceptual differences described above among the simulation methods, as well as their respective successes and failures in reproducing the measured acoustical behavior.

4.2. Psychoacoustic Evaluation

4.2.1. Listening Test Methods

We elected to use a MUSHRA [56] listening test to evaluate the subjective similarity of the IRs. In the MUSHRA evaluation, participants are given several short audio clips to compare, with one labeled as the “reference” and all others unlabeled. The MUSHRA reference was selected to be the sine sweep recording IR. The anchor was a 1 kHz low-pass filtered version of the stimuli convolved with an IR taken from a very large space. The selected IR was from a 3500 m³ nuclear reactor hall from openairlib.net (https://www.openair.hosted.york.ac.uk/?page_id=626, accessed on 3 July 2022).

Two anechoic stimuli were chosen, namely a 5-second anechoic recording of female English speech and a 5-second recording of a mid-tempo electronic drum pattern, both obtained from openairlib.net. Each of the stimuli was then convolved with each of the 9 IRs, namely the sine sweep reference, anchor, balloon pop recorded IR, image method IR, ray trace hybrid IR, modal IR, FEM IR, SDN IR, and Sabine-NED IR. To remove loudness as a confounding factor, the resultant stimuli were then normalized to −20 LUFS using the pyloudnorm 0.1.1 library for Python 3.6 [57] before being presented in the listening test.

In this particular evaluation, participants were asked to rate on a scale of 1–100 how similar each of the unlabeled clips is to the labeled reference. Six unlabeled clips were present in each MUSHRA trial: the hidden reference (sine sweep recording), the anchor (1 kHz low pass and large IR), the balloon pop IR, and a random subset of three of the six synthesized IRs, with the three IRs not selected appearing in the next MUSHRA trial. This resulted in two IR subgroups.

Three listener–source location pairs were selected from the locations outlined in Figure 2 to best represent various listening situations. Locations were classified either as on-axis, where the point might experience nodal or anti-nodal behavior, i.e., located at either

\frac{1}{2}, \frac{1}{3},

or

\frac{1}{4}

of the dimensions of the room, or as off-axis, where the point is not located at an on-axis point. The selected locations were On–On (Source C and Listener A), where both the listener and source were on-axis; On–Off (Source C and Listener K), where only the source was at an on-axis location; and Off–Off (Source I and Listener J), where both the source and the listener were at off-axis locations.

With three position configurations (On–On, On–Off, and Off–Off), two stimulus types (speech and drums), and two IR subgroups, the experiment consisted of 3 × 2 × 2 = 12 MUSHRA trials in total.

A total of 29 participants were recruited via department listservs to complete the listening test. Participants were compensated with a USD 15 electronic gift card at the conclusion of the study. All 29 participants reported having no hearing impairments, with 51% self-identifying as female and 49% as male. Participants had a mean age of 30 ± 7.4 years, with 17% having no experience playing a musical instrument, 14% having 1–3 years of experience, 10% having 3–5 years of experience, and 59% having 5 or more years of experience. Participants were also asked how many years of experience they had receiving instruction or practicing technical listening skills, such as those taught in a studio recording or mastering class. A total of 55% reported no technical listening experience, 14% 1–3 years, 3% 3–5 years, and 28% 5+ years of technical listening experience.

Participants were excluded from the analysis if they completed more than 10% of the MUSHRA trials in less than 10 s and/or if they rated the hidden reference as more similar to the explicit reference in more than 10% of the trials. A total of 8 participants were excluded according to these criteria, resulting in 21 participants’ data being used for further analysis.

4.2.2. Listening Test Results

We performed an ANOVA to estimate the influence of the stimulus type (speech or drums), position (on–on, on–off, or off–off), and the method of IR synthesis or recording on participant ratings of similarity to the sine sweep IR reference. As summarized in Table 4, both the stimulus type and the IR method had statistically significant effects on MUSHRA ratings, whereas position did not. Notably, the voice stimulus was band-limited (approx. 150 Hz–12 kHz) in comparison to the drum stimulus (approx. 40 Hz–20 kHz).

Comparing MUSHRA ratings, as shown in Figure 5, three distinct groups of IR types are apparent. Image method and Sabine-NED IRs were not statistically different from the reference IR recording. They were followed by the FEM and SDN IRs, which were statistically different from the reference but by a smaller margin than the final group. Ray tracing, balloon pop recording, and modal method IRs were not statistically different from the anchor and showed a large, statistically significant difference compared to the reference IR.

Exploring the significant difference between stimulus types observed in the ANOVA, it is evident that these differences depend on the IR method. As Figure 6 shows, FEM and modal IRs had a large (approximately 20%) and significant difference between stimulus types. Next, the image method, balloon pop, Sabine-NED, and anchor all had statistically significant, modestly sized differences that ranged from 7% to 11%. Finally, ray tracing, SDN, and the reference had no statistically significant difference between stimulus types. Comparing rating scores within each method by stimulus type, if there was a difference between stimulus types, speech always scored higher than drums. Averaged across all methods, speech had an average rating of

55.3 \pm 3.4

, whereas drums had a

46.4 \pm 3.2

average.

5. Discussion

5.1. Sine Sweep vs. Balloon Pop

Balloon pops and sine sweeps are both common methods for impulse response (IR) measurements, but sine sweeps are generally more reliable and accurate. Balloon pops emit sound with relatively uniform spatial radiation and require no equipment, but their acoustic quality is inconsistent: some pops are better than others, and even the better ones vary from pop to pop in inflation pressure, the point at which the balloon is popped, how it is popped, equalization, and radiation pattern. Furthermore, balloon pops tend to be band-limited, which leads to wider pulses and shorter perceived mixing times, as can be seen from Figure 7, Figure 8 and Figure 9. These findings are reflected in the listening test results, where the balloon pop performed as poorly as the anchor due to its limited bandwidth.

Balloons are also not perfectly spherical, which creates non-uniform directionality and affects the equalization as a function of radiation direction. The time-limited duration of a balloon pop leads to zeros in the balloon pop spectrum, while the asymmetric nature of inflated balloons means that the spectral zeros are at different frequencies for different radiation directions. As a result, it is not possible to correct a balloon pop response by applying an equalization procedure, and significant variance between balloon pop impulses will exist.

The balloon pop’s loud, gunshot-like sound can be distressing or unsuitable in certain environments. By contrast, sine sweep recordings using speakers produce near full-bandwidth excitation, making them more robust to background noise and yielding more accurate IRs. They are also easier to repeat and less disruptive. However, the speaker must be carefully chosen to avoid excessive directivity; it is advisable to characterize its frequency response and radiation pattern beforehand. Even commercially available Bluetooth battery-powered speakers, such as the Bose SoundLink, can be adequate.

Although a balloon pop is an impulsive, broadband event that produces a sharp burst of energy across the audio band, the practical outcomes of using balloons to generate RIRs can vary. In summary, there are various factors that lead to dependencies and inconsistencies in RIRs generated through a balloon pop. For example, the diameter of the fully inflated balloon will govern the duration of the balloon pop, and therefore limit the effective bandwidth of the balloon pop. However, the relatively long duration of the balloon pop pulse (roughly a millisecond or so) means that a lower density of echoes is needed to achieve a perceivably noise-like late field in the normalized echo density (NED) profile. Therefore, this noise-like late field occurs earlier in the balloon pop response than in the loudspeaker-measured impulse response, as the loudspeaker response has a main pulse width of about a tenth of the duration of the balloon pop.

5.2. Comparison Between Methods

In this section, the performance of the different methods evaluated in this paper is compared based on listening test results and objective metrics. The computational complexity vs. accuracy trade-off of each method is shown in Table 5. Scalability with increasing geometric complexity and availability of the methods (in terms of cost and ease of implementation) are shown in Table 6. Links to paid/open-source software the reader can use to test the methods are provided. Ray tracing and the FEM are highly scalable with increasing complexity in geometries, since the underlying computing principles remain unchanged. IM is less scalable because visibility trees have to be computed for each virtual image source. The other methods are currently not scalable as they assume a strictly shoebox room. For example, in a spherical room, the eigenfunctions in the modal method (3) would differ from those of a shoebox room, since they change with room shape. The spherical spreading assumption used in the Sabine-NED method will break in a room with furniture or other architectural details, and the SDN will no longer yield accurate early reflections.

In the sections below, the outcome of the objective and subjective evaluations with the different impulse response generation methods is described.

5.2.1. Best Performing—Image Method and Sabine-NED Method

Objective analysis and listening test results corroborate that the image method (IM) and the Sabine-NED method perform closest to the reference. Chirp features are present in both measured RIR and IM, but the effect is amplified in IM IRs. These are not present in the Sabine-NED IR. However, these results are likely to only occur in simple geometries as the Sabine-NED method is unlikely to accurately simulate irregular geometries and absorption profiles.

The objective measures align well with those of the reference except for the following (which do not affect the perceptual results):

Mismatch in the low-frequency power spectrum of IM up to 100 Hz compared to reference, as seen in Figure 10a.
Misalignment of early reflections in the Sabine-NED method compared to the reference, as seen in Figure 11c, as it is agnostic to the source–receiver location. It is likely that a spatialized stimulus would have exposed the mismatch in time of arrivals in the Sabine-NED method.
Slower rise in NED of both methods compared to the reference, as seen in Figure 12a,c.

5.2.2. Worst Performing—Modal Method and Ray Tracing

Although the modal method has the closest match in NED to the reference, as seen in Figure 12d, it is band-limited to 8 kHz, as seen in Figure 10d, and also shows poor

T_{60}

fits to the reference. The peak at 10 kHz in Figure 13d is spurious and caused by the noise floor, but the T₆₀s in the low and mid-frequencies also do not align well with the reference. Including more high-frequency modes in the synthesis would have resulted in a better bandwidth.

Ray tracing shows multi-stage decay, with early decay that is a close match to the reverberation time of the reference RIR and late decay that is relatively constant at the longest decay time present across frequency, which is a poor match to the reference decay time, as seen in Figure 13a. This seems to result from ODEON switching from IM/ray tracing for the early reflections to using ray radiosity to simulate late reverberations. This is likely to have affected the listening test results. It also shows a mismatch in the power spectrum compared to the reference, as evident in Figure 10a. It has a much faster rise in NED compared to the reference in Figure 12a due to the dense diffuse arrivals, as seen in Figure 11a. Additionally, the hybrid ray-tracing methods used in this study are unable to run in real time. Other implementations of ray-tracing algorithms are needed for real-time contexts.

5.2.3. Other Methods—SDN and FEM-Hybrid

The FEM-IM hybrid method and SDN perform better than modal and ray tracing and worse than the IM and Sabine-NED methods in the listening test.

The SDN shows a good match to the power spectra (Figure 10b). However, because of too much air absorption applied in the implementation, the method has a much shorter reverberation time compared to the reference. Unfortunately, the error was only discovered after the listening test had been completed. Had the error not been present, the reverberation times would have aligned well between the measured and SDN simulations, much like the IM and Sabine-NED simulations.

The FEM-IM hybrid method shows a good match to the reference in terms of the early reflection pattern (Figure 11d) and

T_{60}

(Figure 13d) and has an NED similar to IM (Figure 12d). It matches the power spectra well in the lower frequencies up to 200 Hz (the Schroeder frequency was at 220 Hz) but shows a reduction of 5 dB beyond 200 Hz, compared to the reference, as seen in Figure 10d. Up to 220 Hz, the FEM-simulated IR is used, and beyond that the IM IR is stitched in. The EQ mismatch is likely caused by the method used to combine both the IRs. This EQ mismatch is reflected in the listening test results. It is to be noted that the FEM failed to simulate accurate IRs for source position B (middle of the room). This is because the source coincides with a pressure node for several modal frequencies. This means that the source is not exciting those modes, and the transfer function is zero at those frequencies. Additionally, it is noted that the FEM does not perform accurately in higher frequency ranges.
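The pressure-node argument can be illustrated numerically. The sketch below assumes the standard rigid-wall shoebox eigenfunctions, a product of cosines cos(n_i π x_i / L_i) with the origin at a room corner; this form and the room dimensions used in the example are assumptions for illustration, not values taken from the study.

```python
import numpy as np

def mode_shape(pos, dims, n):
    """Rigid-wall shoebox mode shape at `pos` for mode indices `n`.

    Assumes the origin at a room corner and cosine eigenfunctions.
    """
    pos, dims, n = (np.asarray(a, dtype=float) for a in (pos, dims, n))
    return float(np.prod(np.cos(n * np.pi * pos / dims)))

# Hypothetical room dimensions in meters, source at the room center.
dims = (6.0, 12.0, 6.0)
center = tuple(d / 2.0 for d in dims)
odd_mode = mode_shape(center, dims, (1, 0, 0))    # node: value is ~0
even_mode = mode_shape(center, dims, (2, 0, 0))   # antinode: magnitude 1
```

Any mode with an odd index along at least one axis vanishes at the room center, so a source placed there cannot excite it.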

5.3. Stimulus Type: Speech vs. Drums

Across different stimulus types, speech consistently scored higher than drums in perceptual evaluations. This is because drum signals are highly transient, making them more sensitive to smearing effects, especially when impulse responses are not well localized in time. The impact of stimulus type varied significantly depending on the IR model: FEM and modal IRs showed a large difference in performance between speech and drums. The band-limited nature of modal IRs is much more noticeable with drums, which have a bandwidth extending to 12 kHz (crash cymbals). In comparison, speech is band-limited to 4 kHz. The FEM performs worse on drums because the sweeping echoes, which are more prominent with transient signals like drums, are not well captured by the FEM-IM hybrid method, which also used random displacements in the microphone position. This helps reduce the sweeping echo effect in the image method but eliminates it altogether in the FEM simulation.

On the other hand, the image method, the balloon pop, Sabine-NED, and the anchor condition all exhibited statistically significant but more modest differences ranging from 7–11%. Interestingly, these variations do not appear to be correlated with overall model performance—for example, the FEM and SDN scored similarly overall, yet only the FEM exhibited a noticeable stimulus-dependent effect.

6. Conclusions

This paper presented an introductory overview of RIR synthesis techniques and objective comparison metrics, as well as the results of both objective and perceptual comparisons of four categories of RIR synthesis techniques, namely wave-based methods (hybrid FEM and modal analysis), geometrical acoustics methods (the image method and ray tracing), delay-network reverberators (SDNs), and statistical methods (Sabine-NED). We compared these synthesis methods against manual RIR measurement techniques (sine sweep and balloon pop) in the context of a simple shoebox-shaped room. These comparisons illustrate that modal and ray tracing RIRs were perceptually the least similar to the sine sweep RIR, and this was likely due to the band-limited frequency response of the modal method and a poorly matched late reverberation profile of ray tracing in this simple geometry. Additionally, for non-spatialized perceptual evaluations, the image method and the Sabine-NED method were rated as most similar to the sine sweep RIRs, despite having mismatched low-frequency power spectra (image method) and misalignment of early reflections (Sabine-NED). Our work highlights the many advantages of sine sweep RIR generation over balloon pops, such as improved reproducibility and a more even frequency response. Finally, the differences in perceptual ratings of different RIRs by listening test stimulus type are highlighted. Specifically, drum loops, with sharp transients and wider frequency content, are able to highlight differences in smearing effects and frequency response mismatches when compared to using speech as a stimulus in RIR listening test evaluations.

Future work includes the generation of impulse responses with a greater variety of synthesis methods as well as generating impulse responses at a greater number of virtual locations within the simulated space. More complex spaces can be used for this future work, which could also evaluate each synthesis method’s sensitivity to factors like the contents and temperature of the simulated spaces (e.g., an empty cathedral versus one full of people or an auditorium on a hot summer’s day versus a cold winter’s morning).

Author Contributions

Conceptualization, L.M. and J.S.A.; methodology, L.M., N.F., O.D. and J.S.A.; software, L.M., N.F., O.D. and J.S.A.; validation, L.M., N.F., O.D. and J.S.A.; formal analysis, L.M., N.F., O.D. and J.S.A.; investigation, L.M., N.F., O.D. and J.S.A.; resources, L.M. and J.S.A.; data curation, L.M.; writing—original draft preparation, L.M., N.F., O.D. and J.S.A.; writing—review and editing, L.M., N.F., O.D. and J.S.A.; visualization, L.M., N.F. and J.S.A.; supervision, J.S.A.; project administration, L.M.; funding acquisition, L.M. and J.S.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was conducted with support from a grant from the Templeton Religion Trust.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Stanford University Institutional Review Board (approval number #67488).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Recorded and generated impulse responses are available at https://zenodo.org/records/15739328, which was accessed on 25 August 2025.

Acknowledgments

We would like to thank Jonathan Berger for his advice and support, as well as Enzo De Sena for his advice.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Pictures of the racquetball court in which impulse responses were measured. (a) Image of the racquetball court taken from the door, facing the back wall. (b) Image of the racquetball court taken from the back wall, facing the door.

Figure 2. Position diagram of all positions used as source and/or listener locations. All values are given in meters. Evenly dotted blue lines represent divisions that would separate the room into even quarters, whereas the irregularly dotted orange lines divide the room into thirds.

Figure 3. Resultant impulse response of a sine sweep measurement with the source at position I and the receiver at position H. (Top) A-format impulse response of the summation of all four channels of the tetramic. (Bottom) Spectrograms of each of the individual channels of the tetramic.

Figure 4. Measurement of a sine sweep at source position I and receiver position H. (a) Spectrogram of the measured impulse response, showing a good signal-to-noise ratio and broadband content. (b) Modeled and measured arrival times of early reflections with an image source model. The thin black vertical lines represent the modeled arrival times, whereas the four colored lines represent the signal measured by each of the four channels. (c) Measured echogram and echo density normalized to 0–1, showing the transition from sparse to dense reverberation. The black line represents the averaged modeled response, whereas each of the four colored lines represents an individual channel of the IR measurement. (d) Frequency-dependent decay times ($T_{60_{b}}$) characterizing energy decay across frequency bands.

Figure 5. A violin plot showing the distributions of MUSHRA ratings by IR type. All IR types grouped together by a solid black bar had no significant difference, while groups connected by a grey bar are all statistically significantly different from each other. The significance level is denoted as follows: * indicates $p < 0.05$, and ** indicates $p < 0.01$.

Figure 6. Comparison of the difference in mean ratings between stimulus types (speech μ − drums μ) by IR type. The significance level is denoted as follows: ** indicates $p < 0.01$, and *** indicates $p < 0.001$.

Figure 7. Comparison of spectrograms from two different measurement methods at source position B and receiver position A. (a) The 10 s sine sweep measurement shows clear broadband content and early reflections. (b) The balloon pop measurement is band-limited with a shorter energy distribution. The sweep method exhibits more continuous energy across frequency bands.

Figure 8. Comparison of power spectra between sine sweep and balloon pop IRs. (a) Source B–Receiver A: The sine sweep shows a consistent spectral shape, while the balloon pop exhibits high-frequency loss. (b) Source B–Receiver K: A similar trend with the balloon pop showing greater spectral inconsistencies.

Figure 9. Echogram (top) and Normalized Echo Density (bottom) comparison between sine sweep and balloon pop IRs. (a) Source B–Receiver A. (b) Source B–Receiver K. The balloon pop shows faster NED rise but a less stable tail, indicating premature echo field saturation. The sweep data align more closely with theoretical echo density evolution.
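As a rough illustration of the normalized echo density (NED) plotted in these figures, the sketch below uses the common definition of the measure: within a sliding window, count the fraction of samples whose magnitude exceeds the window's standard deviation, then divide by the fraction expected for Gaussian noise, erfc(1/√2). The function name, the unweighted (rectangular) window, and the 20 ms window length are assumptions made for this example and are not taken from the paper.

```python
import math

# Expected fraction of Gaussian samples lying more than one standard
# deviation from the mean, erfc(1/sqrt(2)) ≈ 0.3173.
ERFC_NORM = math.erfc(1.0 / math.sqrt(2.0))


def normalized_echo_density(ir, fs, window_ms=20.0):
    """Return one NED value per sample of the impulse response `ir`.

    Assumptions (illustrative only): rectangular window, zero-mean signal,
    and a plain O(N * W) loop rather than an optimized implementation.
    """
    half = max(1, int(round(window_ms * 1e-3 * fs / 2)))
    ned = []
    for n in range(len(ir)):
        frame = ir[max(0, n - half): n + half + 1]
        # Standard deviation of the frame, assuming zero mean.
        sigma = math.sqrt(sum(x * x for x in frame) / len(frame))
        # Fraction of samples outside one standard deviation.
        outliers = sum(1 for x in frame if abs(x) > sigma)
        ned.append(outliers / len(frame) / ERFC_NORM)
    return ned
```

With this definition, a window containing a few isolated reflections yields values near 0, while a fully dense, noise-like tail yields values near 1, which is the sparse-to-dense transition the NED curves are meant to show.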

Figure 10. Power spectrum comparison between measured and synthesized IRs (Source C–Receiver A). (a) Ray tracing overestimates low–medium frequencies. (b) The SDN closely matches the reference except for air absorption roll-off. (c) Sabine-NED shows balanced energy but lacks spatial cues. (d) FEM-hybrid aligns well under 200 Hz but drops off above 200 Hz due to blending artifacts.

Figure 11. Arrival time comparisons for synthesized IRs (Source C–Receiver A). (a) Ray tracing shows dense but smeared early reflections. (b) The delay network shows sparse and delayed early arrivals. (c) The Sabine-NED method lacks spatial accuracy and early structure. (d) Hybrid FEM matches early reflection patterns closely with the measured response.

Figure 12. Echogram and normalized echo density (NED) of synthesized IRs (Source C–Receiver A). (a) Ray tracing shows faster NED rise, suggesting smeared energy. (b) SDN has slower NED buildup. (c) Sabine-NED gradually approaches a dense field. (d) FEM-hybrid closely follows the measured NED trend.

Figure 13. Frequency-dependent decay time curves ($T_{60_{b}}$) for four synthesis methods compared with the reference, from source C to receiver A. (a) Geometrical method (ray tracing): shows deviation in the late decay region. (b) Delay network model (SDN): underestimates reverberation time across all bands. (c) Statistical method (Sabine-NED): shows poor frequency resolution but perceptually plausible decay. (d) Wave-based simulation (FEM-hybrid): aligns well in low/mid frequencies but shows EQ mismatch at higher bands.
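For readers unfamiliar with how a decay time is read off an impulse response, the following is a minimal sketch of one standard approach: Schroeder backward integration to obtain the energy decay curve, followed by a straight-line fit extrapolated to 60 dB. It operates on a single (broadband or already band-filtered) signal; the function name and the −5 to −35 dB fitting range are assumptions for illustration, and the paper's own band-wise procedure may differ.

```python
import math


def schroeder_t60(ir, fs, fit_start_db=-5.0, fit_end_db=-35.0):
    """Estimate the 60 dB decay time (seconds) of impulse response `ir`.

    Assumptions (illustrative only): the fit range defaults to -5..-35 dB,
    and any band filtering has been applied before calling this function.
    """
    energy = [x * x for x in ir]
    # Schroeder backward integration: EDC[n] = sum of energy from n to the end.
    edc = [0.0] * len(energy)
    running = 0.0
    for n in range(len(energy) - 1, -1, -1):
        running += energy[n]
        edc[n] = running
    # Convert to dB relative to the total energy (trailing zeros are skipped).
    edc_db = [10.0 * math.log10(e / edc[0]) for e in edc if e > 0.0]
    # Keep only the points inside the fitting range.
    pts = [(n / fs, level) for n, level in enumerate(edc_db)
           if fit_end_db <= level <= fit_start_db]
    # Least-squares line through (time, level); the slope is in dB per second.
    mean_t = sum(t for t, _ in pts) / len(pts)
    mean_l = sum(l for _, l in pts) / len(pts)
    slope = (sum((t - mean_t) * (l - mean_l) for t, l in pts)
             / sum((t - mean_t) ** 2 for t, _ in pts))
    return -60.0 / slope
```

Applying the same function to each octave-band-filtered copy of an IR would give one decay time per band, which is the kind of curve the figure compares across methods.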

Figure 13. Frequency-dependent decay time curves (

T_{60_{b}}

) for four synthesis methods, from source C to receiver A. (a) Geometrical method. (b) Delay network model. (c) Statistical method. (d) Wave-based simulation.

T_{60_{b}}

(band decay) curves of synthesized IRs compared with the reference (Source C–Receiver A). (a) Ray tracing shows deviation in the late decay region. (b) The SDN underestimates reverberation time across all bands. (c) Sabine-NED shows poor frequency resolution but perceptually plausible decay. (d) FEM-hybrid aligns well in low/mid frequencies but shows EQ mismatch at higher bands.

Table 1. Absorption coefficients per frequency band.

Material Type	63 Hz	125 Hz	250 Hz	500 Hz	1000 Hz	2000 Hz	4000 Hz
Wooden Floor on Joist	0.15	0.15	0.11	0.10	0.07	0.06	0.07
Wall Material (Hard Wooden Board)	0.06	0.06	0.10	0.08	0.09	0.07	0.04
Large Panels of Heavy Glass	0.18	0.18	0.06	0.04	0.03	0.02	0.02
Ventilation Grille	0.60	0.60	0.60	0.60	0.60	0.60	0.60
Glass Window	0.35	0.35	0.25	0.18	0.12	0.07	0.04
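To show how absorption coefficients like those in Table 1 feed a statistical reverberation estimate, the sketch below applies Sabine's formula, RT60 = 0.161 V / Σ S_i α_i, per frequency band. The coefficients for the floor and wall materials are copied from Table 1 (the 250 Hz wall value is read as 0.10). The room dimensions, the assignment of the wall material to the ceiling, and the omission of the glass, grille, and air absorption are all hypothetical simplifications, so the resulting numbers are not the paper's.

```python
BANDS_HZ = [63, 125, 250, 500, 1000, 2000, 4000]

# Absorption coefficients from Table 1 (250 Hz wall value read as 0.10).
ALPHA = {
    "floor": [0.15, 0.15, 0.11, 0.10, 0.07, 0.06, 0.07],
    "walls": [0.06, 0.06, 0.10, 0.08, 0.09, 0.07, 0.04],
}


def sabine_rt60(volume_m3, areas_m2, alpha):
    """Return Sabine RT60 (seconds) per band: 0.161 * V / sum(S_i * alpha_i)."""
    n_bands = len(next(iter(alpha.values())))
    rt60 = []
    for b in range(n_bands):
        # Total equivalent absorption area in this band (metric sabins).
        absorption = sum(areas_m2[name] * alpha[name][b] for name in areas_m2)
        rt60.append(0.161 * volume_m3 / absorption)
    return rt60


# Hypothetical shoebox dimensions in meters (not taken from the paper).
length, width, height = 12.0, 6.0, 6.0
areas = {
    "floor": length * width,
    # Assumption: four walls plus the ceiling share the wall material.
    "walls": 2 * (length + width) * height + length * width,
}
rt60_per_band = sabine_rt60(length * width * height, areas, ALPHA)
for band, value in zip(BANDS_HZ, rt60_per_band):
    print(f"{band:>5} Hz: {value:.2f} s")
```

Keeping the materials in a dictionary keyed by surface name makes it straightforward to add further rows of Table 1 once the corresponding surface areas are known.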

Table 2. ODEON simulation settings.

Parameter	Value/Description
ODEON Version	17
Source Position (x, y, z)	Similar to the ground truth
Source Type	Omnidirectional
Receiver Position(s)	Similar to the ground truth
Receiver Type	Binaural
Speaker Directivity Pattern	Omni
Impulse Response Length	7600 ms
Number of Late Rays	16,000
Calculation Mode	Precision Mode (Odeon-suggested values)
Early Reflection Transition Order	2
Maximum Reflection Order	10,000
Impulse Resolution	1 ms
Number of Early Scatter Rays per Image Source	100
Calculation Method	Angular absorption only for soft materials
Screen Diffraction	Enabled
Scattering Method	Explicit scattering coefficients assigned to each material or surface
Reflection-Based Scatter	Enabled
Key Diffraction Frequency	707 Hz

Table 3. Acoustic Metrics for Simulated and Measured IRs.

Methods	C80 (dB)	D50 (%)	Center of Mass (ms)
FEM Hybrid	−3.05	23.74	298.55
Sabine-NED	−3.84	20.07	322.94
Modal	−9.27	5.71	446.43
Ray Tracing Hybrid	−5.88	11.90	364.54
Image Method	−3.36	22.35	312.95
SDN	2.59	56.09	82.06
Balloon Pop Measurement	−6.47	11.24	371.35
Sine Sweep Measurement	−4.03	18.60	283.23
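The three metrics in Table 3 have standard energy-ratio definitions: C80 is the early-to-late energy ratio split at 80 ms (in dB), D50 is the share of total energy arriving within the first 50 ms (in %), and the center of mass (center time) is the energy-weighted mean arrival time. The sketch below is one way to compute them from a single-channel impulse response. The function name and the choice of the largest-magnitude sample as time zero are assumptions for this example, so results need not match the table exactly.

```python
import math


def clarity_metrics(ir, fs):
    """Return (C80 in dB, D50 in %, center time in ms) for impulse response `ir`.

    Assumption (illustrative only): time zero is the direct-sound arrival,
    taken here as the sample with the largest magnitude.
    """
    onset = max(range(len(ir)), key=lambda n: abs(ir[n]))
    energy = [x * x for x in ir[onset:]]
    n50 = round(0.050 * fs)  # samples in the first 50 ms
    n80 = round(0.080 * fs)  # samples in the first 80 ms
    total = sum(energy)
    early80 = sum(energy[:n80])
    # C80: ratio of energy before 80 ms to energy after 80 ms, in dB.
    c80 = 10.0 * math.log10(early80 / (total - early80))
    # D50: fraction of total energy within the first 50 ms, in percent.
    d50 = 100.0 * sum(energy[:n50]) / total
    # Center time: first moment of the squared IR, converted to milliseconds.
    ts = 1000.0 * sum(n * e for n, e in enumerate(energy)) / (fs * total)
    return c80, d50, ts
```

A short, dry response concentrates energy early and so gives a high C80 and D50 with a small center time, which matches the pattern of the SDN row relative to the other rows of the table.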

Table 4. Results of an ANOVA measuring the influence of the stimulus type, position, and IR method on MUSHRA ratings.

Variable	F-Statistic	p-Value
Stimuli	30.1388	$4.71 \times 10^{- 8}$
Position	1.1860	0.3057
IR Method	367.4879	$3.4 \times 10^{- 10}$

Table 5. Summary of acoustic and computational considerations for the four virtual IR generation methods for shoebox rooms.

Method Type	Method Name	Computational Requirements	Accuracy
Geometrical acoustics	Ray tracing	Medium	High
Geometrical acoustics	Image source model	Low	High
Wave-based	FEM	Highest	Highest
Wave-based	Modal	High	High
Statistical	Sabine-NED	Low	Medium
Delay network	Scattering delay network	Lowest	Medium

Table 6. Summary of scalability and availability of the four virtual IR generation methods.

Method Type	Method Name	Scalability with Geometric Complexity	Availability
Geometrical acoustics	Ray tracing	Highly scalable	Paid: ODEON
Geometrical acoustics	Image source model	Moderately scalable (needs computation of visibility tree)	Free: Pyroomacoustics (https://pyroomacoustics.readthedocs.io/en/pypi-release/pyroomacoustics.room.html, accessed on 17 November 2024) for shoebox rooms; Free: DEISM (https://github.com/audiolabs/DEISM, accessed on 18 August 2025) for arbitrary geometry
Wave-based	FEM	Highly scalable	Free: FreeFEM++ (https://freefem.org/, accessed on 17 November 2024); requires GPU for fastest computation
Wave-based	Modal	Not scalable	Easy to implement
Statistical	Sabine-NED	Not scalable	Easy to implement
Delay network	Scattering delay network	Currently not scalable	Free: RealTime-SDN plugin (https://github.com/LIMUNIMI/Real-time-SDN, accessed on 23 August 2024)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

May, L.; Farzaneh, N.; Das, O.; Abel, J.S. Comparison of Impulse Response Generation Methods for a Simple Shoebox-Shaped Room. Acoustics 2025, 7, 56. https://doi.org/10.3390/acoustics7030056
