Perceptual Evaluation of Acoustic Level of Detail in Virtual Acoustic Environments

Fichna, Stefan; van de Par, Steven; Seeber, Bernhard U.; Ewert, Stephan D.

doi:10.3390/acoustics8010009

by Stefan Fichna ¹,*, Steven van de Par ², Bernhard U. Seeber ³ and Stephan D. Ewert ¹

¹ Medizinische Physik and Cluster of Excellence, Carl von Ossietzky University Oldenburg, 26129 Oldenburg, Germany

² Akustik and Cluster of Excellence Hearing4all, Carl von Ossietzky University Oldenburg, 26129 Oldenburg, Germany

³ Audio Information Processing, Technical University of Munich, 80333 Munich, Germany

* Author to whom correspondence should be addressed.

Acoustics 2026, 8(1), 9; https://doi.org/10.3390/acoustics8010009

Submission received: 31 October 2025 / Revised: 21 January 2026 / Accepted: 23 January 2026 / Published: 30 January 2026

Abstract

Virtual acoustics enables the creation and simulation of realistic and ecologically valid indoor environments vital for hearing research and audiology. For real-time applications, room acoustics simulation requires simplifications. However, the acoustic level of detail (ALOD) necessary to capture all perceptually relevant effects remains unclear. This study examines the impact of varying ALOD in simulations of three real environments: a living room with a coupled kitchen, a pub, and an underground station. ALOD was varied by generating different numbers of image sources for early reflections, or by excluding geometrical room details specific for each environment. Simulations were perceptually evaluated using headphones in comparison to measured, real binaural room impulse responses, or by using loudspeakers. The perceived overall difference, spatial audio quality differences, plausibility, speech intelligibility, and externalization were assessed. A transient pulse, an electric bass, and a speech token were used as stimuli. The results demonstrate that considerable reductions in acoustic level of detail are perceptually acceptable for communication-oriented scenarios. Speech intelligibility was robust across ALOD levels, whereas broadband transient stimuli revealed increased sensitivity to simplifications. High-ALOD simulations yielded plausibility and externalization ratings comparable to real-room recordings under both headphone and loudspeaker reproduction.

Keywords: room acoustics simulation; spatial audio quality; externalization; plausibility; speech intelligibility; virtual acoustics

1. Introduction

Room acoustics simulation and virtual acoustics are widely applied, ranging from generating highly controllable complex acoustic environments for hearing research and audiology [1,2,3,4] to entertainment and video game sound design [5,6], augmented reality (AR) or virtual reality (VR), as well as architectural planning [7]. Several room acoustics simulation approaches exist [8,9,10,11,12,13,14], each involving simplifications of the underlying acoustical processes. Many algorithms [10,11] simulate early reflections using the image source model (ISM) [15], combined with an alternative method to simulate late reverberation, such as ray tracing [16], feedback delay networks [17] extended to spatial rendering [11,13,18], scattering delay networks [19], and acoustic radiance transfer [14,20]. State-of-the-art methods achieve a high degree of perceptual plausibility [21,22], defined as “a simulation in agreement with the listener’s expectation towards a corresponding real event” according to Lindau and Weinzierl [23].

With regard to interactive real-time applications, it is of particular interest to reduce the computational demands of room acoustics simulation while maintaining perceptual plausibility and achieving a certain agreement in spatial audio quality and speech intelligibility (SI) between simulation and the respective real-life environment. For this, the ALOD of the simulated environment might be reduced if perceptual differences remain below established perceptual relevance thresholds reported in the literature. For example, changes in speech intelligibility on the order of approximately 1 dB in speech reception threshold are commonly regarded as perceptually meaningful, e.g., [24,25], whereas smaller deviations are often considered negligible. Similarly, regarding audio quality, studies suggest that broadband spectral deviations of about 1 dB or less typically remain below just-noticeable difference thresholds under many listening conditions, e.g., [26,27].

The term ALOD is inspired by level of detail (LOD) as commonly used in computer graphics, to denote typically distance-dependent reductions in visual geometric model complexity (polygon count) while maintaining the same perceptual impression. Prior studies have examined the influence of ALOD on acoustic parameters such as reverberation time and the speech transmission index [28]. Abd Jalil et al. [29] demonstrated that the number of surfaces in acoustic models can be reduced by up to 80% without compromising accuracy. Moreover, it has been shown that even when simplifying more complex room shapes to a proxy rectangular (“shoebox”) geometry [11,13,18], it is possible to maintain perceptual plausibility and achieve close alignment with measured reference conditions [21,22,30]. This level of simplification not only facilitates low-latency, interactive virtual acoustic environments for hearing research but also allows for the possibility of running realistic spatial rendering on mobile hardware, which is advantageous for fitting hearables and hearing support systems [31,32]. To ensure broad applicability of such simulations, it is important to evaluate perceptual outcomes using both headphone-based and loudspeaker-based reproduction methods. While headphone reproduction allows precise control over binaural cues, loudspeaker setups may be more suitable in scenarios where head tracking is unavailable or when simulating open-fit hearing devices. It has been shown [4] that shoebox simulations can yield good agreement with real-room references in terms of SI under both rendering approaches. However, it remains unclear whether certain simplifications in the acoustic model, such as omitting nearby reflective surfaces or coupled volumes, impact perception differently depending on the reproduction system.

Beyond room-acoustic simulation research, perceptual evaluation of complex acoustic environments has been extensively addressed in the field of soundscape research. Soundscape studies explicitly focus on perceptual, affective, and contextual dimensions of acoustic environments and commonly rely on structured listening experiments to assess attributes such as realism, comfort, spatial impression, and overall environmental quality [33,34,35]. Within this research area, indoor soundscapes and everyday listening situations have received increasing attention. For example, Torresin et al. [36] investigated indoor soundscapes under different ventilation conditions in residential buildings. Audio–visual interactions have also been shown to play a significant role in shaping soundscape perception and environmental satisfaction, as demonstrated by Jeon et al. [37] in an urban context.

Methodologically, a variety of headphone- and headset-based reproduction approaches have been employed in soundscape and immersive audio research to enable perceptual evaluation across different experimental settings. Fusaro et al. [38] compared laboratory-based and online soundscape assessment methods using headphone reproduction, demonstrating that perceptual soundscape characterization can be obtained across different study formats when listening conditions and playback devices are documented and appropriately accounted for. Related work has further shown that perceptual audio evaluation can be extended beyond strictly controlled laboratory environments using online or hybrid listening-test methodologies, allowing for larger and more diverse listener samples [39,40].

These soundscape-oriented approaches typically aim to capture global perceptual impressions and contextual responses in everyday audio-visual scenarios. In contrast, the present study adopts a complementary perspective by focusing on controlled manipulations of acoustic simulation parameters, enabling a systematic investigation of how specific reductions in ALOD affect perceptual quality and task-based performance under defined reproduction conditions.

Fichna et al. [41] investigated the perception of simulated and real acoustic scenes with varying ALOD using headphones. Their preliminary study examined SI in conditions with a masking signal and evaluated the overall perceived spatial audio quality using different ALOD in the acoustic simulation in comparison with recordings for a spectrally pink-colored pulse as a target signal. However, the study was based on only three participants (N = 3) and, therefore, served primarily as a proof-of-concept rather than allowing statistically robust or generalizable conclusions. Fichna et al. [42] extended this research by systematically evaluating plausibility, the overall perceived difference between the renderings, and the perceived externalization with headphones and loudspeakers. They used speech as well as a pink-colored pulse as target sounds. Their data showed that simulations with a maximum ALOD were as plausible as those generated using measured binaural room impulse responses (BRIRs), even with the transient pulse stimuli, which generally provide the highest sensitivity for perceptual differences in room acoustics simulation (see, e.g., [18]). Despite high plausibility ratings observed in [42], a considerable perceived overall difference between simulated and measured conditions remained. However, it remains unclear to which perceptual attributes these differences are related, and an overall picture including SI, plausibility, and spatial audio quality attributes is missing. Conclusions regarding SI are limited by the relatively small number of subjects in [41].

Recently, Martin et al. [43] investigated the implications of simplifying geometrical acoustics by reducing a highly resolved surface model of a real home environment to a shoebox representation. Using the CATT Acoustics software [44], which combines image-source and ray-tracing methods, their results suggested that geometric decimation had less impact on the perception of reverberation than reductions in the frequency resolution of absorption coefficients. In addition to plausibility and SI, perceptual dimensions, such as spatial audio quality and externalization, are essential for evaluating the realism of room-acoustic simulations. Spatial audio quality encompasses attributes like source distance, reverberation, and timbral coloration (see [45]), while externalization describes the extent to which sounds are perceived as located outside the listener’s head.

Taken together, previous studies demonstrate that considerable simplifications in acoustic geometry and simulation detail can be perceptually acceptable, particularly with regard to plausibility and SI. However, prior work has typically focused on individual environments, specific target signals, or isolated reproduction methods, limiting the ability to generalize findings across settings. Moreover, existing data are often based on separate listener groups and narrow perceptual attributes, making it difficult to assess how ALOD reductions affect different reproduction methods and room types in a unified framework.

The present study addresses this gap by systematically evaluating the perceptual consequences of ALOD across three virtual environments using both headphone- and loudspeaker-based reproduction within a single group of participants. Specifically, we extend the data collected in [42] by testing SI and spatial audio quality in the same group of listeners and conditions. Additionally, we extend the stimuli to include a music token for perceived overall difference, and jointly analyze newly recorded and existing data in the context of the resulting broad range of perceptual measures. A comprehensive repeated-measures analysis across environments, rendering conditions, stimuli, and reproduction methods provides a statistically grounded basis for assessing systematic dependencies and interactions. Integrating these methodological extensions into a coherent interpretative framework enables a more differentiated examination and discussion of the perceptual consequences of acoustic model simplifications than in [42]. The resulting study design allows for direct comparisons of ALOD effects across simulation setups and supports a better understanding of which simplifications can be tolerated without compromising perceptual validity.

Given that a reduction of ALOD may affect different perceptual dimensions in different ways, we employ a set of complementary perceptual measures to assess both functional and perceptual consequences of room-acoustic model simplifications: Speech intelligibility represents a task-based, functional measure that is particularly relevant for hearing research and applications involving communication in everyday acoustic environments. Plausibility was included as a global judgment reflecting whether a simulated scene meets listeners’ expectations of a corresponding real environment, while an overall perceived difference measure was employed as a more stringent test of perceptual fidelity, targeting authenticity in the sense of how closely a rendering matches a real-room reference rather than whether it merely appears plausible. To interpret potential overall differences in a more diagnostic manner, spatial audio perception was further assessed using defined sound quality items (e.g., timbral coloration, reverberation-related impressions, spatial clarity/width, and envelopment). Externalization was included as a key spatial percept that is known to be sensitive to inaccuracies in binaural and room-acoustic cues and is, therefore, particularly informative for evaluating spatial realism under different reproduction methods.

Three distinct and acoustically diverse environments were investigated: BRIRs recorded in a living room coupled with a kitchen, a pub, and an underground station [46] were used as references for headphone-based auralization. Additionally, to assess the reproduction method, perceptual assessments of spatial audio quality and externalization were conducted using a three-dimensional loudspeaker array in addition to headphone reproduction. The Room Acoustics SimulatoR (RAZR) [11,13] was employed to generate synthetic BRIRs and loudspeaker reproductions.

The role of ALOD has so far often been studied using decimation of the underlying geometric model, e.g., [29,43]. While this modulates several properties of the room acoustics simulation simultaneously, such as the number and spatial direction of early reflections, the temporal increase in echo density as well as spatial diffusion and reverberation time, the current approach aims to modulate specific, potentially important features of the simulated room impulse response in an isolated and controlled way. First, we start off with a considerably lower ALOD regarding the geometrical detail than the above studies, given that the underlying room geometry is a “proxy” rectangular (shoebox) room. With our highest ALOD model, we do not intend to model all geometrical details; however, the model has been fitted to match the frequency-dependent reverberation time and the long-term spectrum of the targeted real-world (B)RIR, important known perceptual features for comparing room acoustics. Moreover, the model has been demonstrated earlier, e.g., [18], to yield a realistic increase in echo density by incorporating sound scattering and natural-sounding (spatial) late reverberation. In order to reduce ALOD, we specifically vary selected room simulation features, such as the presence of scattering or dual-slope decays, or the inclusion of early reflections from nearby geometrical structures. Thus, ALOD was defined from a more perceptual perspective in the current study, rather than from a geometrical one defined by polygon count. The goal was to determine how ALOD can be reduced to the lowest perceptually acceptable level, depending on source stimulus and acoustic environment, starting from an already simplified and computationally efficient model. The highest ALOD condition included all available RAZR features, such as simplified effects of scattering and diffraction [18], while lower ALOD levels were created by successively omitting these components. 
Additionally, the effects of simulating nearby finite reflective surfaces, e.g., [47,48], and coupled volumes [13] were assessed, depending on the presence of that specific feature in the environment. This perceptually motivated definition of ALOD provides a systematic framework for assessing how different aspects of acoustic modeling complexity influence perceptual outcomes.

To motivate the choice of stimuli and perceptual tasks in this study, it is important to consider that different auditory attributes may vary in their sensitivity to changes in room-acoustic modeling fidelity. Previous research has shown that stimuli with rich spatial or spectral characteristics, such as music signals or transient noise bursts, tend to be more sensitive to rendering inaccuracies than typical speech signals, which are dominated by temporal envelope cues and linguistic redundancy [49,50]. This suggests that the perceptual detectability of reduced ALOD depends not only on the simulation method but also on the acoustic features of the source stimulus.

In addition, perceptual dimensions related to spatial quality, such as externalization and spatial clarity, were expected to place greater demands on acoustic modeling detail than task-based performance measures like speech intelligibility [23,51]. Speech, a transient pulse, and a spectrally rich musical stimulus are employed to probe stimulus-dependent sensitivity to ALOD, with the expectation that speech is comparatively robust, whereas broadband and transient signals reveal increased sensitivity to modeling simplifications. Together, this stimulus selection enables a differentiated investigation of how reductions in ALOD interact with source characteristics and perceptual outcome measures.

The study focuses on four main research topics:

(i) Effect of ALOD variation on SI. It is expected that SI will remain largely stable across different ALOD levels, as long as early reflections and diffuse reverberation are sufficiently represented [2,4,29,52]. However, excessive reductions in ALOD may degrade SI due to insufficient representation of early reflections that improve effective signal-to-noise ratio [53].

(ii) Effect of source stimulus on sensitivity to ALOD. Because different stimuli emphasize distinct acoustic cues, it is expected that transient and broadband signals (e.g., pulses) will reveal rendering differences more clearly than spectrally limited or speech-like signals, which may remain perceptually stable under reduced ALOD [28,54].

(iii) Contribution of specific spatial audio quality items to differences in perceived overall sound quality. Differences could primarily be influenced by spectral coloration, reverberation characteristics, and spatial coherence, cf. [55].

(iv) Effect of playback method (headphones vs. loudspeakers). It is expected that loudspeaker playback yields higher externalization ratings than headphone playback due to additional binaural and room cues, cf. [56,57]. However, in headphone-based simulations, externalization can still be influenced by factors such as the realism of the simulated room acoustics, including the number and spatial distribution of reflections, the balance between direct sound and early reflections, and the accuracy of spatial cues rendered through Head-Related Transfer Functions (HRTFs).

2. Methods

2.1. Participants

The experiment involved eight participants with normal hearing, aged between 21 and 33 years (mean 28.25 years, standard deviation 5.85 years). All participants had prior experience in psychoacoustic tests and received an hourly compensation. For SI tests, experience varied between none and multiple hours. Two of the participants can be considered expert listeners as defined by Kreiman et al. [58], either because they work as scientific researchers in audiology or because they have taken part in listening tests on a regular basis for at least 4 years.

A sensitivity power analysis was conducted using G*Power 3.1.9.7 for a repeated-measures ANOVA with two measurements (test–retest; within-subject factor), assuming α = 0.05 and power (1 − β) = 0.80. Given the available sample size (N = 8), the analysis assumed a minimum acceptable correlation among repeated measures of r = 0.80, reflecting a predefined quality criterion for test–retest reliability; measurements not meeting this criterion were repeated. Under these assumptions, the minimum detectable effect size was Cohen’s f = 0.37 (ε = 1). Accordingly, the present study was sufficiently powered to detect medium to large within-subject effects, whereas smaller effects may not have been detectable with the available sample size.

2.2. Acoustic Scenes

Three everyday acoustic scenes were generated using dummy head recordings from three real-world environments, as depicted in Figure 1 and further detailed in [46].

The first scene (left-hand column in Figure 1) was set in a furnished Living Room (the Living Room laboratory established at the University of Oldenburg). Its dimensions are 4.97 m × 3.78 m × 2.71 m, corresponding to a volume of 50.91 m³, with a reverberation time (T30) of 0.54 s. The Living Room is connected through an open door to a kitchen (4.97 m × 2.00 m × 2.71 m) with a volume of 26.94 m³ and a reverberation time of 0.66 s. In this scenario, the receiver (black circle in Figure 1) was seated on a sofa in the Living Room, while the target was located in the kitchen (green circle in Figure 1), oriented towards the open door between the rooms. For the SI measurement, a masker source was positioned on the right side of the participant (red circle in Figure 1). The direct line of sight to the target sound source was obstructed; the path length through the door was 5.7 m.

The second scene (middle column in Figure 1) was set in a Pub (OLs Brauhaus in Oldenburg), which consisted of a single large room with dimensions of 17.76 m × 10.2 m × 2.9 m, resulting in a volume of approximately 442 m³ and a reverberation time (T30) of 0.7 s. The receiver was positioned at a table, with the target located directly opposite, 0.97 m away (see center of Figure 1). There was a nearby chalkboard located to the left of the receiver position. In the second row of Figure 1 at the bottom, the chalkboard is located at the position of the black rectangle, symbolizing a support beam. For the SI measurement, a masking signal was positioned on the right side of the participant (red circle in Figure 1). Despite its large volume, the Pub had a relatively low reverberation time.

The third scene (right-hand side of Figure 1) was located in an Underground Station (Munich, Theresien Straße), consisting of a large, elongated space (containing the platform and two tracks) measuring approximately 120 m × 15.7 m × 4.16 m, with a total volume of about 11,000 m³ and a reverberation time (T30) of 1.6 s. Additional coupled volumes from underground tunnels and the space above the escalators contributed to a dual-slope decay in the late reverberation. The receiver was positioned near the middle of the platform, with the target signal located 6.37 m in front of the receiver. For the SI measurement, a masking signal was positioned on the right side of the participant (red circle in Figure 1).

An overview of the geometric and acoustic characteristics of all environments is provided in Table 1; detailed material definitions and impulse-response measurements are documented in [46].
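The reverberation times quoted for the three scenes are T30 values. As an illustration of how such a value is commonly obtained from an impulse response, the following minimal sketch applies Schroeder backward integration and a straight-line fit between −5 dB and −35 dB of the energy decay curve. It is not the measurement procedure of [46]; the function name `t30_from_ir` and the synthetic, exponentially decaying noise used as input are assumptions made for this example.

```python
import numpy as np

def t30_from_ir(ir, fs):
    """Estimate T30: Schroeder backward integration, line fit from -5 to -35 dB."""
    energy = np.asarray(ir, dtype=float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]            # backward-integrated energy
    edc_db = 10.0 * np.log10(edc / edc[0])         # energy decay curve in dB
    t = np.arange(len(edc_db)) / fs
    fit = (edc_db <= -5.0) & (edc_db >= -35.0)     # evaluation range for T30
    slope, _ = np.polyfit(t[fit], edc_db[fit], 1)  # decay rate in dB/s (negative)
    return -60.0 / slope                           # extrapolate to a 60 dB decay

# Synthetic test signal (assumption): noise with a single exponential decay
# whose nominal reverberation time equals the Living Room value of 0.54 s.
fs = 44100
t60_target = 0.54
t = np.arange(int(3 * t60_target * fs)) / fs
rng = np.random.default_rng(0)
ir = rng.standard_normal(t.size) * 10.0 ** (-3.0 * t / t60_target)
t30_estimate = t30_from_ir(ir, fs)
```

A single line fit of this kind summarizes a decay with one slope only. For a dual-slope decay such as the one described for the Underground Station, the estimate depends on where the fit range falls relative to the transition between the two slopes.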

2.3. Room Acoustic Simulation

To simulate acoustics with varying ALODs, BRIRs and loudspeaker reproductions were generated using the Room Acoustics SimulatoR (RAZR) [11,13,18]. RAZR employs a “proxy shoebox room” ISM for early reflections and a feedback delay network for diffuse reverberation using physically based delay times. The accuracy and perceptual plausibility of RAZR have been validated in comparison to real acoustic environments [18,21,22,30]. Throughout this study, each scene was simulated with six distinct sets of features in the room acoustic simulation to vary the ALOD. This is referred to as Rendering in the following, additionally including measured BRIRs for headphone reproduction.
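To make the idea of a feedback delay network with physically based delay times more concrete, the sketch below uses a generic textbook relationship and not RAZR's actual implementation: a delay line of length τ that should contribute to a decay of 60 dB within T60 needs a per-pass gain of 10^(−3τ/T60). Deriving one delay per room dimension as dimension divided by the speed of sound is an assumption made here purely for illustration, as are the function name and the speed of sound of 343 m/s.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value

def fdn_delays_and_gains(room_dims_m, t60_s):
    """Return hypothetical delay times (s) and per-pass gains for a target T60."""
    # Assumption: one delay line per room dimension, delay = travel time across it.
    delays = [d / SPEED_OF_SOUND for d in room_dims_m]
    # Each pass through a delay of tau seconds attenuates by 60 * tau / T60 dB.
    gains = [10.0 ** (-3.0 * tau / t60_s) for tau in delays]
    return delays, gains

# Living Room dimensions and T30 from the scene description.
delays, gains = fdn_delays_and_gains((4.97, 3.78, 2.71), 0.54)
```

A frequency-dependent reverberation time, as fitted in this study, would require replacing the scalar gains with absorption filters.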

An overview of all rendering conditions and the corresponding acoustic modeling features is provided in Table 2.

(1) The highest ALOD utilized RAZR with all features, including coupled rooms and a third-order ISM for early reflections. The ISM employed jittering of image source positions to avoid unrealistic, overly regular reflection patterns, and temporal smearing of early diffuse reflections to simulate scattering and multiple reflections caused by geometric irregularities and objects within the room. This configuration is referred to as RAZR.

(2) In the second configuration, labeled RAZR-1st-Order, the level of detail in early reflections was reduced by lowering the ISM order from three to one.

(3) The third configuration, RAZR-Simple, involved disabling specific features of the acoustic simulation for each room. For the Living Room, the coupled room simulation was simplified by performing two separate simulations, one for each room. The kitchen was first simulated in RAZR, with an omnidirectional receiver placed at the (closed) door. The final simulation was generated by simulating only the Living Room with an omnidirectional virtual source at the center of the (closed) door, radiating the response from the kitchen. In the Pub, the nearby reflective surfaces of the table and chalkboard were removed. In the Underground Station, the coupled volume representing the underground tunnels and escalators, responsible for the dual-slope energy decay characteristic, was omitted.

(4) The fourth set, referred to as ISM, further reduced ALOD by using a straightforward implementation of a 15th-order shoebox ISM, disregarding sound scattering effects and resulting in an unnatural, completely regular pattern of specular reflections. Despite its simplicity, this type of simplified simulation is still commonly employed in room acoustics research, particularly when the goal is to isolate specific acoustic parameters or to generate computationally efficient baselines. For instance, Rathnayake and Wanniarachchi [59] used a basic ISM to explore spatial echo patterns and reverberation characteristics in 3D environments, focusing on perceptual localization effects without incorporating scattering. Similarly, Scheibler et al. [60] originally introduced Pyroomacoustics as a Python-based simulation toolkit based on the image-source model for shoebox-shaped rooms. Since then, the library has been substantially extended to support more general room geometries, hybrid modeling approaches, and advanced features such as scattering and late reverberation (Pyroomacoustics documentation v0.8.6; [61]).

(5) For the externalization experiment, a diotic condition, labeled Diotic, was added. For Diotic headphone tests, the left channel of the recorded signal was presented to both ears. For loudspeaker tests, Diotic was presented through a single loudspeaker positioned in front of the participant.

(6) An additional control condition, referred to as Anechoic in the following, was generated for the SI experiment. This set of BRIRs contains no room reflections; it accounts only for the distance between talker and listener, via the inverse-square law, and for edge diffraction of the direct sound, as it occurs in the Living Room scene between receiver and target.
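The difference between the first-, third-, and 15th-order ISM conditions can be illustrated by enumerating the image sources of a shoebox room. The following minimal sketch mirrors the source across the six walls and keeps every image whose reflection order does not exceed a given maximum. It reproduces only the plain, fully regular image source pattern, without wall absorption, jittering, or scattering, and is not the RAZR code; the source position is a hypothetical value chosen for the example.

```python
from itertools import product

def shoebox_image_sources(room, source, max_order):
    """List (order, position) of image sources up to max_order (order 0 = direct sound)."""
    images = []
    index_range = range(-max_order, max_order + 1)
    for idx in product(index_range, repeat=3):
        order = sum(abs(m) for m in idx)   # total number of wall reflections
        if order > max_order:
            continue
        # Even index: shifted copy of the source; odd index: mirrored copy.
        pos = tuple(m * size + (s if m % 2 == 0 else size - s)
                    for m, size, s in zip(idx, room, source))
        images.append((order, pos))
    return images

living_room = (4.97, 3.78, 2.71)   # dimensions from the scene description
source_pos = (1.0, 1.5, 1.2)       # hypothetical source position in metres

counts = {n: len(shoebox_image_sources(living_room, source_pos, n))
          for n in (1, 3, 15)}
```

Including the direct sound, this yields 7, 63, and 4991 sources for orders 1, 3, and 15, respectively, which shows how quickly the number of specular components grows with ISM order.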

For headphone reproduction, an HRTF was used, which matches the artificial head used in the respective measurement of the BRIRs in the real environment. For the simulation of the Living Room, the HRTF database of Braren and Fels [62] was used. The databases according to the recordings as specified in Grimm et al. [63] were used for the Pub, and as specified in Hladek et al. [64] for the Underground Station. Besides the HRTF, the matching source directivity was also used for each environment. For the Living Room and Pub, a measurement of a Genelec 8030 loudspeaker was conducted, for which the loudspeaker was placed in a rotating support in an anechoic chamber. Impulse responses were measured on a spherical grid at a distance of 2.85 m from the center of the loudspeaker, using an omnidirectional microphone. The azimuth resolution was 6°, and elevation was sampled at a step size of 3° around 0° elevation and sampled more sparsely toward the poles, for a total of 1680 directions. One hemisphere was measured and mirrored for the remaining azimuth angles. For the Underground Station, the source directivity of a BM6A MK-II loudspeaker was used [64]. After generating the BRIRs in RAZR, a post-processing step was conducted in order to match the average spectrum of the simulated impulse responses to the measured ones in the frequency domain. This adjustment ensured that, within a range of 100 Hz to 16,000 Hz, the average deviation between the simulated and the measured response was less than 0.5 dB across all three environments and for both the target and the masker position, based on third-octave smoothed magnitude spectra.
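The spectral post-processing step can be pictured as a band-wise level comparison followed by a correction of the simulated response. The sketch below is a deliberately crude stand-in for that step: it computes energies in third-octave bands between 100 Hz and 16 kHz and applies one constant gain per band in the frequency domain. The band-edge convention, the brick-wall band gains, the equal-length assumption for both responses, and the synthetic test signals are all assumptions for this example; the correction filter actually used in the study is not specified at this level of detail.

```python
import numpy as np

def band_edges(f_lo=100.0, f_hi=16000.0):
    """Edges of third-octave bands whose centres run from f_lo to about f_hi."""
    n_bands = int(round(3 * np.log2(f_hi / f_lo))) + 1
    return f_lo * 2.0 ** ((np.arange(n_bands + 1) - 0.5) / 3.0)

def band_levels_db(ir, fs, edges):
    """Energy per band in dB, taken from the power spectrum of the response."""
    power = np.abs(np.fft.rfft(ir)) ** 2
    freqs = np.fft.rfftfreq(len(ir), 1.0 / fs)
    return np.array([10 * np.log10(power[(freqs >= lo) & (freqs < hi)].sum())
                     for lo, hi in zip(edges[:-1], edges[1:])])

def match_band_spectrum(sim_ir, meas_ir, fs):
    """Scale each band of sim_ir so its band energy equals that of meas_ir."""
    edges = band_edges()
    gain_db = band_levels_db(meas_ir, fs, edges) - band_levels_db(sim_ir, fs, edges)
    spec = np.fft.rfft(sim_ir)
    freqs = np.fft.rfftfreq(len(sim_ir), 1.0 / fs)
    for lo, hi, g in zip(edges[:-1], edges[1:], gain_db):
        spec[(freqs >= lo) & (freqs < hi)] *= 10.0 ** (g / 20.0)
    return np.fft.irfft(spec, n=len(sim_ir))

# Synthetic stand-ins for a simulated and a measured response (equal length).
fs = 44100
rng = np.random.default_rng(1)
decay = 10.0 ** (-3.0 * np.arange(fs) / fs / 0.7)
sim_ir = rng.standard_normal(fs) * decay
meas_ir = 0.5 * rng.standard_normal(fs) * decay

corrected = match_band_spectrum(sim_ir, meas_ir, fs)
edges = band_edges()
residual_db = band_levels_db(corrected, fs, edges) - band_levels_db(meas_ir, fs, edges)
```

With per-band gains the residual band-level deviation vanishes by construction. A smooth correction filter, as would be preferable for audio quality, leaves a small residual, which is consistent with the 0.5 dB criterion reported above.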

For loudspeaker reproduction, the simulated impulse responses were rendered over the 86-loudspeaker array (see Apparatus and Procedure) using vector-based amplitude panning (VBAP; [65]) according to the directional incidence of each component. The same spectral post-processing as for the BRIR generation was applied, using a single correction filter derived from the comparison of measured and simulated reference impulse responses and applied identically to all 86 channels.
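As a reminder of how VBAP distributes a component between loudspeakers, the following sketch implements the two-dimensional, pairwise form for a single horizontal ring. It is a simplification: the study used a three-dimensional array, for which VBAP operates on loudspeaker triplets. The function name and the assumption of 48 equally spaced loudspeakers starting at 0° azimuth are illustrative choices, not details taken from the study.

```python
import numpy as np

def vbap_2d_gains(source_az_deg, speaker_az_deg):
    """Pairwise 2-D VBAP gains for a horizontal ring (energy-normalized)."""
    spk = np.sort(np.asarray(speaker_az_deg, dtype=float) % 360.0)
    az = source_az_deg % 360.0
    # Adjacent loudspeaker pair enclosing the source direction (with wrap-around).
    hi = int(np.searchsorted(spk, az, side="right")) % len(spk)
    lo = (hi - 1) % len(spk)

    def unit(angle_deg):
        a = np.radians(angle_deg)
        return np.array([np.cos(a), np.sin(a)])

    # Solve p = g_lo * l_lo + g_hi * l_hi for the two gains.
    base = np.column_stack([unit(spk[lo]), unit(spk[hi])])
    g = np.linalg.solve(base, unit(az))
    g = g / np.linalg.norm(g)              # constant-power normalization
    gains = np.zeros(len(spk))
    gains[[lo, hi]] = g
    return spk, gains

ring = np.arange(48) * 7.5                 # assumed: 48 equally spaced loudspeakers
spk, gains = vbap_2d_gains(3.75, ring)     # source midway between two loudspeakers
```

For a source exactly between two loudspeakers, both receive the same gain; as the source approaches one loudspeaker, the gain of the other tends to zero.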

2.4. Stimuli

The simulated (B)RIRs and, for headphone rendering, additionally, the recorded BRIRs from the real environments [46] (hereafter referred to as “Measured”) were convolved with one of four source stimuli. The first stimulus was speech material from the Oldenburger Satz Test (OLSA [66]), a matrix speech test composed of a large set of grammatically correct but semantically unpredictable sentences, each constructed from a total of 50 words with ten alternatives for each word type (name-verb-numeral-adjective-object). These sentences were spoken by a male speaker. For the plausibility test, eight random sentences were selected, while a single sentence was used for the evaluation of overall difference, the spatial audio quality items, and externalization to allow direct comparison between the BRIR sets using the identical target signal.

The second target stimulus was a deterministic transient signal, referred to as pink pulse. This 500-ms stimulus (sampling rate of 44.1 kHz) was generated in the frequency domain by defining a pink spectrum between 50 Hz and the Nyquist frequency with minimum phase and transforming it to the time domain, following the approach described by Kirsch et al. [67]. The signal decayed from 0 dBFS to −60 dBFS within 36 ms. For the plausibility test, the spectrum of the pink pulse was modified across 10 octave bands between 31 Hz and 16 kHz. Eight variations were generated by randomly assigning each band to one of three possible level conditions, +6 dB, −6 dB, or 0 dB relative to the unmodified pink spectrum, resulting in pulses with distinct tonal qualities ranging from “knocking on wood” to a “bursting balloon.” For the assessment of the overall difference and externalization, the original pink-colored pulse without spectral modification was used.
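The construction of the pink pulse can be sketched as follows: a pink magnitude spectrum (amplitude proportional to 1/√f) is defined from 50 Hz up to the Nyquist frequency, a minimum-phase response is derived from it, and the result is transformed to the time domain. The sketch uses the standard cepstral folding method to obtain the minimum phase; whether Kirsch et al. [67] used this particular method is not stated here, and the −60 dB spectral floor below 50 Hz as well as the peak normalization are assumptions made for this example.

```python
import numpy as np

def pink_minimum_phase_pulse(fs=44100, duration_s=0.5, f_low=50.0, floor_db=-60.0):
    """Minimum-phase pulse with a pink magnitude spectrum above f_low."""
    n = int(round(fs * duration_s))
    assert n % 2 == 0, "sketch assumes an even number of samples"
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    # Assumed floor below f_low (log of zero is undefined in the cepstrum).
    mag = np.full(freqs.shape, 10.0 ** (floor_db / 20.0))
    passband = freqs >= f_low
    mag[passband] = np.sqrt(f_low / freqs[passband])   # -3 dB per octave
    full_mag = np.concatenate([mag, mag[-2:0:-1]])     # symmetric two-sided spectrum
    # Fold the real cepstrum onto positive quefrencies to obtain minimum phase.
    cepstrum = np.fft.ifft(np.log(full_mag)).real
    fold = np.zeros(n)
    fold[0] = 1.0
    fold[1:n // 2] = 2.0
    fold[n // 2] = 1.0
    pulse = np.fft.ifft(np.exp(np.fft.fft(cepstrum * fold))).real
    return pulse / np.max(np.abs(pulse))               # peak at 0 dBFS

pulse = pink_minimum_phase_pulse()
spectrum = np.abs(np.fft.rfft(pulse))                  # 2 Hz bin spacing
ratio_100_to_400_hz = spectrum[50] / spectrum[200]
```

Because the amplitude falls with 1/√f, the magnitude at 100 Hz is twice that at 400 Hz, that is, 3 dB per octave. The band-wise level variations used in the plausibility test could be imposed on `mag` before the minimum-phase step.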

For the SI experiment, a male-transformed version [68] of the International Speech Test Signal (ISTS, [69]) served as a masker. The ISTS comprises nonsensical speech synthesized from recordings of six different speakers in various languages. Lastly, a 2-s sample of an electric bass guitar was used for the overall difference item of the spatial audio quality test.

2.5. Apparatus and Procedure

For the headphone experiments, participants were seated in a soundproof booth with double walls, equipped with Sennheiser HD 650 headphones connected to an RME Fireface UCX audio interface running at a sampling rate of 44.1 kHz. All listening tests were conducted using Matlab. A computer monitor was placed in front of the participants, who provided their responses using a computer mouse and keyboard. For the loudspeaker-based experiments (overall difference and externalization tasks), measurements were conducted in the VR lab at the University of Oldenburg, an anechoic chamber containing a three-dimensional array of 86 Genelec 8030 loudspeakers; see, e.g., [70]. Participants were seated at the center of the loudspeaker array, with the main horizontal ring (radius 2.4 m, height 1.8 m) consisting of 48 loudspeakers. Two additional horizontal rings, each with twelve loudspeakers, were positioned at elevation angles of −30° and 30°, and two more rings with six loudspeakers each were positioned at elevation angles of −60° and 60°. Two loudspeakers were placed at −90° and 90° elevation. Participants were seated on a chair located on a platform of 0.5 m height so that the height of the participant’s ear canal was approximately aligned with the loudspeakers in the main horizontal ring. A tablet computer was positioned in front of the participant, who used the touchscreen to provide responses. To calibrate the loudspeaker array and control the sound field at the listener’s position, a G.R.A.S. 46-DP-1 1/8” LEMO Pressure Standard microphone was placed at head height in the center of the array. Delay and broadband levels for each individual loudspeaker were adjusted based on sweep measurements, cf. [71]. The sweep had a duration of 3.2 s and spanned a frequency range from 100 Hz to 22,050 Hz; an exponential sine sweep was employed to ensure even energy distribution across octaves.
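To illustrate why an exponential sweep distributes energy evenly across octaves, the following sketch generates such a sweep with the stated duration and frequency range. The standard exponential-sweep formula, the sampling rate of 44.1 kHz for the calibration signal, and all function names are assumptions; the source does not specify how the sweep was synthesized.

```python
import math

F_START, F_STOP, DURATION_S, FS = 100.0, 22050.0, 3.2, 44100  # FS is an assumption

def instantaneous_frequency(t, f_start=F_START, f_stop=F_STOP, duration_s=DURATION_S):
    """Frequency of the exponential sweep at time t (seconds)."""
    return f_start * math.exp(t / duration_s * math.log(f_stop / f_start))

def octave_duration(f_start=F_START, f_stop=F_STOP, duration_s=DURATION_S):
    """Time the sweep spends in any one octave (identical for all octaves)."""
    return duration_s * math.log(2.0) / math.log(f_stop / f_start)

def exponential_sweep(f_start=F_START, f_stop=F_STOP, duration_s=DURATION_S, fs=FS):
    """Samples of sin(phase(t)), where phase is the integral of 2*pi*f(t)."""
    rate = math.log(f_stop / f_start)
    scale = 2.0 * math.pi * f_start * duration_s / rate
    n = int(round(duration_s * fs))
    return [math.sin(scale * (math.exp(rate * i / (fs * duration_s)) - 1.0))
            for i in range(n)]

sweep = exponential_sweep()
```

Because the instantaneous frequency doubles after every `octave_duration()` seconds, each octave receives the same sweep time and hence the same energy at constant amplitude.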

In the plausibility experiment, a single stimulus, either real (based on recorded BRIRs) or simulated, was presented through headphones. Participants were then asked, “Was the stimulus real or simulated?” This constituted a two-alternative forced-choice test. The plausibility test involved six measurements, one for each scene (Living Room, Pub, Underground Station) and for the two types of simulated impulse responses used in the test (RAZR, ISM). For each measurement, 16 target signals, comprising 8 sentences and 8 pulses (see Section 2.4 Stimuli), were randomly presented three times, resulting in 48 presentations per measurement. Each measurement was performed twice, as both a test and retest. Before the measurements, participants underwent a brief familiarization session where they listened to example stimuli. These examples used BRIRs from different positions within the environments than those later used in the test. Both a pulse and a sentence from the speech material were presented during familiarization. The “dry” source stimuli were initially presented without any BRIR applied, followed by the same stimuli convolved with the measured and ISM BRIRs.

Participants were informed during the familiarization phase whether example stimuli were based on measured or simulated BRIRs to provide perceptual reference points and to support a consistent interpretation of the rating scales. This information was limited to the familiarization procedure only. During the actual test and retest measurements, no information about the rendering method was provided.

In the experiment, stimulus presentation order was fully randomized across participants. For each listener, a unique random sequence was chosen independently for the test and retest sessions, ensuring that the order differed between repetitions. Randomization was applied across stimulus types, rendering conditions, environments, and reproduction methods. To avoid systematic order effects, the randomization procedure ensured a balanced distribution of stimulus categories across the experiment.

This experiment was conducted in Matlab with the Alternative Forced Choice package (AFC [72]).
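The trial structure of one plausibility measurement (16 target signals, each presented three times, in random order) can be sketched as follows. The experiment itself was run with the AFC package in Matlab; this Python sketch, the function name `plausibility_trial_order`, and the use of seeds to obtain different orders for test and retest are illustrative assumptions. The assignment of real and simulated BRIRs to trials is not modeled here.

```python
import random

def plausibility_trial_order(n_sentences=8, n_pulses=8, repeats=3, seed=None):
    """Return a shuffled list of (stimulus_type, index) tuples for one measurement."""
    targets = [("speech", i) for i in range(n_sentences)]
    targets += [("pulse", i) for i in range(n_pulses)]
    trials = targets * repeats            # every target occurs `repeats` times
    rng = random.Random(seed)             # independent generator per session
    rng.shuffle(trials)
    return trials

# Hypothetical seeds giving different sequences for test and retest.
test_order = plausibility_trial_order(seed=1)
retest_order = plausibility_trial_order(seed=2)
```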

For measuring SI, the Oldenburger Satztest (OLSA [66]; see Section 2.4 Stimuli) was conducted with AFC [72] using the closed version of this test. In this version, the participants were presented with a matrix of words on their screen after listening to the sentence. In this matrix, all 10 response alternatives for each word type were presented and the participant’s task was to click on each word they understood. A spatially separated masker (ISTS) was positioned towards the right side of the listener at a close distance of about 1 m in each environment (see red circles in Figure 1). The masker was presented at a level of 65 dB A, whereas the level of the target sentence was varied after each sentence, starting at 70 dB A. For each test, the signal-to-noise ratio was determined at which the participant could recognize 50% of the speech material (speech recognition threshold, SRT). Lists of twenty sentences were used for the BRIR sets Measured, RAZR, ISM, and Anechoic.
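The idea of tracking the SNR towards the 50% point can be illustrated with a generic adaptive rule. This is explicitly not the OLSA tracking procedure, which is not described in this section: the update rule, the step size, the logistic listener model with its slope, and the function name `track_srt` are assumptions for illustration only. The starting SNR of +5 dB follows from the stated levels (70 dB A target, 65 dB A masker).

```python
import math

def track_srt(true_srt_db, start_snr_db=5.0, n_sentences=20,
              step_db=4.0, slope_per_db=0.15):
    """Generic adaptive track: lower the SNR after good scores, raise it after poor ones.

    The simulated listener answers with the expected fraction of correct words
    given by a logistic psychometric function (noise-free, for illustration).
    """
    def fraction_correct(snr_db):
        return 1.0 / (1.0 + math.exp(-4.0 * slope_per_db * (snr_db - true_srt_db)))

    snr = start_snr_db
    for _ in range(n_sentences):
        # The step is proportional to the distance of the score from the 50% target.
        snr -= step_db * (fraction_correct(snr) - 0.5)
    return snr

# Hypothetical listener with a true SRT of -7 dB SNR.
estimate = track_srt(true_srt_db=-7.0)
```

With these assumed parameters the track settles close to the simulated listener's 50% point within the twenty sentences of one list.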

For the evaluation of spatial audio quality, eight items out of the Spatial Audio Quality Inventory (SAQI) [45] were chosen, which are listed in Table 3. To obtain ratings on the spatial audio quality items, a procedure similar to the multi-stimulus test with hidden reference and anchor (MUSHRA; [73]) was used. The measurements were performed once with headphones and once within the 86-loudspeaker array. Listeners rated various stimuli relative to a reference with regard to the respective SAQI item using sliders on the computer screen and were able to listen to the stimuli repeatedly and sort the stimuli according to their ratings. The sliders always displayed a continuous range from 0 to 100. For the overall difference item, the slider’s default position was set to 0, whereas for all other items it was set to 50. In both cases, the default position reflected “no difference to the reference”. The meanings of the maximum and minimum scores are listed in Table 1.

The test signals were convolved with different BRIRs: Measured (only available for headphone presentation), RAZR, ISM, RAZR-Simple, and RAZR-1st-Order. RAZR served as the (hidden) reference for all SAQI items, whereas ISM effectively took the role of the anchor. For the overall difference, a single sentence of the OLSA test, the deterministic pink pulse (with no spectral variation applied) and a short sample of a played electric bass guitar were used as targets. For the other spatial audio quality items, only a pink-shaped pulse was used. This experiment was measured twice for each participant in a test and a retest.

The last experiment evaluated the perceived externalization. Again, a multi-stimulus test was used, similar to the procedure for the measurement of the overall difference, however, without a (hidden) reference. On a scale of 0–100, participants rated the perceived externalization with four verbal descriptors. One hundred was labeled “Clearly outside the head”, 66 represented “Close to the head”, 33 represented “Between the ears”, and 0 “Central in the head”. The subjects could listen to each signal repeatedly. For this experiment, Measured, RAZR, ISM, and Diotic were used. Similar to the test for the overall difference, this experiment was separated into six different tests, consisting of two different types of target signal (speech and pink pulse) and three different environments (Living Room, Pub, and Underground Station). Again, a test and a retest were conducted. Before the measurement with headphones, a short familiarization was performed. For this, the same environments were used as for the measurement, but with different positions for source and receiver. Stimuli generated with the measured BRIR were presented either diotically or binaurally to demonstrate the difference between internally and externally sounding signals.

For plausibility, spatial audio quality, and externalization, the consistency of the subject responses was examined. Each participant’s responses for the test and for the retest were averaged separately and the test-retest correlation was calculated. A minimum correlation coefficient of 0.8 was required; participants who did not reach this value were measured a second time, after which they all reached the minimum value. For the SI experiment, a mean test-retest difference not exceeding 1.5 dB across all conditions was required for a participant’s data to be included. This criterion did not lead to any re-measurements or excluded data.
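The consistency criterion can be sketched as a correlation check between test and retest ratings. The source does not state which correlation coefficient was used; Pearson's r, the function names, and the example ratings below are assumptions.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long rating lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def is_consistent(test_ratings, retest_ratings, min_r=0.8):
    """True if the test-retest correlation reaches the required minimum."""
    return pearson_r(test_ratings, retest_ratings) >= min_r

# Hypothetical ratings of one participant for four conditions (0 to 100 scale).
needs_remeasurement = not is_consistent([0, 50, 100, 20], [5, 45, 95, 30])
```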

2.6. Statistical Analysis

For all measures (plausibility, SI, overall difference, spatial audio quality, and externalization), a repeated-measures analysis of variance (rm-ANOVA) was performed using IBM SPSS Statistics 28.0. Greenhouse–Geisser correction was applied if sphericity was violated. Post-hoc pairwise comparisons used Bonferroni correction. For the overall difference ratings, two rm-ANOVAs were conducted: First, a four-way rm-ANOVA was performed with the within-subject factors Stimulus (Speech, Pulse, Bass), Environment (Living Room, Pub, Underground Station), Presentation Mode (Headphone, Loudspeaker), and Rendering (RAZR, ISM, RAZR-Simple, RAZR-1st-Order). Given that measured BRIRs were only available for headphone presentation, a second three-way rm-ANOVA was performed on the headphone data only, including the factors Stimulus, Environment, and Rendering (with the additional level Measured). This additional analysis was specifically used to compare the Measured and simulated Renderings. Likewise, for the spatial audio quality, two rm-ANOVAs were conducted, but only a single Stimulus (the pink pulse) was tested. Accordingly, the first analysis included the within-subject factors Environment, Presentation Mode, and Rendering (RAZR, ISM, RAZR-Simple, RAZR-1st-Order). The second analysis was performed on the headphone data only, including Environment and Rendering, again to allow direct comparison between Measured and the simulations. Unless stated otherwise, reported statistics refer to the four-way ANOVA. A summary of all significant main effects and first-order interactions is provided in Table A1 in Appendix A.
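For readers unfamiliar with the Bonferroni correction used in the post-hoc tests, the following sketch shows the usual form in which each p-value is multiplied by the number of comparisons and capped at 1. The analyses in this study were run in SPSS; the function names and the example p-values are hypothetical.

```python
def n_pairwise_comparisons(n_levels):
    """Number of distinct pairs among the levels of one factor."""
    return n_levels * (n_levels - 1) // 2

def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical uncorrected p-values for the three pairs of a three-level factor.
raw_p = [0.01, 0.04, 0.5]
assert len(raw_p) == n_pairwise_comparisons(3)
adjusted_p = bonferroni_adjust(raw_p)
```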

3. Results

3.1. Plausibility

The results of the plausibility test are shown in Figure 2, representing a re-plot of the raw data from [42]. The average test-retest correlation across the eight participants was 0.882 and above 0.8 for each participant, indicating high rating consistency. The top panel shows results for speech and the bottom panel for the pulse. The color of the box plots indicates the Rendering condition. Here and in the following figures, green shows results for Measured, red shows RAZR, and blue shows ISM (from left to right). The circles represent individual results and the box plots show the median, the 25th and 75th percentiles (interquartile range), and the overall range. Asterisks indicate significant pairwise differences within each Environment and Stimulus, obtained from post-hoc comparisons conducted following a three-way rm-ANOVA [Stimulus (Speech, Pulse) × Environment (Living Room, Pub, Underground Station) × Rendering (Measured, RAZR, ISM)]. Across environments, the highest ALOD condition (RAZR) generally yields plausibility ratings comparable to the measured reference, indicating that listeners largely accepted these simulations as plausible representations of the real rooms.

The highest and most consistent plausibility ratings overall are found for Measured (green) and RAZR (red) in the Pub (middle column of Figure 2). Here, pairwise comparisons of marginal means averaged across both stimuli revealed that in the Pub, ISM (blue) was significantly less often perceived as plausible than Measured [mean difference (md) = 48.31%; p = 0.004] and RAZR (md = 50.13%; p < 0.001). This pattern with a visually clearly lower plausibility for ISM was observed for both stimuli, Speech and Pulse. No significant differences between ISM and Measured and between ISM and RAZR were found when the speech material was used in the Living Room (top row, first column in Figure 2) and in the Underground Station (top row, third column in Figure 2). Although visually less apparent, Measured and RAZR also received significantly higher plausibility ratings than ISM in the Living Room and Underground Station for the Pulse stimulus (bottom row of Figure 2).

Overall, the ANOVA revealed a significant main effect of Rendering [F(1.034, 7.235) = 34.08, p < 0.001, η² = 0.830] but no main effect of Stimulus or Environment. A significant interaction between Stimulus and Rendering was found. Regarding the source of this interaction, post-hoc tests showed that a significant difference between speech and pulse only occurred for ISM, as observed by comparing the blue box plots of the top row with the blue box plots of the bottom row. Apart from this interaction, Stimulus and Environment had no significant effect on plausibility.

3.2. Speech Intelligibility

Figure 3 shows the mean SRT for the three environments obtained with headphone presentation. The color coding follows the convention used in Figure 2. In addition, the orange plots represent the results for Anechoic. The circles represent individual results and the box plots show the median, interquartile, and the overall ranges. Asterisks indicate significant pairwise differences between Renderings within each Environment, obtained from post-hoc comparisons following a two-way rm-ANOVA [Environment (Living Room, Pub, Underground Station) × Rendering (Measured, RAZR, ISM, Anechoic)]. Across all environments, Measured (green), RAZR (red), and ISM (blue) yield comparable SRT distributions, whereas the Anechoic condition (orange) consistently shows markedly lower SRTs, visible as a clear downward shift of the orange boxplots. In the Living Room, SRTs for Measured were significantly higher than for RAZR, while RAZR and ISM did not differ. In the Underground Station, Measured differed significantly from ISM, while RAZR and ISM did not differ.

Overall, the rm-ANOVA revealed a significant main effect of Rendering on SI [F(3, 21) = 146.48; p < 0.001; η² = 0.954], with post-hoc pairwise comparisons yielding the same significant differences as indicated for the Pub in Figure 3: RAZR (red) and ISM (blue) did not significantly differ from Measured (green), indicating that both simulated renderings reproduce SI close to the real-room reference. By contrast, and as also visually apparent in Figure 3, Anechoic (orange) yielded significantly lower SRTs than all other renderings (md = 5.31 dB vs. Measured, p < 0.001; md = 4.55 dB vs. RAZR, p < 0.001; md = 5.72 dB vs. ISM, p < 0.001). Additionally, a significant difference between RAZR and ISM was found (md = 1.17 dB, p = 0.005) with RAZR providing the lower SRTs.

A main effect of Environment was also found [F(1.206, 8.442) = 54.08; p < 0.001; η² = 0.885], with generally lower SRTs in the Pub, followed by the Living Room and the Underground Station as visible in Figure 3. The ANOVA also revealed a significant Rendering × Environment interaction [F(2.36, 16.54) = 12.71; p < 0.001; η² = 0.645], indicating that the Rendering differences varied somewhat across Environments as shown by the asterisks. However, this interaction does not reflect an effect of Environment, but rather differences in how the renderings performed within each Environment.

3.3. Overall Difference

Figure 4 shows the mean overall difference ratings for each Rendering condition, separated by Presentation Mode (upper three rows for Headphones, lower three for Loudspeakers) and by Stimulus (Speech, Pulse, Bass). The results for Speech and Pulse are a re-plot of the raw data from [42]. The colors of the box plots follow the color coding of the previous figures with two new Rendering conditions added here (from left to right): green, Measured (headphone presentation only); red, RAZR (reference); blue, ISM; black, RAZR-Simple; teal, RAZR-1st-Order. A value of 0 on the y-axis represents “no perceptible difference from the reference signal” and a value of 100 represents the rating “very large difference from the reference”. The circles represent individual results, and the box plots show the median, interquartile range, and overall range. The average test-retest correlation was 0.92, and results were pooled across repetitions. Black asterisks indicate significant pairwise differences from the reference Rendering (RAZR, red) within each Environment, Stimulus, and Presentation Mode, obtained from post-hoc comparisons following a four-way rm-ANOVA [Stimulus (Speech, Pulse, Bass) × Environment (Living Room, Pub, Underground Station) × Presentation Mode (Headphones, Loudspeaker) × Rendering (RAZR, ISM, RAZR-Simple, RAZR-1st-Order)]. Red asterisks additionally indicate significant differences between Measured and RAZR, obtained from post-hoc comparisons following a three-way rm-ANOVA [Stimulus (Speech, Pulse, Bass) × Environment (Living Room, Pub, Underground Station) × Rendering (Measured, RAZR, ISM, RAZR-Simple, RAZR-1st-Order)], covering the full factorial design for headphone presentation.

For the factor Rendering, post-hoc comparisons were conducted only relative to the reference condition (RAZR), as this rendering was available for both presentation modes and served as the baseline for all perceptual evaluations. In general, the hidden reference (RAZR) was consistently detected and rated with no difference, indicating reliability of the results. This is visible in Figure 4 as near-zero median difference ratings for RAZR across all environments, stimuli, and presentation modes. RAZR and RAZR-1st-Order (teal) were rated similarly across all conditions, while ISM (blue) and Measured (green) produced the largest perceptual differences from the reference. This ordering is consistently reflected across rows and columns of Figure 4, with ISM showing the highest median difference ratings in most conditions. In the Underground Station, RAZR-Simple (black) without the dual slope decay was also mostly rated with 0 difference to the reference. Confirmed by significant differences marked as black asterisks, ISM produced significantly higher difference ratings than RAZR in the Living Room and Underground Station, whereas Measured differed significantly from RAZR in all environments as indicated by the red asterisks.

The four-way rm-ANOVA revealed significant main effects of Stimulus, Environment, and Rendering, but no main effect of Presentation Mode. Specifically, Stimulus had a significant effect on the perceived overall difference [F(2, 14) = 60.63, p < 0.001, η² = 0.896], indicating that the rated differences depended on the type of target signal. Differences were generally rated higher for the Pulse than for Speech (md = 13.06; p < 0.001) or Bass (md = 12.49; p < 0.001) stimuli. This stimulus dependence is visible in Figure 4 as consistently higher distributions for Pulse compared to Speech and Bass across environments and renderings. A significant main effect of Environment was also found [F(2, 14) = 237.90, p < 0.001, η² = 0.971]. Post-hoc tests for Environment (not shown) revealed that larger differences occurred in the Living Room than in the Pub and Underground Station. Accordingly, Figure 4 shows generally higher difference ratings in the left column compared to the center and right columns.

The main effect of Rendering was highly significant [F(1.605, 11.235) = 470.88, p < 0.001, η² = 0.985] and post-hoc pairwise comparisons revealed that differences were significantly larger for ISM (md = 58.04, p < 0.001) and RAZR-Simple (md = 19.55, p < 0.001) than for the reference RAZR. In contrast, RAZR and RAZR-1st-Order differed only slightly (md = 0.85, p = 0.039). This pattern is clearly visible in Figure 4, where both Renderings show near-zero median differences across all stimuli. Across environments and stimuli, RAZR and RAZR-1st-Order were rated most similar, while ISM and RAZR-Simple produced substantially higher difference ratings.

Results from the additional three-way rm-ANOVA (headphone data only) showed a significant main effect of Rendering, and the related post-hoc tests confirmed that the Measured condition was consistently rated as significantly different from the RAZR reference across all environments, as indicated by the red asterisks in Figure 4. In the four-way rm-ANOVA, no significant main effect of Presentation Mode was found [F(1, 7) = 2.32, p = 0.172, η² = 0.249], indicating that overall difference ratings were similar between headphone and loudspeaker presentation. This similarity is reflected in Figure 4 by comparable distributions across the upper (headphone) and lower (loudspeaker) panels. A Stimulus × Environment interaction was found [F(4, 28) = 9.99, p < 0.001, η² = 0.588]. Post-hoc tests showed that this interaction was driven by the Bass stimulus: ratings in the Living Room differed significantly from those in the Pub and the Underground Station. For Speech and Pulse, no such differences between environments occurred.
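As a plausibility check on the reported effect sizes, the following sketch recomputes them from the F values and degrees of freedom given in this section. Interpreting the reported η² as partial eta squared is an assumption (the text does not say so explicitly), but under that assumption the recomputed values agree with the reported ones to within rounding. The function name and the dictionary layout are illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# (F, df_effect, df_error, reported eta squared) for the overall difference ratings.
reported = {
    "Stimulus": (60.63, 2, 14, 0.896),
    "Environment": (237.90, 2, 14, 0.971),
    "Rendering": (470.88, 1.605, 11.235, 0.985),
    "Presentation Mode": (2.32, 1, 7, 0.249),
    "Stimulus x Environment": (9.99, 4, 28, 0.588),
}

recomputed = {
    name: partial_eta_squared(f, df1, df2)
    for name, (f, df1, df2, _) in reported.items()
}
```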

3.4. Spatial Audio Quality Items

Figure 5 shows the results for headphone (Figure 5 left) and for loudspeaker (Figure 5 right) presentation. Each row in both main panels represents one spatial audio quality item and each column an environment. The horizontal line in each plot marks a score of 50, corresponding to “no perceptual difference between the test signal and the reference.” Circles represent individual results and the box plots show the median, interquartile, and the overall ranges. The color coding follows the convention of Figure 4: green = Measured (headphones only), red = RAZR (reference), blue = ISM, black = RAZR-Simple, teal = RAZR-1st-Order. Asterisks indicate significant pairwise differences from the reference (RAZR) within each environment, obtained from post-hoc comparisons following a three-way rm-ANOVA [Environment (Living Room, Pub, Underground Station) × Presentation Mode (Headphones, Loudspeaker) × Rendering (RAZR, ISM, RAZR-Simple, RAZR-1st-Order)] performed for each spatial audio quality attribute. For the factor Rendering, post-hoc comparisons were performed only relative to RAZR, as it served as the common reference across presentation modes. Additional two-way rm-ANOVAs [Environment (Living Room, Pub, Underground Station) × Rendering (Measured, RAZR, ISM, RAZR-Simple, RAZR-1st-Order)] for the headphone data (Figure 5 left) revealed no significant differences between the Measured condition and the RAZR reference for any of the sound-quality items.

A first inspection of Figure 5 shows that large perceptual differences between renderings occurred only occasionally. Specifically, Figure 5 shows that ratings for RAZR and RAZR-1st-Order remain tightly clustered around the “no difference” line at 50 across most items and environments, whereas larger deviations are primarily associated with ISM (blue) and, to a lesser extent, RAZR-Simple (black). When present, these differences were most pronounced in the Living Room. This pattern is visible across multiple rows in the left column of Figure 5, where ISM consistently shows larger deviations from the reference than in the Pub or Underground Station. The data also show considerable inter-individual variability, resulting in relatively few statistically significant differences overall, as reflected by the sparse asterisks in each panel. No major differences could be found between the two Presentation Modes, Headphones and Loudspeaker. This similarity is reflected in Figure 5 by comparable distributions and medians between the left (headphones) and right (loudspeakers) panels for most items and renderings. For headphones (Figure 5 left), Measured (green) showed small deviations from the RAZR reference (red); however, these differences often occurred in both positive and negative directions across participants and environments, resulting in mean ratings that remained centered around the reference, between 40 and 60. Accordingly, Figure 5 left shows Measured ratings scattered symmetrically around the reference line, rather than a consistent shift in a single perceptual direction. Consequently, no significant differences between Measured and RAZR were observed, except for Source Width in the Underground Station. For tone color, a small but consistent (yet non-significant) difference is visible between Measured and RAZR for all environments. This can be observed in Figure 5 left as a slight upward shift of the green distributions relative to the red reference across environments, without exceeding the variability observed across listeners.

The following description summarizes the significant main effects, first-order interactions, and post-hoc results obtained from the three-way rm-ANOVAs:

Effect of Environment. The three-way rm-ANOVAs revealed significant main effects of Environment for the items Distance (η² = 0.69), Reverberation (η² = 0.81), Source Width (η² = 0.67), Tone Color (η² = 0.40), and Metallic Tone Color (η² = 0.40). In general, perceived differences from the reference in Distance and Reverberation were rated higher in the Living Room than in the Pub and Underground Station.

For Reverberation, significant interactions were observed with both Presentation Mode and Rendering. A significant Environment × Presentation Mode interaction showed that, for headphone presentation, perceived differences from the reference were significantly larger in the Living Room than in the Underground Station, whereas for loudspeaker presentation, the Living Room differed from both the Pub and the Underground Station. This indicates that environment-dependent deviations in perceived reverberation from the reference were more pronounced under loudspeaker playback.

Furthermore, a significant Environment × Rendering interaction indicated that these Environment-dependent variations occurred primarily for the ISM rendering, whereas the other renderings showed more consistent ratings across environments. This pattern suggests that the reverberant character reproduced by ISM was particularly sensitive to the acoustic context, resulting in stronger perceived deviations from the reference in the Living Room.

Visually, this is reflected in Figure 5 by larger ISM deviations in the Living Room than in the other environments, especially for reverberation-related items. Several other higher-order interactions between Environment and other factors (e.g., Rendering) reached statistical significance; however, post-hoc comparisons revealed no systematic or interpretable trends beyond the main effects. Occasional differences appeared for individual renderings within single environments (e.g., ISM or RAZR-Simple in the Living Room), but these effects were inconsistent across items and did not alter the overall pattern of results. Therefore, only the main effects and interactions that revealed coherent, perceptually meaningful differences are reported in detail.

Effect of Rendering. The three-way rm-ANOVAs revealed significant main effects of Rendering for the items Distance (η² = 0.55), Metallic Tone Color (η² = 0.43), and Reverberation (η² = 0.43), indicating that the perceived deviation from the reference depended on the rendering method. In general, ISM and RAZR-Simple led to the largest perceptual deviations from the reference (i.e., ratings that were shifted furthest away from the “no difference” point at 50), whereas RAZR and RAZR-1st-Order produced ratings closest to 50 and were therefore most similar to the reference.

For Distance, post-hoc comparisons showed a significant difference only between RAZR-Simple and RAZR-1st-Order. No other pairwise comparisons between renderings reached significance. As seen in the first column, first row of Figure 5 left, this effect was mainly driven by the Living Room condition, where RAZR-Simple produced larger deviations from the reference than RAZR-1st-Order. In the other environments, the two renderings were more similar. No consistent differences were observed between the remaining renderings.

For Metallic Tone Color, post-hoc tests revealed significant differences between RAZR and RAZR-Simple, and between RAZR-Simple and RAZR-1st-Order. No other rendering pairs differed significantly. In the Living Room (Figure 5 left, fourth row, first column), RAZR-Simple produced a perceptual shift in metallic coloration relative to both the full RAZR and the RAZR-1st-Order conditions, suggesting that this attribute was particularly sensitive to the simplified coupled-room configuration used in RAZR-Simple.

For Reverberation, although the main effect of Rendering was statistically significant, none of the pairwise post-hoc comparisons between renderings reached significance. Thus, the overall effect of Rendering on perceived reverberation differences from the reference was small and not attributable to a specific pair of renderings. Overall, Figure 5 reveals a clear hierarchy of rendering performance: RAZR and RAZR-1st-Order remain closest to the reference across most items, RAZR-Simple shows moderate deviations depending on environment, and ISM produces the largest and most systematic perceptual differences. Several significant interaction effects were observed between the experimental factors, most prominently involving Rendering and Environment.

For the items Distance, Metallic Tone Color, and Reverberation, the Rendering × Environment interactions followed a consistent pattern. Significant rendering differences occurred mainly in the Living Room and were driven by the ISM and RAZR-Simple renderings, both of which were rated as perceptually more different from the reference than the other renderings. The remaining renderings showed more stable ratings across environments and remained perceptually closer to the reference.

Two notable exceptions were observed. For Source Width, significant rendering differences appeared only in the Underground Station, where ISM again differed from all other renderings. For Envelopment by Reverberation, a Rendering × Presentation Mode interaction indicated that significant differences between RAZR and ISM occurred only under Loudspeaker presentation, but not with Headphones. Overall, these findings indicate that rendering-related perceptual deviations were most pronounced for ISM and, to a lesser extent, RAZR-Simple, particularly in the Living Room and Underground Station.

Effect of Presentation Mode. A significant main effect of Presentation Mode was found for Envelopment by Reverberation (η² = 0.46). Ratings were, on average, about seven points lower under Loudspeaker presentation than under Headphone presentation, indicating a reduced sense of envelopment when reproduced over loudspeakers.

A significant Rendering × Presentation Mode interaction was also observed for Envelopment by Reverberation. Post-hoc comparisons showed that this difference between presentation modes occurred only for the ISM rendering, whereas all other renderings yielded comparable ratings for headphones and loudspeakers. This interaction is visible in Figure 5 as a downward shift of ISM ratings for Envelopment under loudspeaker presentation compared to headphones. In addition, a significant Environment × Presentation Mode interaction was found for the item Reverberation. Here, the Pub environment showed a presentation-dependent difference: under Headphone presentation, reverberation was rated slightly lower (−4.2 points relative to the reference), whereas under Loudspeaker presentation it was rated higher (+6.7 points). No such differences between presentation modes were observed in the other environments.

Overall, these findings suggest that the influence of presentation mode on the spatial audio quality ratings was limited and primarily affected perceived Envelopment, particularly for the ISM rendering and in the Pub environment.

3.5. Externalization

Figure 6 shows the mean externalization ratings (representing a re-plot of the raw data from [42]) for each Rendering condition, separated by Presentation Mode (upper two rows, Headphones; lower two rows, Loudspeakers) and by Stimulus (Speech, Pulse). The color coding follows the convention of the previous figures: green = Measured (headphones only), red = RAZR, blue = ISM, and yellow = Diotic. The circles represent individual results, and the box plots show the median, the interquartile range, and the overall range. The average test–retest correlation was 0.84, and results were pooled across repetitions. Asterisks indicate significant pairwise differences between renderings within each Environment, Stimulus, and Presentation Mode, obtained from post-hoc comparisons following a four-way rm-ANOVA [Stimulus (Speech, Pulse) × Environment (Living Room, Pub, Underground Station) × Presentation Mode (Headphones, Loudspeaker) × Rendering (RAZR, ISM, Diotic)]. An additional two-way rm-ANOVA [Environment (Living Room, Pub, Underground Station) × Rendering (Measured, RAZR, ISM, Diotic)] was conducted in order to analyze differences between Measured and the simulations for the headphone data.

Across both Stimuli, RAZR and Measured produced the highest externalization ratings, followed by ISM. This pattern is consistently visible in Figure 6 as higher median ratings for RAZR and Measured across all environments. Under headphone presentation, Diotic was the least externalized rendering, while it was among the most externalized renderings under loudspeaker presentation. This pronounced reversal between presentation modes is clearly visible in Figure 6 for both stimuli across all environments. For Speech, externalization was generally weak under headphone presentation across all renderings, suggesting that listeners found speech particularly prone to in-head localization in this playback condition, with the Diotic rendering yielding the poorest externalization overall, as expected.

In the following, only main results and first-order interactions from the four-way ANOVA are reported. The three-way ANOVA did not reveal any additional relevant findings. The four-way rm-ANOVA revealed significant main effects of Presentation Mode [F(1, 7) = 9.39; p = 0.018; η² = 0.57] and Rendering [F(2, 14) = 8.04; p = 0.005; η² = 0.54]. Overall, sounds were rated as more externalized under loudspeaker presentation than under headphone presentation, indicating that externalization benefited from the additional spatial cues available during loudspeaker playback.
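The relation between the reported F statistics and effect sizes can be illustrated with a short calculation. This is a minimal sketch that assumes the reported η² values are partial eta squared, which the text does not state explicitly; the function name is illustrative and not taken from the study's analysis scripts.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F values and degrees of freedom as reported for the four-way rm-ANOVA.
presentation = partial_eta_squared(9.39, 1, 7)   # Presentation Mode
rendering = partial_eta_squared(8.04, 2, 14)     # Rendering

print(f"Presentation Mode: {presentation:.2f}")
print(f"Rendering: {rendering:.2f}")
```

Under this assumption, the Presentation Mode value reproduces the reported 0.57, and the Rendering value comes out at about 0.53, close to the reported 0.54. The small gap may be due to rounding of the reported F value or to a different effect-size variant.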

Post-hoc comparisons confirmed that RAZR was rated significantly more externalized than ISM (p < 0.05), whereas the difference between RAZR and Diotic did not reach significance. This result is reflected in Figure 6 by the consistently lower median externalization ratings for ISM compared to RAZR across environments.

A significant Environment × Rendering interaction in the four-way rm-ANOVA was observed. In the Living Room, ISM was rated significantly less externalized than RAZR, while in the Pub and Underground Station, no significant differences between renderings were found. Visually, this interaction manifests as larger separation between ISM and RAZR in the Living Room panels compared to the other environments. This pattern suggests that the rendering-dependent differences in perceived externalization were most pronounced in less reverberant acoustic contexts. A significant Presentation Mode × Rendering interaction was also found in the four-way rm-ANOVA. Post-hoc tests revealed that this interaction was driven solely by the Diotic condition: for headphone presentation, Diotic signals were rated strongly internalized (mean = 18.8), whereas for loudspeaker presentation, the same condition yielded substantially higher externalization ratings (mean = 63.1; md = 44.4, p < 0.001). This strong interaction is directly observable in Figure 6 as a marked upward shift of Diotic ratings from the headphone to the loudspeaker panels. For RAZR and ISM, no significant differences between headphone and loudspeaker presentation were observed.

Taken together, these results show that perceived externalization was strongly influenced by both presentation mode and rendering method. Externalization was consistently highest for RAZR, lowest for Diotic, and moderately reduced for ISM, with the most pronounced rendering effects occurring under headphone presentation and in the Living Room environment.

4. Discussion

4.1. Role of ALOD Across Different Perceptual Measures and Stimuli

The perceptual influence of the ALOD varied considerably across perceptual measures, Environments, Stimulus types, and Rendering conditions. For the speech stimulus, plausibility judgments were relatively robust to reductions in ALOD as long as essential acoustic cues such as the overall reverberation time and the basic pattern of early reflections were maintained. In both the Living Room and Underground Station, even the simplest ISM rendering achieved plausibility ratings close to those obtained with the measured BRIRs. This suggests that for speech, which contains limited spectral content and temporal modulations lacking strong transients, perceptual plausibility does not rely on a highly detailed representation of the reverberant sound field as long as the general acoustic characteristics of the environment are preserved.

SI exhibited a similar pattern of robustness. Across all Environments, SRTs remained stable for most Renderings and showed no significant degradation compared to the measured reference. This aligns with established evidence that the coarse pattern of early-arriving reflections and the direct-to-reverberant energy ratio, both of which are maintained in all renderings, are dominant factors affecting speech perception in rooms [52]. Only in the Underground Station did ISM lead to significantly higher SRTs than Measured. This is likely related to an unrealistically sparse pattern of reflections obtained in the shoebox model of the room, in connection with an overly slow increase in echo density. This property of the ISM has been shown for a large room volume in Ewert et al. [18] using the normalized echo density (NED; [74]), demonstrating that a slow buildup of echo density reduces the temporal and spatial continuity of the reverberant field, which likely impairs SI in this acoustically complex space.
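The idea behind the normalized echo density can be illustrated with a minimal sketch. It uses a common formulation: the fraction of samples in a sliding window that exceed the window's standard deviation, divided by the fraction expected for Gaussian noise. It assumes a rectangular window for simplicity, whereas published definitions typically use a tapered weighting. The function name, window length, and test signals are hypothetical and do not reproduce the analysis in [18] or [74].

```python
import math
import random


def normalized_echo_density(h, win_len):
    """Sliding-window normalized echo density using a rectangular window."""
    # Fraction of samples outside one standard deviation for Gaussian noise.
    expected = math.erfc(1.0 / math.sqrt(2.0))
    half = win_len // 2
    ned = []
    for n in range(len(h)):
        frame = h[max(0, n - half): n + half + 1]
        sigma = math.sqrt(sum(x * x for x in frame) / len(frame))
        if sigma == 0.0:
            ned.append(0.0)
            continue
        outliers = sum(1 for x in frame if abs(x) > sigma)
        ned.append(outliers / len(frame) / expected)
    return ned


random.seed(0)
# Sparse, regular reflections (one spike every 100 samples) versus dense noise.
sparse = [1.0 if i % 100 == 0 else 0.0 for i in range(1000)]
dense = [random.gauss(0.0, 1.0) for _ in range(1000)]

ned_sparse = normalized_echo_density(sparse, 201)
ned_dense = normalized_echo_density(dense, 201)
mean_dense = sum(ned_dense) / len(ned_dense)
```

In this toy example, the sparse reflection pattern stays far below 1, while the noise-like response fluctuates around 1, the value associated with a fully diffuse reverberant tail.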

The sensitivity to ALOD increased markedly for the Pulse and Bass stimuli, where transient onsets and broadband spectral content accentuated differences between Renderings. Plausibility ratings dropped sharply for ISM in all Environments, and Overall Difference ratings showed the largest deviations from the reference for ISM and RAZR-Simple. These findings demonstrate that renderings with reduced ALOD, such as fewer early reflections, missing scattering effects, or overly regular reflection patterns, fail to reproduce the fine temporal and spectral cues that are perceptually relevant for transient or broadband stimuli.

For the spatial audio quality items, which were evaluated using only the “most critical” Pulse stimulus, the effects of reduced ALOD were particularly evident in the Living Room: Both ISM and RAZR-Simple were perceived as significantly different from the reference, particularly for Distance, Reverberation, and Metallic Tone Color. These deviations point to inaccuracies in the representation of coupled-room effects and the spatial distribution of early reflections. In RAZR-Simple, the simplified coupled-room configuration did not yield a perceptually equivalent reproduction of the reference, underlining that modeling simplifications in geometrically complex Environments directly impair perceived spatial audio quality. Although the Overall Difference results in Figure 4 revealed significant perceptual deviations between Measured and RAZR, these differences were not mirrored in the ratings of the individual spatial audio quality attributes. This suggests that overall difference may reflect a composite percept that integrates several of the tested spatial audio quality attributes, even when their individual contributions did not reach significance (such as spectral coloration) or when they were rated differently but inconsistently. Alternatively, the underlying spatial audio quality differences may not have been fully captured by the seven attributes selected here. Future studies could address this by expanding the attribute set or by applying data-driven analyses, such as multidimensional scaling, to identify the dominant perceptual dimensions contributing to global judgments of simulation fidelity.

The influence of ALOD reductions in RAZR-Simple was particularly evident in the Pub, where Overall Difference ratings showed larger deviations from the reference. These deviations can be attributed to the omission of two nearby reflecting surfaces, a chalkboard and a table, resulting in a perceptible change in spatial impression. Listeners reported a missing lateral reflection from approximately 30–40° to the left of the source direction, where the chalkboard is positioned. This omission was particularly salient under loudspeaker presentation, where natural binaural cues and head movements enhanced spatial resolution. The higher Overall Difference ratings for RAZR-Simple under loudspeaker playback thus likely reflect the absence of these distinct early reflections.

In contrast, for the Underground Station, RAZR-Simple differed from RAZR through the omission of the dual-slope decay in the late reverberation tail. Despite this physical simplification, no perceptual differences were found between the two renderings. Analysis of the normalized energy decay curves shows that the secondary slope began only at a very low energy level (below roughly −40 dB), where its contribution to perceived reverberant character is minimal. Consequently, both versions produced nearly identical decay behavior within the time-intensity region most relevant to perception. This explains why the absence of the dual-slope tail did not lead to perceptual degradation, even though the underlying acoustic model was simplified.
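The argument about the dual-slope decay can be illustrated with synthetic energy decay curves obtained by Schroeder backward integration. This is a minimal sketch under assumed values: the reverberation times (1 s and 3 s), the −45 dB onset level of the second slope, and all function names are hypothetical and do not correspond to the measured Underground Station data.

```python
import math


def decay_envelope(t60, level_db, fs, duration):
    """Squared-amplitude envelope of an exponential decay starting at level_db."""
    n = int(fs * duration)
    return [10.0 ** ((level_db - 60.0 * i / (fs * t60)) / 10.0) for i in range(n)]


def schroeder_edc_db(energy):
    """Backward-integrated energy decay curve in dB, normalized to 0 dB at t = 0."""
    edc = []
    running = 0.0
    for e in reversed(energy):
        running += e
        edc.append(running)
    edc.reverse()
    total = edc[0]
    return [10.0 * math.log10(v / total) for v in edc]


fs, duration = 1000, 4.0                          # envelope sampling rate in Hz, length in s
single = decay_envelope(1.0, 0.0, fs, duration)   # single-slope decay (assumed T60 = 1 s)
late = decay_envelope(3.0, -45.0, fs, duration)   # weak second slope (assumed T60 = 3 s)
dual = [a + b for a, b in zip(single, late)]

edc_single = schroeder_edc_db(single)
edc_dual = schroeder_edc_db(dual)

# Largest deviation between both curves within the first 40 dB of decay.
max_dev_top40 = max(abs(d - s) for s, d in zip(edc_single, edc_dual) if s > -40.0)
print(f"Maximum deviation above -40 dB: {max_dev_top40:.2f} dB")
```

With these assumed parameters, the two curves differ by only a fraction of a decibel within the first 40 dB of decay and diverge clearly only at lower levels, which mirrors the reasoning given for the absence of a perceptual difference.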

Externalization results provided additional evidence that variations in the ALOD influence spatial aspects of auditory perception. RAZR produced the highest externalization ratings, followed by ISM, while Diotic signals were perceived as strongly internalized under headphone presentation. The reduced externalization for ISM suggests that missing spatial detail and inaccurate room-related cues limit the perception of external auditory space. Similar tendencies have been reported in studies linking spatial externalization to the precision of binaural and room-related cues [49]. Interestingly, under loudspeaker presentation and across all three environments, the Diotic condition was in most cases rated as more externalized than the simulated renderings, with the exception of the Pub for the Pulse stimulus. This finding may be explained by the loudspeaker presentation setup. In the Diotic condition, a single loudspeaker positioned 2.4 m in front of the listener was used. For some participants, this clear frontal localization may have been perceived as more external than the more spatially diffuse sound fields generated by the multichannel renderings, which distribute acoustic energy across multiple loudspeakers from different directions. Such increased spatial diffuseness can reduce externalization by weakening the consistency between auditory and spatial cues [57,75].

One important difference between the current ALOD variations and earlier studies such as Martin et al. [43] and Abd Jalil et al. [29] is that the current acoustic model was always based on a “proxy” shoebox simplification of each room, as outlined above. Thus, even with the highest ALOD, the pattern of early reflections was always approximate rather than geometrically exact. Despite this simplification, the RAZR rendering produced plausibility, SI, spatial audio quality, and externalization ratings that closely matched those obtained with the measured reference, confirming that perceptual agreement can be achieved even when using simplified geometries, provided that key temporal and spectral room characteristics are accurately represented (see also the results for RAZR in [21]; simulation B). Nevertheless, the shoebox approximation becomes more critical for lower ALOD conditions, particularly for ISM, where the lack of geometric detail led to sparse and overly regular reflection patterns. Such patterns reduce echo density and temporal diffusion, impairing the continuity of the reverberant field, as discussed above and shown for large-volume spaces in [18]. This highlights that while high-ALOD shoebox simulations can reproduce perceptual realism when properly parameterized, further simplifications risk perceptual degradation due to insufficient spatial and temporal reflection density.

Taken together, these findings indicate that the perceptual importance of ALOD depends strongly on the specific perceptual measure as well as on the stimulus type. Measures associated with speech as a stimulus and speech communication in general, such as plausibility and intelligibility, are comparatively tolerant to rendering simplifications as long as the overall reverberation time, the balance between direct and reverberant energy, and the general distribution of early reflections are preserved. In contrast, broadband and transient stimuli such as the Pulse emphasized differences in reflection timing, coloration, and spatial coherence, leading to stronger perceptual deviations when ALOD was reduced, as also found in [67]. For the Bass stimulus, overall difference ratings were lower across environments and renderings. Across all stimuli, higher ALOD was particularly critical for perceptual dimensions relying on fine spatial or spectral cues, such as spatial audio quality attributes and externalization. The results for RAZR-Simple further demonstrate that reducing geometric detail, particularly in coupled spaces, can lead to noticeable perceptual deviations.

4.2. Role of Environment

The acoustic Environment had a pronounced influence on the sensitivity to changes in ALOD across most perceptual measures. Across perceptual measures, the Living Room consistently produced the largest deviations from the reference, while the Underground Station and the Pub generally yielded smaller or more stable differences. The Living Room was particularly challenging to simulate due to its coupled-room configuration, which introduces complex reflection patterns and modal interactions that are difficult to reproduce accurately with simplified geometric models.

For plausibility and spatial audio quality, both ISM and RAZR-Simple produced perceptually larger deviations from the reference in the Living Room than in the other Environments. The results suggest that the critical coupled-volume geometry amplified perceptual differences between Renderings, particularly for attributes related to reverberation and tonal coloration. This is consistent with earlier studies demonstrating that the perception of spatial fidelity in coupled or irregularly shaped rooms depends strongly on the correct reproduction of reflection density and decay characteristics [23,49].

In contrast, the Pub and Underground Station exhibited greater perceptual tolerance to simplified modeling for plausibility, spatial audio quality, and externalization. In these Environments, the long or diffuse reverberation provided a masking effect that reduced sensitivity to errors in spatial detail. Interestingly, in the Underground Station, plausibility ratings for ISM remained close to the reference; however, SI was reduced due to the unrealistically sparse reflection pattern and slow buildup of echo density characteristic of the ISM, as discussed above. These deficiencies likely disrupted the temporal continuity of the reverberation and thereby impaired speech cue transmission. Similar masking effects of diffuse late energy on spatial perception have been observed in previous studies [54,76]. Our current SI results for the Underground Station align well with those reported by Hladek and Seeber [64], who measured SRTs at the same position (denoted as R1, pos. 16 in their study). Likewise, Schütze et al. [4] evaluated SI at Position S7 in the real Living Room with the same location for receiver and target source that was used in our study, and reported SRTs comparable to our measured reference. These findings support the validity of our measurements.

The Environment also influenced the results for individual spatial audio quality attributes. Perceived Distance and Reverberation were rated most different from the reference in the Living Room, whereas Source Width and Metallic Tone Color showed smaller variations across Environments. These findings correspond to the significant Environment × Rendering interactions, which were mainly driven by the ISM rendering. The ISM produced particularly large deviations in the Living Room, indicating that its reverberation was more affected by the acoustic complexity of the environment. In contrast, the other renderings yielded more consistent ratings across Environments, suggesting that the perceptual impact of ALOD reductions depends on both the simulated environment and the rendering method.

For externalization, the influence of the Environment was limited overall but became apparent in the Living Room, where ISM was perceived as significantly less externalized than RAZR, as the Environment × Rendering interaction showed. This indicates that in acoustically complex but less diffuse spaces such as the Living Room, listeners were more sensitive to rendering inaccuracies that disrupt binaural or room-related cues, which are known to support externalization [56,75]. In contrast, in the more reverberant Pub and Underground Station, late reflections masked these differences, leading to comparable externalization ratings across renderings.

A closer examination of the Underground Station data revealed that externalization ratings for speech were generally low across all renderings under headphone presentation. This finding can be explained by the room’s specific acoustic characteristics: although it exhibits a long reverberation time, the overall reverberation level is comparatively low. As a result, much of the reverberant energy remains masked by the speech and, therefore, contributes little to externalization [28,55]. In contrast, for the Pulse stimulus, the reverberant decay was always audible, allowing late reflections to enhance spatial impression and perceived externalization. In this condition, the Measured rendering produced slightly higher externalization ratings than RAZR, likely reflecting the additional fine spatial and spectral details contained in the measured room responses.

Taken together, these findings demonstrate that the acoustic properties of the Environment shape how listeners evaluate rendering fidelity. In acoustically simple or diffuse spaces, errors in simulation detail are perceptually less relevant due to reverberant masking. However, in Environments with coupled volumes, strong directional reflections, or complex geometries, perceptual sensitivity to rendering accuracy increases considerably. Consequently, the required level of ALOD depends not only on the perceptual measure but also on the acoustic structure of the Environment being reproduced.

4.3. Role of Presentation Mode

Presentation mode affected spatial perception differently across the evaluated measures. For Overall Difference, headphone and loudspeaker presentations yielded statistically similar ratings, indicating that the perceived global similarity between renderings was largely independent of presentation mode. This suggests that participants were able to form consistent judgments about rendering fidelity even when the spatial information was conveyed through individualized binaural playback rather than through free-field loudspeaker reproduction. In contrast, for measures directly related to spatial hearing, such as externalization, presentation mode had a pronounced effect, with loudspeaker playback generally enhancing spatial impression.

Several spatial audio quality attributes were modestly affected by presentation mode. One notable difference appeared for Envelopment by Reverberation, where ratings for ISM were, on average, lower than for the other Renderings under loudspeaker presentation, which was not observed for headphones. This reduction likely reflects the additional spatial cues available in the other Renderings within the physical playback environment, which may have altered listeners’ perception of the simulated reverberant field [28,55]. For the ISM rendering, this effect was particularly pronounced, suggesting that limited spatial coherence in the simulation was perceptually masked under headphones but revealed more clearly when compared to the real acoustic context of loudspeaker playback. A smaller, environment-specific presentation mode effect also occurred in the Pub, where reverberation was rated slightly weaker under headphones and stronger under loudspeakers, possibly due to differences in the integration of simulated and real spatial cues.

The influence of presentation mode was most apparent for Externalization, which showed a strong and systematic dependence on playback condition. Externalization ratings were generally higher under loudspeaker presentation, reflecting the contribution of naturally occurring head movements and dynamic spatial cues that reinforce external perception [56,77]. Under headphone playback, where such cues were absent, listeners were more susceptible to in-head localization, especially for the Diotic rendering. This rendering produced strongly internalized images during headphone presentation but was perceived as substantially more externalized when played over loudspeakers, confirming that the availability of congruent spatial cues from the playback setup can compensate for missing binaural information. In contrast, RAZR and ISM maintained stable externalization ratings across presentation modes, indicating that their spatial renderings provided sufficient directional consistency to support externalization even under headphone reproduction. Beyond head-movement-related dynamic cues, externalization under headphone reproduction is influenced by several additional factors. While a spectrally smoothed headphone equalization filter was applied in the present study to reduce spectral coloration differences between the renderings, externalization remains sensitive to mismatches between a listener’s individual HRTFs and the non-individualized HRTFs used for binaural rendering. Such mismatches can lead to spectral inconsistencies that weaken distance and externalization cues, particularly under static listening conditions. Previous work has shown that even when headphone transfer functions are equalized, deviations from individual binaural cue patterns can substantially reduce perceived externalization [56,77]. In contrast, loudspeaker playback naturally provides individualized spectral cues and dynamic head-movement information, which likely contributed to the higher externalization ratings observed in this condition.

Taken together, these findings demonstrate that presentation mode primarily affects perceptual dimensions that depend on dynamic or spatially extended cues, such as externalization and envelopment. When accurate room simulation is available, headphone reproduction can yield perceptually equivalent results for most spatial audio quality attributes, but for the perception of spatial extent and realism, loudspeaker playback continues to offer a distinct advantage.

4.4. Comparison to Geometry Decimation and Outlook

Starting from the current highest ALOD, which already includes substantial geometric simplifications by using a proxy rectangular room, the results showed that particularly for speech, further reductions of ALOD were perceptually tolerable—for instance, reducing the number of early reflections or omitting scattering effects, as in the ISM. It should be noted that this lowest tested ALOD (ISM) would actually be more computationally demanding than the highest ALOD (RAZR), again underlining that the present definition of ALOD is not based on geometric or computational complexity but rather on perceptual properties of the room simulation. The inclusion of nearby reflecting surfaces in the Pub environment specifically added geometric detail to the room acoustics simulation, improving over the basic shoebox simplification for early reflections. In the current results, this addition proved perceptually important, as the omission of these reflectors in RAZR-Simple led to clearly detectable spatial differences under loudspeaker presentation.
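The remark on computational demand can be made concrete by counting the image sources of a shoebox room as a function of the maximum reflection order. This is a minimal sketch: each image source of a rectangular room is indexed by an integer triple (i, j, k), with reflection order |i| + |j| + |k|. The orders 3 and 20 used below are hypothetical examples and are not the settings reported for the RAZR or ISM renderings.

```python
def count_image_sources(max_order):
    """Count shoebox image sources up to max_order by enumerating lattice indices."""
    count = 0
    for i in range(-max_order, max_order + 1):
        for j in range(-max_order, max_order + 1):
            for k in range(-max_order, max_order + 1):
                order = abs(i) + abs(j) + abs(k)
                if 1 <= order <= max_order:   # order 0 is the direct sound
                    count += 1
    return count


def closed_form(max_order):
    """Closed-form sum of (4 n^2 + 2) image sources over orders 1..max_order."""
    n = max_order
    return (2 * n * (n + 1) * (2 * n + 1)) // 3 + 2 * n


low_order = count_image_sources(3)    # hypothetical low-order early-reflection stage
high_order = count_image_sources(20)  # hypothetical higher-order ISM
print(low_order, high_order)
```

The count grows roughly with the cube of the maximum order, so a higher-order ISM has to evaluate far more image sources than a low-order early-reflection stage combined with a feedback delay network for the late reverberation.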

Bridging the gap to studies that have investigated geometry decimation, e.g., [29,43], future research should extend the current approach by systematically increasing geometric detail stepwise from the shoebox approximation to assess potential benefits in conditions where the real environment cannot be meaningfully modeled by connected rectangular rooms, such as connected corridors, obstructing structures, or columns. The role of edge diffraction has already been examined in recent studies, e.g., [78], and our current acoustic modeling framework has been extended to cover arbitrary geometry and edge diffraction using computationally efficient filter-based approximations [47,48,79,80].

4.5. Methodological Limitations and Generalizability

A potential methodological limitation of the present study is that all participants had prior experience with psychoacoustic listening tests. While this may restrict direct generalizability to the broader population, listener expertise was considered advantageous for the attribute-based perceptual tasks employed here. The evaluation of sound quality attributes such as spectral coloration, envelopment, or source width requires a consistent interpretation and use of perceptual rating scales, which has been shown to benefit from prior training and listening-test experience [27,28].

Previous research has demonstrated that naïve listeners are often able to detect perceptual differences, but may exhibit increased variability and reduced consistency when assigning ratings to specific auditory attributes or when interpreting scale directions, particularly for abstract spatial or timbral descriptors [81]. Standardized listening-test methodologies therefore commonly recommend the use of trained listeners for analytical quality assessments to improve reliability and reduce response variance [73].

The limited number of statistically significant effects observed for sound quality attributes in the present study despite clearly detectable overall differences, however, may cast some doubt on the level of experience of the listeners. Alternatively, it suggests that these attributes are only weakly affected by the tested reductions in ALOD. This interpretation is further supported by a sensitivity power analysis of the repeated-measures design, which indicated that the available sample size (N = 8) was sufficient to detect medium to large within-subject effects, but not small effect sizes. Accordingly, subtle perceptual differences and higher-order interactions may remain undetected, particularly for attribute-based sound quality ratings that are known to exhibit substantial inter-individual variability. Nevertheless, future work could complement the present findings by including naïve or more heterogeneous listener groups, particularly for preference-based judgments or evaluations of ecological validity in everyday listening contexts.

Participants were informed during the familiarization phase whether example stimuli were based on measured or simulated BRIRs. Importantly, familiarization stimuli were generated for source and receiver positions within the same environments that differed from those used in the actual measurements, such that the familiarization stimuli were acoustically distinct from the test stimuli. While this procedure may have introduced expectations during familiarization, the use of different spatial configurations limits the possibility that participants learned fixed perceptual cues associated with the measurement condition.

A further limitation of the present study is that all simulations were generated using a single room-acoustic simulation framework, namely RAZR. While this allowed for a controlled and systematic manipulation of ALOD within a consistent modeling environment, the extent to which the present findings generalize to other room-acoustic simulation approaches remains an open question. While different algorithms, such as ray-tracing methods, or hybrid and wave-based models, may differ in how specific modeling simplifications affect perceptual outcomes [29,36,46], the present results may be considered representative for image source methods. The comparison to the “straightforward” higher-order ISM and the systematic reduction of image source order in RAZR underlines the importance of correctly accounting for the late reverberant tail, an observation which should generalize beyond the specific implementation used here.

In addition, it should be noted that even the highest ALOD condition investigated in the present study was based on a simplified proxy shoebox geometry. This approximation has been shown to provide perceptually plausible results for a range of indoor environments: The underlying RAZR framework has previously been evaluated in a broader context, including geometrically complex real-room scenarios, for example in the room-acoustical simulation and auralization round robin reported by Brinkmann et al. [21]. Nevertheless, the applicability of the specific ALOD manipulations investigated here to geometrically more complex spaces, such as environments with irregular layouts, obstructing structures, columns, or strongly non-rectangular geometries, remains an open question. Particularly for curved surfaces and vaulted spaces, e.g., [82,83], the underlying proxy shoebox approach is not able to cover sound focusing and spreading. In such cases, additional geometric detail or alternative (e.g., wave-based) modeling may be required to preserve spatial cues related to early reflections, diffraction, and directional energy flow. For low frequencies, wave-based approaches are expected to show better agreement with measurements, and hybrid wave/ray-based approaches [82,84] have shown good agreement between measurement and simulation for the entire audio frequency range, with modeling detail depending on simulation paradigm and frequency range. Future work should replicate the present experimental paradigm using different simulation engines to assess the robustness of the observed trends across modeling approaches and to identify algorithm-specific sensitivities.

Finally, head tracking was not employed during the headphone-based experiments, and all binaural renderings were presented under static listening conditions. As a result, the present findings primarily apply to scenarios without dynamic listener–environment interaction, such as conventional headphone listening or VR/AR applications with limited or unavailable head tracking. Dynamic head movements are known to provide additional spatial cues that enhance externalization and spatial perception. Consequently, the perceptual effects of ALOD reductions observed in the present study may be altered in systems employing accurate real-time head tracking.

5. Conclusions

Plausibility, SI, spatial audio quality, and externalization were investigated for different acoustic levels of detail of the room acoustics simulation in three real-world environments with substantially different sizes and room acoustic properties. Headphone and loudspeaker reproduction were tested, including a comparison to recordings from the real environments for headphones. It was demonstrated that highly plausible simulations with externalization comparable to dummy head recordings can be achieved with considerable simplifications in the acoustic simulation: even the most detailed ALOD applied in this study utilized a “shoebox” approximation of room geometry for early reflections. However, scattering effects were incorporated, and diffuse late reverberation was modeled using a physics-based feedback delay network. Overall, the results indicate that ALOD should be chosen in relation to the intended application: more reduced models are suitable for communication-oriented simulations using speech stimuli, whereas broadband, transient stimuli require higher model fidelity. In detail:

SI remained largely stable across all tested ALOD levels, including the highly simplified shoebox image source model. For the large volume of the Underground Station, overly sparse early reflections and an overly slow rise of echo density in the image source model are a possible source of the observed deviations in SI.
Speech stimuli proved relatively robust, producing only small perceptual differences between renderings with different ALOD, whereas the broadband, transient pulse stimulus revealed pronounced perceptual differences for simplified simulations. Plausibility and overall difference ratings were particularly sensitive to ALOD reductions for the pulse.
No clear-cut contribution of specific spatial audio quality attributes to the observed overall difference between simulated and measured environments was found. There was a non-significant trend suggesting that coloration differences, in connection with perceived differences in reverberation, distance, and metallic tone color, might be the most influential attributes.
While overall difference and spatial audio quality ratings were largely comparable for (static) headphone and loudspeaker array reproduction, externalization was clearly improved for loudspeaker reproduction, reflecting the contribution of dynamic spatial cues caused by natural head movements and the listener’s own head-related transfer function to the perception of external space. For headphones, simulations with the highest ALOD reached levels of externalization comparable to those observed for the dummy head recordings in the real room.
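
The echo-density argument made above for the Underground Station can be illustrated with the classical geometrical-acoustics estimate for a rectangular room, in which the number of reflections arriving per second grows as 4πc³t²/V. The following minimal Python sketch is an illustration only and not part of the analysis of this study: it uses the primary volumes from Table 1, an assumed speed of sound of 343 m/s, and a hypothetical target density of 1000 reflections per second that was chosen purely for comparison. The function names are likewise illustrative.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value (not taken from the study)

# Primary volumes from Table 1 (coupled volumes are ignored in this sketch).
PRIMARY_VOLUMES_M3 = {
    "Living Room": 50.9,
    "Pub": 442.0,
    "Underground Station": 11000.0,
}


def reflection_density(t_s, volume_m3, c=SPEED_OF_SOUND):
    """Reflections per second at time t_s in an ideal rectangular room."""
    return 4.0 * math.pi * c**3 * t_s**2 / volume_m3


def time_to_density(target_per_s, volume_m3, c=SPEED_OF_SOUND):
    """Time (s) after which the reflection density reaches target_per_s."""
    return math.sqrt(target_per_s * volume_m3 / (4.0 * math.pi * c**3))


if __name__ == "__main__":
    target = 1000.0  # hypothetical comparison threshold, reflections per second
    for name, volume in PRIMARY_VOLUMES_M3.items():
        t_ms = 1000.0 * time_to_density(target, volume)
        print(f"{name}: {t_ms:.1f} ms to reach {target:.0f} reflections/s")
```

Under these assumptions, the time needed to reach a given reflection density scales with the square root of the room volume, so the Underground Station takes roughly 15 times longer than the Living Room to reach the same density.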

Author Contributions

Conceptualization, S.F. and S.D.E.; methodology, S.F. and S.D.E.; software, S.F.; validation, S.F.; formal analysis, S.F.; investigation, S.F. and S.D.E.; resources, S.D.E.; data curation, S.F.; writing—original draft preparation, S.F. and S.D.E.; writing—review and editing, S.F., S.D.E., S.v.d.P. and B.U.S.; visualization, S.F.; supervision, S.D.E. and S.v.d.P.; project administration, S.D.E. and B.U.S.; funding acquisition, S.D.E. and B.U.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—Project-ID 352015383—SFB 1330 C5.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent for participation was obtained from all subjects involved in the study.

Data Availability Statement

Data are available upon request.

Acknowledgments

The authors thank Christoph Kirsch for his support with the room acoustics simulations. During the preparation of this manuscript, the authors used OpenAI ChatGPT 4o and 5.2 for language editing. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

ALOD	Acoustic Level of Detail
AR	Augmented Reality
VR	Virtual Reality
LOD	Level of Detail
SI	Speech Intelligibility
BRIR	Binaural Room Impulse Response
CATT	Computer Aided Theater Technique
RAZR	Room Acoustics Simulator
HRTF	Head-Related Transfer Function
ISM	Image Source Model
VBAP	Vector Base Amplitude Panning
OLSA	Oldenburger Satztest
ISTS	International Speech Test Signal
SRT	Speech Recognition Threshold
MUSHRA	Multi Stimulus Test with Hidden Reference and Anchor
SAQI	Spatial Audio Quality Inventory
rm-ANOVA	Repeated Measures Analysis of Variance

Appendix A

Table A1. Results of repeated-measures ANOVAs for all perceptual measures. The left panel includes headphone and loudspeaker presentations (excluding the Measured condition), and the right panel shows headphone data including Measured. Reported are degrees of freedom (df_M, df_R), F-values, p-values, and effect sizes (η²) for significant main effects and interactions.

ANOVA Data with Loudspeaker and Headphone Presentation Without Measured						ANOVA Data with Headphone Presentation Including Measured
Factor	df_M	df_R	F	p	η²	Factor	df_M	df_R	F	p	η²
Distance						Distance
Environment	2	14	15.43	<0.001	0.69
Rendering	1.12	7.81	8.61	0.018	0.55	Rendering	1.625	11.373	7.528	<0.001	0.518
Environment × Rendering	2.19	15.33	38.54	<0.001	0.85	Environment × Rendering	8	56	10.511	<0.001	0.6
Envelopment						Envelopment
Pmode	1	7	13.17	0.008	0.65
Pmode × Rendering	1.02	7.16	8.56	0.021	0.55
Environment × Rendering	1.81	12.64	4.2	0.043	0.38
Metallic Tone Color						Metallic Tone Color
Environment	2	14	4.74	0.027	0.40	Environment	2	14	11.27	0.001	0.62
Rendering	1.33	9.28	5.27	0.039	0.43	Environment × Rendering	2.27	15.9	25.56	<0.001	0.79
Environment × Rendering	1.72	12.01	47.08	<0.001	0.87
Reverberation						Reverberation
Environment	2	14	29.45	<0.001	0.81	Environment	2	14	17.923	<0.001	0.72
Rendering	1.32	9.23	5.35	0.038	0.43	Rendering	1.52	10.63	9.81	0.006	0.58
Pmode × Environment	2	14	6.6	0.01	0.49
Pmode × Rendering	1.16	8.1	6.93	0.027	0.5
Environment × Rendering	1.81	12.66	25.84	<0.001	0.79	Environment × Rendering	2.13	14.9	8.75	0.003	0.56
Tone Color						Tone Color
Environment	2	14	4.73	0.027	0.40
Source Width						Source Width
Environment	2	14	13.9	<0.001	0.67
Pmode × Environment	2	14	4.62	0.012	0.40
Environment × Rendering	1.944	13.61	29.13	<0.001	0.81
Externalization						Externalization
Pmode	1	7	9.39	0.018	0.57	Environment	1.11	7.8	6.012	0.038	0.46
Rendering	2	14	8.04	0.005	0.54	Rendering	3	21	24.943	<0.001	0.78
Environment × Rendering	1.555	10.88	8.11	0.010	0.54	Environment × Rendering	6	42	11.601	<0.001	0.62
Stimulus × Environment	2	14	3.95	0.044	0.36
Pmode × Rendering	2	14	5.3	0.019	0.43
Overall Difference
Stimulus	2	14	60.63	<0.001	0.90
Environment	2	14	237.90	<0.001	0.97
Rendering	1.61	11.23	470.88	<0.01	0.99
Pmode × Stimulus	2	14	7.74	0.005	0.53
Pmode × Environment	2	14	6.51	0.010	0.48
Stimulus × Environment	4	28	9.99	<0.001	0.59
Stimulus × Rendering	6	42	118.28	<0.001	0.94
Environment × Rendering	6	42	110.65	<0.001	0.94
Speech Intelligibility
Environment	1.206	8.44	54.081	<0.001	0.89
Rendering	3	21	146.483	<0.001	0.95
Environment × Rendering	2.362	16.54	12.712	<0.001	0.65
Plausibility
Rendering	1.034	7.24	34.08	<0.001	0.83
Stimulus × Rendering	1.219	8.53	28.31	<0.001	0.80
Environment × Rendering	4	28	3.75	0.014	0.35
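
The effect sizes in Table A1 can be cross-checked against the reported F-values and degrees of freedom. The sketch below assumes that the reported η² is a partial eta squared, η²_p = F·df_M / (F·df_M + df_R). This assumption is consistent with the tabulated rows but is not stated in the table caption. The function name and the selection of rows are illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F-value and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (label, df_M, df_R, F, reported effect size), copied from Table A1.
ROWS = [
    ("Distance: Environment", 2, 14, 15.43, 0.69),
    ("Distance: Rendering", 1.12, 7.81, 8.61, 0.55),
    ("Envelopment: Pmode", 1, 7, 13.17, 0.65),
    ("Overall Difference: Stimulus", 2, 14, 60.63, 0.90),
    ("Speech Intelligibility: Rendering", 3, 21, 146.483, 0.95),
]

if __name__ == "__main__":
    for label, df_m, df_r, f_value, reported in ROWS:
        recomputed = partial_eta_squared(f_value, df_m, df_r)
        print(f"{label}: recomputed {recomputed:.3f}, reported {reported:.2f}")
```

For the rows listed here, the recomputed values agree with the reported ones after rounding to two decimals, which supports reading the η² column as a partial effect size.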

References

1. Pedersen, R.L.; Picinali, L.; Kajs, N.; Patou, F. Virtual reality based research in hearing science: A platforming approach. J. Audio Eng. Soc. 2023, 71, 374–389.
2. Fichna, S.; Biberger, T.; Seeber, B.U.; Ewert, S.D. Effect of acoustic scene complexity and visual scene representation on auditory perception in virtual audio-visual environments. In Proceedings of the 2021 Immersive and 3D Audio: From Architecture to Automotive (I3DA), Bologna, Italy, 8–10 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–9.
3. Fichna, S.; Biberger, T.; Seeber, B.U.; Ewert, S.D. Effects of visual representation and scene complexity on speech perception, spatial hearing, and loudness in virtual environments. In Proceedings of the 2025 Immersive and 3D Audio: From Architecture to Automotive (I3DA), Bologna, Italy, 8–10 September 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1–10.
4. Schütze, J.; Kirsch, C.; Kollmeier, B.; Ewert, S.D. Comparison of speech intelligibility in a real and virtual living room using loudspeaker and headphone presentations. Acta Acust. 2025, 9, 6.
5. Chen, C.; Schissler, C.; Garg, S.; Kobernik, P.; Clegg, A.; Calamia, P.; Batra, D.; Robinson, P.W.; Grauman, K. SoundSpaces 2.0: A simulation platform for visual acoustic learning. arXiv 2022, arXiv:2206.08312.
6. Raghuvanshi, N.; Snyder, J. Parametric directional coding for precomputed sound propagation. ACM Trans. Graph. 2018, 37, 108.
7. Jeong, C.H. Design, simulation, and virtual prototyping of room acoustics for building studies. J. Build. Perform. Simul. 2022, 15, 1–15.
8. Savioja, L.; Huopaniemi, J.; Lokki, T.; Väänänen, R. Creating interactive virtual acoustic environments. J. Audio Eng. Soc. 1999, 47, 675–705.
9. Seeber, B.U.; Kerber, S.; Hafter, E.R. A system to simulate and reproduce audio visual environments for spatial hearing research. Hear. Res. 2010, 260, 1–10.
10. Schröder, D.; Vorländer, M. RAVEN: A real-time framework for the auralization of interactive virtual environments. In Proceedings of the Forum Acusticum, Aalborg, Denmark, 27 June–1 July 2011; Volume 27.
11. Wendt, T.; van de Par, S.; Ewert, S.D. A computationally-efficient and perceptually-plausible algorithm for binaural room impulse response simulation. J. Audio Eng. Soc. 2014, 62, 748–766.
12. Hafter, E.; Seeber, B. The simulated open field environment for auditory localization research. In Proceedings of the 18th International Congress on Acoustics (ICA 2004), Kyoto, Japan, 4–9 April 2004.
13. Kirsch, C.; Wendt, T.; van de Par, S.; Hu, H.; Ewert, S.D. Computationally-efficient simulation of late reverberation for inhomogeneous boundary conditions and coupled rooms. J. Audio Eng. Soc. 2023, 71, 186–201.
14. Scerbo, M.; Savioja, L.; De Sena, E. Room acoustic rendering networks with control of scattering and early reflections. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 3745–3758.
15. Allen, J.B.; Berkley, D.A. Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am. 1979, 65, 943–950.
16. Krokstad, A.; Strom, S.; Sørsdal, S. Calculating the acoustical room response by the use of a ray tracing technique. J. Sound Vib. 1968, 8, 118–125.
17. Jot, J.-M.; Chaigne, A. Digital Delay Networks for Designing Artificial Reverberators. Paris, France, 1 February 1991; p. 3030. Available online: https://secure.aes.org/forum/pubs/conventions/?elib=5663 (accessed on 22 January 2026).
18. Ewert, S.D.; Gößling, N.; Buttler, O.; van de Par, S.; Hu, H. Computationally-efficient rendering of diffuse reflections for geometrical acoustics based room simulation. Acta Acust. 2025, 9, 9.
19. De Sena, E.; Hacihabiboglu, H.; Cvetkovic, Z.; Smith, J.O. Efficient synthesis of room acoustics via scattering delay networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 1478–1492.
20. Scerbo, M.; Schlecht, S.J.; Ali, R.; Savioja, L.; De Sena, E. Modeling nonuniform energy decay through the modal decomposition of acoustic radiance transfer (MoD-ART). IEEE Trans. Audio Speech Lang. Process. 2025, 33, 3363–3376.
21. Brinkmann, F.; Aspöck, L.; Ackermann, D.; Lepa, S.; Vorländer, M.; Weinzierl, S. A round robin on room acoustical simulation and auralization. J. Acoust. Soc. Am. 2019, 145, 2746–2760.
22. Stärz, F.; van de Par, S.; Roßkopf, S.; Kroczek, L.O.H.; Mühlberger, A.; Blau, M. Comparison of binaural auralisations to a real loudspeaker in an audiovisual virtual classroom scenario: Effect of room acoustic simulation, HRTF dataset, and head-mounted display on room acoustic perception. Acta Acust. 2025, 9, 31.
23. Lindau, A.; Weinzierl, S. Assessing the plausibility of virtual acoustic environments. Acta Acust. United Acust. 2012, 98, 804–810.
24. Houtgast, T.; Steeneken, H.J.; Plomp, R. Predicting speech intelligibility in rooms from the modulation transfer function. I. General room acoustics. Acta Acust. United Acust. 1980, 46, 60–72.
25. Plomp, R.; Mimpen, A.M. Improving the reliability of testing the speech reception threshold for sentences. Audiology 1979, 18, 43–52.
26. Hagerman, B. Sentences for testing speech intelligibility in noise. Scand. Audiol. 1982, 11, 79–87.
27. Toole, F.E. Sound Reproduction: The Acoustics and Psychoacoustics of Loudspeakers and Rooms, 2nd ed.; Focal Press: Oxford, UK, 2008.
28. Bech, S.; Zacharov, N. Perceptual Audio Evaluation–Theory, Method and Application; John Wiley & Sons: Chichester, UK, 2006.
29. Jalil, N.A.A.; Din, N.C.; Keumala, N.; Razak, A.S. Effect of model simplification through manual reduction in number of surfaces on room acoustics simulation. J. Des. Built Environ. 2019, 19, 31–41.
30. Blau, M.; Budnik, A.; Fallahi, M.; Steffens, H.; Ewert, S.D.; van de Par, S. Toward realistic binaural auralizations—Perceptual comparison between measurement and simulation-based auralizations and the real room for a classroom scenario. Acta Acust. 2021, 5, 8.
31. Steffens, H.; Schutte, M.; Ewert, S.D. Acoustically driven orientation and navigation in enclosed spaces. J. Acoust. Soc. Am. 2022, 152, 1767–1782.
32. Steffens, H.; Schutte, M.; Ewert, S.D. Auditory orientation and distance estimation of sighted humans using virtual echolocation with artificial and self-generated sounds. JASA Express Lett. 2022, 2, 124403.
33. Axelsson, Ö.; Nilsson, M.E.; Berglund, B. A principal components model of soundscape perception. J. Acoust. Soc. Am. 2010, 128, 2836–2846.
34. Aletta, F.; Kang, J.; Axelsson, Ö. Soundscape descriptors and a conceptual framework for developing predictive soundscape models. Landsc. Urban Plan. 2016, 149, 65–74.
35. ISO 12913-1:2014; Acoustics—Soundscape—Part 1: Definition and Conceptual Framework. ISO: Geneva, Switzerland, 2014.
36. Torresin, S.; Aletta, F.; Oberman, T.; Vinciotti, V.; Albatici, R.; Kang, J. Measuring, representing and analysing indoor soundscapes: A data collection campaign in residential buildings with natural and mechanical ventilation in England. Build. Environ. 2023, 243, 110726.
37. Jeon, J.Y.; Jo, H.I. Effects of audio-visual interactions on soundscape and landscape perception and their influence on satisfaction with the urban environment. Build. Environ. 2020, 169, 106544.
38. Fusaro, G.; Kang, J.; Chang, W.-S. Effective soundscape characterisation of an acoustic metamaterial based window: A comparison between laboratory and online methods. Appl. Acoust. 2022, 193, 108754.
39. Cartwright, M.; Pardo, B.; Bello, J.P. Crowdsourcing perceptual audio evaluation. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 619–623.
40. Woods, K.J.P.; Siegel, M.H.; Traer, J.; McDermott, J.H. Headphone screening to facilitate web-based auditory experiments. Atten. Percept. Psychophys. 2017, 79, 2064–2072.
41. Fichna, S.; Kirsch, C.; Seeber, B.U.; Ewert, S.D. Perceptual evaluation of simulated and real acoustic scenes with different acoustic level of detail. In Proceedings of the 24th International Congress on Acoustics, Gyeongju, Republic of Korea, 24–28 October 2022.
42. Fichna, S.; van de Par, S.; Ewert, S.D. Evaluation of virtual acoustic environments with different acoustic level of detail. In Proceedings of the 2023 Immersive and 3D Audio: From Architecture to Automotive (I3DA), Bologna, Italy, 5–7 September 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6.
43. Martin, V.; Engel, I.; Picinali, L. Effects of geometrical acoustics model simplification on binaural reverberation. IEEE Trans. Audio Speech Lang. Process. 2025, 33, 3480–3493.
44. Dalenbäck, B.I. Whitepaper: What Is Geometrical Acoustics; Tech. Rep.; CATT: Gothenburg, Sweden, 2021; Available online: https://www.catt.se/What_is_Geometrical_Acoustics.pdf (accessed on 22 January 2026).
45. Lindau, A.; Erbes, V.; Lepa, S.; Maempel, H.-J.; Brinkmann, F.; Weinzierl, S. A spatial audio quality inventory (SAQI). Acta Acust. United Acust. 2014, 100, 984–994.
46. van de Par, S.; Ewert, S.D.; Hládek, L.; Kirsch, C.; Schütze, J.; Llorca-Bofí, J.; Grimm, G.; Hendrikse, M.M.; Kollmeier, B.; Seeber, B.U. Auditory-visual scenes for hearing research. Acta Acust. 2022, 6, 55.
47. Kirsch, C.; Ewert, S.D. Filter-based first- and higher-order diffraction modeling for geometrical acoustics. Acta Acust. 2024, 8, 73.
48. Kirsch, C.; Ewert, S.D. Effects of measured and simulated diffraction from a plate on sound source localization. J. Acoust. Soc. Am. 2024, 155, 3118–3131.
49. Lokki, T.; Pätynen, J.; Kuusinen, A.; Tervo, S. Concert halls with good acoustics for symphonic music. Acoust. Aust. 2011, 39, 14–19.
50. Neidhardt, A.; Schneiderwind, C.; Klein, F. Perceptual matching of room acoustics for auditory augmented reality in small rooms—Literature review and theoretical framework. Trends Hear. 2022, 26.
51. Lavandier, M.; Culling, J.F. Prediction of binaural speech intelligibility against noise in rooms. J. Acoust. Soc. Am. 2010, 127, 387–399.
52. Bradley, J.S.; Sato, H.; Picard, M. On the importance of early reflections for speech in rooms. J. Acoust. Soc. Am. 2003, 113, 3233–3244.
53. Srinivasan, N.K.; Stansell, M.; Gallun, F.J. The role of early and late reflections on spatial release from masking: Effects of age and hearing loss. J. Acoust. Soc. Am. 2017, 141, EL185–EL191.
54. Zahorik, P. Assessing auditory distance perception using virtual acoustics. J. Acoust. Soc. Am. 2002, 111, 1832–1846.
55. Rumsey, F. Spatial quality evaluation for reproduced sound: Terminology, meaning, and a scene-based paradigm. J. Audio Eng. Soc. 2002, 50, 651–666. Available online: https://www.aes.org/e-lib/browse.cfm?elib=11073 (accessed on 22 January 2026).
56. Hartmann, W.M.; Wittenberg, A. On the externalization of sound images. J. Acoust. Soc. Am. 1996, 99, 3678–3688.
57. Begault, D.R.; Wenzel, E.M.; Anderson, M.R. Direct comparison of the impact of head tracking, reverberation, and individualized head-related transfer functions on the spatial perception of a virtual speech source. J. Audio Eng. Soc. 2001, 49, 904–916. Available online: https://www.aes.org/e-lib/browse.cfm?elib=10088 (accessed on 22 January 2026).
58. Kreiman, J.; Gerratt, B.R.; Precoda, K. Listener experience and perception of voice quality. J. Speech Lang. Hear. Res. 1990, 33, 103–115.
59. Rathnayake, R.A.; Wanniarachchi, W.K.I.L. Image source method based acoustic simulation for 3-D room environment. Int. J. Sci. Technol. Res. 2019, 8, 222–228. Available online: https://www.ijstr.org/final-print/nov2019/Image-Source-Method-Based-Acoustic-Simulation-For-3-d-Room-Environment.pdf (accessed on 22 January 2026).
60. Scheibler, R.; Bezzam, E.; Dokmanic, I. Pyroomacoustics: A Python package for audio room simulation and array processing algorithms. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 351–355.
61. Delabie, D.; Buyle, C.; Cox, B.; van der Perre, L.; De Strycker, L. An acoustic simulation framework to support indoor positioning and data driven signal processing assessments. In Proceedings of the 2023 31st European Signal Processing Conference (EUSIPCO), Helsinki, Finland, 4–8 September 2023; pp. 261–265.
62. Braren, H.S.; Fels, J. Towards child-appropriate virtual acoustic environments: A database of high-resolution HRTF measurements and 3D-scans of children. Int. J. Environ. Res. Public Health 2021, 19, 324.
63. Grimm, G.; Hendrikse, M.; Hohmann, V. Pub environment [Data set]. Zenodo 2021.
64. Hládek, L.; Ewert, S.D.; Seeber, B.U. Communication conditions in virtual acoustic scenes in an underground station. In Proceedings of the Immersive and 3D Audio: From Architecture to Automotive (I3DA), Bologna, Italy, 8–10 September 2021; pp. 1–8.
65. Pulkki, V. Virtual sound source positioning using vector base amplitude panning. J. Audio Eng. Soc. 1997, 45, 456–466.
66. Wagener, K.; Brand, T.; Kollmeier, B. Entwicklung und Evaluation eines Satztests für die deutsche Sprache II: Optimierung des Oldenburger Satztests [Development and evaluation of a sentence test for the German language II: Optimization of the Oldenburg sentence test]. Z. Audiol./Audiol. Acoust. 1999, 38, 44–56. Available online: https://www.researchgate.net/publication/266735660_Entwicklung_und_Evaluation_eines_Satztests_fur_die_deutsche_Sprache_I_Design_des_Oldenburger_Satztests (accessed on 22 January 2026).
67. Kirsch, C.; Poppitz, J.; Wendt, T.; van de Par, S.; Ewert, S.D. Spatial resolution of late reverberation in virtual acoustic environments. Trends Hear. 2021, 25, 23312165211054924.
68. Schubotz, W.; Brand, T.; Kollmeier, B.; Ewert, S.D. Monaural speech intelligibility and detection in maskers with varying amounts of spectro-temporal speech features. J. Acoust. Soc. Am. 2016, 140, 524–540.
69. Holube, I.; Fredelake, S.; Vlaming, M.; Kollmeier, B. Development and analysis of an international speech test signal (ISTS). Int. J. Audiol. 2010, 49, 891–903.
70. Denk, F.; Kohnen, M.; Llorca-Bofí, J.; Vorländer, M.; Kollmeier, B. The “Missing 6 dB” revisited: Influence of room acoustics and binaural parameters on the loudness mismatch between headphones and loudspeakers. Front. Psychol. 2021, 12, 623670.
71. Farina, A. Simultaneous measurement of impulse response and distortion with a swept-sine technique. In Proceedings of the Audio Engineering Society Convention 108, Paris, France, 19–22 February 2000; p. 5093.
72. Ewert, S.D. AFC—A modular framework for running psychoacoustic experiments and computational perception models. In Proceedings of the International Conference on Acoustics AIA-DAGA, Berlin, Germany, 18–21 March 2013; Volume 2013, pp. 1326–1329.
73. ITU-R BS.1534-3; Method for the Subjective Assessment of Intermediate Quality Level of Audio Systems. International Telecommunication Union, Radiocommunication Sector (ITU-R): Geneva, Switzerland, 2014.
74. Abel, J.S.; Huang, P. A simple, robust measure of reverberation echo density. In Proceedings of the 121st Audio Engineering Society Convention, San Francisco, CA, USA, 5–8 October 2006; p. 6985.
75. Catic, J.; Santurette, S.; Dau, T. The role of reverberation-related binaural cues in the externalization of speech. J. Acoust. Soc. Am. 2015, 138, 1154–1167.
76. Laitinen, M.-V.; Politis, A.; Huhtakallio, I.; Pulkki, V. Controlling the perceived distance of an auditory object by manipulation of loudspeaker directivity. J. Acoust. Soc. Am. 2015, 137, EL462–EL468.
77. Hendrickx, E.; Stitt, P.; Lyzenga, J.; van de Par, S. The role of coherence in the externalization of auditory images. J. Acoust. Soc. Am. 2017, 141, 2012–2024.
78. Mannall, J.J.; Calamia, P.; Savioja, L.; Neidhardt, A.; Mason, R.D.; De Sena, E. Assessing diffraction perception under reverberant conditions in virtual reality. In Proceedings of the International Conference on Audio for Virtual and Augmented Reality (AVAR), Redmond, WA, USA, 19–21 August 2024.
79. Ewert, S.D. A filter representation of diffraction at infinite and finite wedges. JASA Express Lett. 2022, 2, 092401.
80. Ewert, S.D.; Kirsch, C. Filter-based solutions for finite and infinite edge diffraction. J. Acoust. Soc. Am. 2023, 154, A44.
81. Berg, J.; Rumsey, F. Identification of perceived spatial attributes of recordings by repertory grid technique and other methods. In Proceedings of the 16th Audio Engineering Society Conference on Spatial Sound Reproduction, Rovaniemi, Finland, 10–12 April 1999.
82. D’Orazio, D.; Fratoni, G.; Rovigatti, A.; Hamilton, B. Numerical simulations of Italian opera houses using geometrical and wave-based acoustics methods. In Proceedings of the 23rd International Congress on Acoustics (ICA 2019), Aachen, Germany, 9–13 September 2019; pp. 5994–5996.
83. Lai, H.; Hamilton, B. Computer modeling of barrel-vaulted sanctuary exhibiting flutter echo with comparison to measurements. Acoustics 2020, 2, 87–109.
84. Aretz, M.; Nöthen, R.; Vorländer, M.; Schröder, D. Combined broadband impulse responses using FEM and hybrid ray-based methods. In Proceedings of the EAA Symposium on Auralization, Espoo, Finland, 15–17 June 2009; pp. 15–17.

Figure 1. The three depicted settings, from left to right, in the Living Room, the Pub, and the Underground Station. The upper row shows the visual rendering generated by Unreal Engine. The lower row shows the floor plans, including an overview and zoomed-in areas for the Pub and Underground Station. Red rectangles highlight the zoomed-in areas in the floor plans, and the eye symbol marks the camera viewpoint from which the upper-row images were taken. The positions of the target speaker (green circle), masker (red circle), and listener position (black circle) are indicated in the bottom row.

Figure 2. Plausibility ratings for the three environments (Living Room, Pub, Underground Station) obtained with headphone (HP) presentation. The upper row shows results for the Speech stimulus and the lower row for the Pulse stimulus. Colors indicate the rendering condition (green: Measured, red: RAZR, blue: ISM). Individual data are indicated by the circles. The box plots show the median, interquartile range, and the overall range of the data. Asterisks denote significant pairwise differences within each Environment and Stimulus, using post-hoc comparisons following the rm-ANOVA. Significance level: *** p < 0.001. The highest plausibility is observed for Measured and RAZR in the Pub.

Figure 3. Speech reception thresholds (SRTs) expressed as signal-to-noise ratio (dB SNR) for the three environments (Living Room, Pub, Underground Station) obtained with headphone presentation and a co-located masker. Colors indicate rendering condition (green, Measured; red, RAZR; blue, ISM; orange, Anechoic). Box plots show the median, interquartile, and overall ranges. The circles show individual SRTs. Asterisks indicate significant differences within each environment using post-hoc comparisons following the rm-ANOVA. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Overall, the lowest SRT is observed in the Anechoic control condition. In the Pub and Underground Station, similar thresholds were obtained for Measured and RAZR.

Figure 4. Mean ratings of overall perceived difference relative to the RAZR reference, shown separately for Headphone (upper three rows) and Loudspeaker (lower three rows) presentation and for each stimulus (Speech, Pulse, Bass). Colors indicate rendering condition (green: Measured, headphones only; red, RAZR reference; blue, ISM; gray, RAZR-Simple; cyan, RAZR-1st-Order). A rating of 0 corresponds to “no perceptible difference from the reference,” while 100 indicates a “very large difference.” Box plots illustrate rating distributions; circles denote individual listener averages. Black asterisks indicate significant deviations from the RAZR reference within each condition based on post-hoc comparisons following a four-way repeated-measures ANOVA; red asterisks denote significant differences between Measured and RAZR for headphone presentation only. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Measured and ISM exhibit the highest difference ratings, while reduced RAZR variants cluster near the reference.

Figure 5. Spatial audio quality ratings for headphone (left) and loudspeaker (right). Each row represents one sound quality item, and each column corresponds to one environment (Living Room, Pub, Underground Station). Colors indicate rendering condition (green, Measured, headphones only; red, RAZR reference; blue, ISM; gray, RAZR-Simple; cyan, RAZR-1st-Order). A rating of 50 corresponds to “no perceptual difference from the reference.” Box plots show the median, interquartile and overall ranges; circles denote individual listener averages. Asterisks indicate significant deviations from the RAZR reference within each environment and presentation mode, based on post-hoc comparisons following repeated-measures ANOVAs performed separately for each sound quality item. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Largest deviations occur for ISM, while Measured and RAZR-1st-Order remain closest to the reference across most items.

Figure 6. Externalization ratings for headphone (upper two panels; HP) and loudspeaker (lower two panels; LS) in the same style as in Figure 3. Results are shown separately for Speech and Pulse stimuli. Columns correspond to the three environments (Living Room, Pub, Underground Station). Colors indicate rendering condition (green, Measured, headphones only; red, RAZR; blue, ISM; yellow, Diotic). The vertical scale represents perceived sound location ranging from internalized (“close to the head”) to externalized. Box plots show the median, interquartile, and overall ranges; circles denote individual listener averages. Asterisks indicate significant pairwise differences between renderings within each environment, stimulus, and presentation mode, based on post-hoc comparisons following repeated-measures ANOVAs. Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Measured shows the highest externalization for headphone playback, ISM is consistently reduced, and Diotic exhibits strong internalization for headphones but markedly higher externalization for loudspeakers.

Table 1. Geometric properties and reverberation times (T30) of the three real-world environments used as references for dummy-head recordings and room-acoustic simulations.

Environment	Primary Dimensions (m)	Primary Volume (m³)	Coupled Volumes	Reverberation Time T30 (s)
Living Room	4.97 × 3.78 × 2.71	50.9	Kitchen, 26.9 m³ (open door)	Living Room: 0.54 Kitchen: 0.66
Pub	17.67 × 10.2 × 2.9	≈442	None	0.7
Underground Station	120 × 15.7 × 4.16	≈11,000	Tunnels and escalator hall	1.6 (dual slope decay)

Table 2. Overview of the rendering conditions and corresponding acoustic modeling features used in the present study. The RAZR-based renderings employ image source modeling combined with scattering and feedback delay network (FDN) reverberation, following the general framework described in [11,13].

Rendering	Early Reflections	ISM Order	Scattering/Jittering	Coupled Volumes	Late Reverberation	Geometric Simplifications	Notes
Measured BRIR	Measured	-	-	Included	Measured	Real geometry	-
RAZR	Proxy shoebox ISM	3rd order	Yes	Yes	FDN	None	Highest simulated ALOD
RAZR-1st-Order	Proxy shoebox ISM	1st order	Yes	Yes	FDN	None	Reduced early reflection detail
RAZR-Simple	Proxy shoebox ISM	3rd order	Yes		FDN	Environment-specific removals	Simplified geometry
ISM	Shoebox ISM	15th order	No	No	None	Shoebox only	Used as anchor
Diotic	Proxy shoebox ISM	3rd order	Yes	Yes	FDN	None	Control (no spatial cues)
Anechoic	Direct sound only	-	-	-	None	-	SI control condition
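
To give a sense of scale for the ISM orders listed in Table 2, the following sketch counts the image sources of an ideal shoebox room up to a given reflection order. It assumes the standard lattice picture of a rectangular room, in which each image source corresponds to an integer triple (i, j, k) and its reflection order is |i| + |j| + |k|. No visibility checks, coupled volumes, or RAZR-specific processing are modeled, and the function names are hypothetical.

```python
from itertools import product


def count_image_sources(max_order):
    """Count shoebox image sources with reflection order 1..max_order
    by enumerating lattice indices (the direct sound, order 0, is excluded)."""
    indices = range(-max_order, max_order + 1)
    count = 0
    for i, j, k in product(indices, repeat=3):
        order = abs(i) + abs(j) + abs(k)
        if 1 <= order <= max_order:
            count += 1
    return count


def count_image_sources_closed_form(max_order):
    """Same count using the shell size 4*n**2 + 2 for reflection order n."""
    return sum(4 * n * n + 2 for n in range(1, max_order + 1))


# ISM orders taken from Table 2.
for rendering, order in [("RAZR-1st-Order", 1), ("RAZR", 3), ("ISM", 15)]:
    print(rendering, order, count_image_sources(order))
# RAZR-1st-Order 1 6
# RAZR 3 62
# ISM 15 4990
```

Under these assumptions, going from 1st to 3rd order raises the number of image sources from 6 to 62, while the 15th-order ISM anchor requires 4990, which illustrates why the image source order is a natural handle on computational cost.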

Table 3. Spatial audio quality items used in this study, as described in Lindau et al. [45]. The terms in the middle column denote the meaning of a score of 0, whereas the terms in the right column denote the meaning of a score of 100, in each case relative to the reference signal.

Spatial Audio Quality Item	Score of 0	Score of 100
Overall difference	None	Very large
Distance	Closer	More distant
Tone color	Darker	Brighter
Horizontal direction	Shifted clockwise	Shifted anticlockwise
Metallic tone color	More pronounced	Less pronounced
Reverberation	More	Less
Envelopment	More pronounced	Less pronounced
Source width	Wider	Less wide


Share and Cite

MDPI and ACS Style

Fichna, S.; van de Par, S.; Seeber, B.U.; Ewert, S.D. Perceptual Evaluation of Acoustic Level of Detail in Virtual Acoustic Environments. Acoustics 2026, 8, 9. https://doi.org/10.3390/acoustics8010009

