1. Introduction
Room acoustics simulation and virtual acoustics are widely applied, with uses ranging from generating highly controllable complex acoustic environments for hearing research and audiology [
1,
2,
3,
4] to entertainment and video game sound design [
5,
6], augmented reality (AR) or virtual reality (VR), as well as architectural planning [
7]. Several room acoustics simulation approaches exist [
8,
9,
10,
11,
12,
13,
14], each involving simplifications of the underlying acoustical processes. Many algorithms [
10,
11] simulate early reflections using the image source model (ISM) [
15], combined with an alternative method to simulate late reverberation, such as ray tracing [
16], feedback delay networks [
17] extended to spatial rendering [
11,
13,
18], scattering delay networks [
19], and acoustic radiance transfer [
14,
20]. State-of-the-art methods achieve a high degree of perceptual plausibility [
21,
22], defined as “a simulation in agreement with the listener’s expectation towards a corresponding real event” according to Lindau and Weinzierl [
23].
For interactive real-time applications, it is of particular interest to reduce the computational demands of room acoustics simulation while maintaining perceptual plausibility and reasonable agreement in spatial audio quality and speech intelligibility (SI) between the simulation and the corresponding real-life environment. To this end, the acoustic level of detail (ALOD) of the simulated environment can be reduced as long as perceptual differences remain below the relevance thresholds reported in the literature. For example, changes in SI of approximately 1 dB in speech reception threshold (SRT) are commonly regarded as perceptually meaningful, e.g., [
24,
25], whereas smaller deviations are often considered negligible. Similarly, regarding audio quality, studies suggest that broadband spectral deviations of about 1 dB or less typically remain below just-noticeable difference thresholds under many listening conditions, e.g., [
26,
27].
The term ALOD is inspired by level of detail (LOD), which in computer graphics denotes typically distance-dependent reductions in the geometric complexity of visual models (polygon count) that preserve the perceptual impression. Prior studies have examined the influence of ALOD on acoustic parameters such as reverberation time and the speech transmission index [
28]. Abd Jalil et al. [
29] demonstrated that the number of surfaces in acoustic models can be reduced by up to 80% without compromising accuracy. Moreover, it has been shown that even when simplifying more complex room shapes to a proxy rectangular (“shoebox”) geometry [
11,
13,
18], it is possible to maintain perceptual plausibility and achieve close alignment with measured reference conditions [
21,
22,
30]. This level of simplification not only facilitates low-latency, interactive virtual acoustic environments for hearing research but also allows for the possibility of running realistic spatial rendering on mobile hardware, which is advantageous for fitting hearables and hearing support systems [
31,
32]. To ensure broad applicability of such simulations, it is important to evaluate perceptual outcomes using both headphone-based and loudspeaker-based reproduction methods. While headphone reproduction allows precise control over binaural cues, loudspeaker setups may be more suitable in scenarios where head tracking is unavailable or when simulating open-fit hearing devices. It has been shown [
4] that shoebox simulations can yield good agreement with real-room references in terms of SI under both rendering approaches. However, it remains unclear whether certain simplifications in the acoustic model, such as omitting nearby reflective surfaces or coupled volumes, impact perception differently depending on the reproduction system.
Beyond room-acoustic simulation research, perceptual evaluation of complex acoustic environments has been extensively addressed in the field of soundscape research. Soundscape studies explicitly focus on perceptual, affective, and contextual dimensions of acoustic environments and commonly rely on structured listening experiments to assess attributes such as realism, comfort, spatial impression, and overall environmental quality [
33,
34,
35]. Within this research area, indoor soundscapes and everyday listening situations have received increasing attention. For example, Torresin et al. [
36] investigated indoor soundscapes under different ventilation conditions in residential buildings. Audio–visual interactions have also been shown to play a significant role in shaping soundscape perception and environmental satisfaction, as demonstrated by Jeon et al. [
37] in an urban context.
Methodologically, a variety of headphone- and headset-based reproduction approaches have been employed in soundscape and immersive audio research to enable perceptual evaluation across different experimental settings. Fusaro et al. [
38] compared laboratory-based and online soundscape assessment methods using headphone reproduction, demonstrating that perceptual soundscape characterization can be obtained across different study formats when listening conditions and playback devices are documented and appropriately accounted for. Related work has further shown that perceptual audio evaluation can be extended beyond strictly controlled laboratory environments using online or hybrid listening-test methodologies, allowing for larger and more diverse listener samples [
39,
40].
These soundscape-oriented approaches typically aim to capture global perceptual impressions and contextual responses in everyday audio-visual scenarios. In contrast, the present study adopts a complementary perspective by focusing on controlled manipulations of acoustic simulation parameters, enabling a systematic investigation of how specific reductions in ALOD affect perceptual quality and task-based performance under defined reproduction conditions.
Fichna et al. [
41] investigated the perception of simulated and real acoustic scenes with varying ALOD using headphones. Their preliminary study examined SI in the presence of a masking signal and evaluated the overall perceived spatial audio quality of simulations with different ALODs relative to recordings, using a spectrally pink-colored pulse as the target signal. However, the study was based on only three participants and therefore served primarily as a proof of concept rather than allowing statistically robust or generalizable conclusions. Fichna et al. [
42] extended this research by systematically evaluating plausibility, the overall perceived difference between the renderings, and the perceived externalization with headphones and loudspeakers. They used speech as well as a pink-colored pulse as target sounds. Their data showed that simulations with a maximum ALOD were as plausible as those generated using measured binaural room impulse responses (BRIRs), even with the transient pulse stimuli, which generally provide the highest sensitivity for perceptual differences in room acoustics simulation (see, e.g., [
18]). Despite high plausibility ratings observed in [
42], a considerable perceived overall difference between simulated and measured conditions remained. However, it remains unclear to which perceptual attributes these differences are related, and an overall picture including SI, plausibility, and spatial audio quality attributes is missing. Conclusions regarding SI are limited by the relatively small number of subjects in [
41].
Recently, Martin et al. [
43] investigated the implications of simplifying geometrical acoustics by reducing a highly resolved surface model of a real home environment to a shoebox representation. Using the CATT Acoustics software [
44], which combines image-source and ray-tracing methods, their results suggested that geometric decimation had less impact on the perception of reverberation than reductions in the frequency resolution of absorption coefficients. In addition to plausibility and SI, perceptual dimensions, such as spatial audio quality and externalization, are essential for evaluating the realism of room-acoustic simulations. Spatial audio quality encompasses attributes like source distance, reverberation, and timbral coloration (see [
45]), while externalization describes the extent to which sounds are perceived as located outside the listener’s head.
Taken together, previous studies demonstrate that considerable simplifications in acoustic geometry and simulation detail can be perceptually acceptable, particularly with regard to plausibility and SI. However, prior work has typically focused on individual environments, specific target signals, or isolated reproduction methods, limiting the ability to generalize findings across settings. Moreover, existing data are often based on separate listener groups and narrow perceptual attributes, making it difficult to assess how ALOD reductions affect different reproduction methods and room types in a unified framework.
The present study addresses this gap by systematically evaluating perceptual consequences of ALOD across three virtual environments using both headphone- and loudspeaker-based reproduction within a single group of participants. Specifically, we extend our data collected in [
42] by testing SI and spatial audio quality in the same group of listeners and conditions. Additionally, we extend the stimuli to include a music token for perceived overall difference, and jointly analyze newly recorded and existing data across the resulting broad range of perceptual measures. A comprehensive repeated-measures analysis across environments, rendering conditions, stimuli, and reproduction methods provides a statistically grounded basis for assessing systematic dependencies and interactions. Integrating these methodological extensions into a coherent interpretative framework enables a more differentiated examination and discussion of the perceptual consequences of acoustic model simplifications than in [
42]. The resulting study design allows for direct comparisons of ALOD effects across simulation setups and supports a better understanding of which simplifications can be tolerated without compromising perceptual validity.
Given that a reduction of ALOD may affect different perceptual dimensions in different ways, we employ a set of complementary perceptual measures to assess both functional and perceptual consequences of room-acoustic model simplifications: Speech intelligibility represents a task-based, functional measure that is particularly relevant for hearing research and applications involving communication in everyday acoustic environments. Plausibility was included as a global judgment reflecting whether a simulated scene meets listeners’ expectations of a corresponding real environment, while an overall perceived difference measure was employed as a more stringent test of perceptual fidelity, targeting authenticity in the sense of how closely a rendering matches a real-room reference rather than whether it merely appears plausible. To interpret potential overall differences in a more diagnostic manner, spatial audio perception was further assessed using defined sound quality items (e.g., timbral coloration, reverberation-related impressions, spatial clarity/width, and envelopment). Externalization was included as a key spatial percept that is known to be sensitive to inaccuracies in binaural and room-acoustic cues and is, therefore, particularly informative for evaluating spatial realism under different reproduction methods.
Three distinct and acoustically diverse environments were investigated: BRIRs recorded in a living room coupled with a kitchen, a pub, and an underground station [
46] were used as references for headphone-based auralization. To assess the influence of the reproduction method, spatial audio quality and externalization were also evaluated using a three-dimensional loudspeaker array. The Room Acoustics SimulatoR (RAZR) [
11,
13] was employed to generate synthetic BRIRs and loudspeaker reproductions.
The role of ALOD has so far often been studied using decimation of the underlying geometric model, e.g., [
29,
43]. While this modulates several properties of the room acoustics simulation simultaneously, such as the number and spatial direction of early reflections, the temporal increase in echo density as well as spatial diffusion and reverberation time, the current approach aims to modulate specific, potentially important features of the simulated room impulse response in an isolated and controlled way. First, we start off with a considerably lower ALOD regarding the geometrical detail than the above studies, given that the underlying room geometry is a “proxy” rectangular (shoebox) room. With our highest ALOD model, we do not intend to model all geometrical details; however, the model has been fitted to match the frequency-dependent reverberation time and the long-term spectrum of the targeted real-world (B)RIR, important known perceptual features for comparing room acoustics. Moreover, the model has been demonstrated earlier, e.g., [
18], to yield a realistic increase in echo density by incorporating sound scattering and natural-sounding (spatial) late reverberation. In order to reduce ALOD, we specifically vary selected room simulation features, such as the presence of scattering or dual-slope decays, or the inclusion of early reflections from nearby geometrical structures. Thus, ALOD was defined from a more perceptual perspective in the current study, rather than from a geometrical one defined by polygon count. The goal was to determine how ALOD can be reduced to the lowest perceptually acceptable level, depending on source stimulus and acoustic environment, starting from an already simplified and computationally efficient model. The highest ALOD condition included all available RAZR features, such as simplified effects of scattering and diffraction [
18], while lower ALOD levels were created by successively omitting these components. Additionally, the effects of simulating nearby finite reflective surfaces, e.g., [
47,
48], and coupled volumes [
13] were assessed, depending on the presence of that specific feature in the environment. This perceptually motivated definition of ALOD provides a systematic framework for assessing how different aspects of acoustic modeling complexity influence perceptual outcomes.
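To make the notion of early reflections in a proxy shoebox room concrete, the following minimal sketch computes the six first-order image sources of a point source and their arrival delays relative to the direct sound. It is a textbook illustration of the image source model only, not RAZR's implementation; the room dimensions, positions, function names, and the speed of sound are assumptions chosen for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def first_order_images(source, room):
    """Return the six first-order image sources of `source` in a shoebox room.

    The room spans [0, Lx] x [0, Ly] x [0, Lz]; `room` is (Lx, Ly, Lz).
    Each image is the source mirrored at one of the six walls.
    """
    images = []
    for axis in range(3):
        for wall in (0.0, room[axis]):
            img = list(source)
            img[axis] = 2.0 * wall - source[axis]  # mirror at the wall plane
            images.append(tuple(img))
    return images


def arrival_delay(point, receiver, c=SPEED_OF_SOUND):
    """Propagation delay in seconds from `point` to `receiver`."""
    return math.dist(point, receiver) / c


# Hypothetical geometry in metres, not taken from the study.
room = (5.0, 4.0, 3.0)
source = (1.0, 2.0, 1.5)
receiver = (4.0, 2.0, 1.5)

direct = arrival_delay(source, receiver)
for img in first_order_images(source, room):
    extra_ms = 1000.0 * (arrival_delay(img, receiver) - direct)
    print(img, f"{extra_ms:.2f} ms after the direct sound")
```

Higher reflection orders follow by mirroring the images again, which is why the number of image sources, and hence the computational cost, grows quickly with reflection order.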
To motivate the choice of stimuli and perceptual tasks in this study, it is important to consider that different auditory attributes may vary in their sensitivity to changes in room-acoustic modeling fidelity. Previous research has shown that stimuli with rich spatial or spectral characteristics, such as music signals or transient noise bursts, tend to be more sensitive to rendering inaccuracies than typical speech signals, which are dominated by temporal envelope cues and linguistic redundancy [
49,
50]. This suggests that the perceptual detectability of reduced ALOD depends not only on the simulation method but also on the acoustic features of the source stimulus.
In addition, perceptual dimensions related to spatial quality, such as externalization and spatial clarity, were expected to place greater demands on acoustic modeling detail than task-based performance measures like speech intelligibility [
23,
51]. Speech, a transient pulse, and a spectrally rich musical stimulus are employed to probe stimulus-dependent sensitivity to ALOD, with the expectation that speech is comparatively robust, whereas broadband and transient signals reveal increased sensitivity to modeling simplifications. Together, this stimulus selection enables a differentiated investigation of how reductions in ALOD interact with source characteristics and perceptual outcome measures.
The study focuses on four main research topics:
(i) Effect of ALOD variation on SI. It is expected that SI will remain largely stable across different ALOD levels, as long as early reflections and diffuse reverberation are sufficiently represented [
2,
4,
29,
52]. However, excessive reductions in ALOD may degrade SI due to insufficient representation of early reflections that improve effective signal-to-noise ratio [
53].
(ii) Effect of source stimulus on the perception of ALOD reductions. Because different stimuli emphasize distinct acoustic cues, it is expected that transient and broadband signals (e.g., pulses) will reveal rendering differences more clearly than spectrally limited or speech-like signals, which may remain perceptually stable under reduced ALOD [
28,
54].
(iii) Contribution of specific spatial audio quality items to differences in perceived overall sound quality. Differences could primarily be influenced by spectral coloration, reverberation characteristics, and spatial coherence, cf. [
55].
(iv) Effect of playback method (headphones vs. loudspeakers). It is expected that loudspeaker playback yields higher externalization ratings than headphone playback due to additional binaural and room cues, cf. [
56,
57]. However, in headphone-based simulations, externalization can still be influenced by factors such as the realism of the simulated room acoustics, including the number and spatial distribution of reflections, the balance between direct sound and early reflections, and the accuracy of spatial cues rendered through Head-Related Transfer Functions (HRTFs).
3. Results
3.1. Plausibility
The results of the plausibility test are shown in
Figure 2, representing a re-plot of the raw data from [
42]. The average test-retest correlation across the ten participants was 0.882 and exceeded 0.8 for each participant, indicating high rating consistency. The top panel shows results for speech and the bottom panel for the pulse. The color of the box plots indicates the Rendering condition. Here and in the following figures, green shows results for Measured, red shows RAZR, and blue shows ISM (from left to right). The circles represent individual results and the box plots show the median, the 25th and 75th percentiles (interquartile range), and the overall range. Asterisks indicate significant pairwise differences within each Environment and Stimulus, obtained from post-hoc comparisons following a three-way repeated-measures analysis of variance (rm-ANOVA) [Stimulus (Speech, Pulse) × Environment (Living Room, Pub, Underground Station) × Rendering (Measured, RAZR, ISM)]. Across environments, the highest ALOD condition (RAZR) generally yields plausibility ratings comparable to the measured reference, indicating that listeners largely accepted these simulations as plausible representations of the real rooms.
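As an illustration of how an average test-retest correlation of this kind can be obtained, the sketch below computes a Pearson correlation between test and retest ratings for each participant and then averages across participants. The type of correlation is an assumption (the text does not specify it), and the ratings and the function name are hypothetical.

```python
import numpy as np


def mean_test_retest_correlation(test, retest):
    """Average per-participant Pearson correlation between two sessions.

    `test` and `retest` have shape (participants, conditions).
    Returns (mean correlation, array of per-participant correlations).
    """
    test = np.asarray(test, dtype=float)
    retest = np.asarray(retest, dtype=float)
    r = np.array([np.corrcoef(t, rt)[0, 1] for t, rt in zip(test, retest)])
    return r.mean(), r


# Hypothetical ratings (percent "plausible") for 3 participants x 4 conditions.
test = [[90, 80, 20, 60], [100, 70, 10, 50], [80, 90, 30, 40]]
retest = [[85, 75, 30, 55], [95, 80, 20, 45], [90, 85, 20, 50]]

mean_r, per_participant = mean_test_retest_correlation(test, retest)
print(round(float(mean_r), 3), per_participant.round(3))
```

Reporting the per-participant minimum alongside the mean, as done in the text, guards against a high average that hides one inconsistent listener.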
The overall highest and consistent plausibility rating is found for Measured (green) and RAZR (red) in the Pub (middle column of
Figure 2). Here, pairwise comparisons of marginal means averaged across headphone and loudspeaker reproduction revealed that in the Pub, ISM (blue) was significantly less often perceived as plausible than Measured [mean difference (md) = 48.31%;
p = 0.004] and RAZR (md = 50.13%;
p < 0.001). This visibly lower plausibility for ISM was observed for both stimuli, Speech and Pulse. For the speech material, no significant differences between ISM and Measured or between ISM and RAZR were found in the Living Room (top row, first column in
Figure 2) and in the Underground Station (top row, third column in
Figure 2). Although visually less apparent, Measured and RAZR also received significantly higher plausibility ratings than ISM in the Living Room and Underground Station for the Pulse stimulus (bottom row of
Figure 2).
Overall, the ANOVA revealed a significant main effect of Rendering [F(1.034, 7.235) = 34.08, p < 0.001, η² = 0.830] but no main effect of Stimulus or Environment. A significant Stimulus × Rendering interaction was found. Post-hoc tests showed that speech and pulse differed significantly only for ISM, as can be seen by comparing the blue box plots in the top row with those in the bottom row.
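The reported effect size can be cross-checked against the F statistic and its degrees of freedom. The sketch below assumes that the reported value is a partial eta squared, which the text does not state explicitly; under that assumption the standard relation is F·df_effect / (F·df_effect + df_error). The function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its dfs."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Main effect of Rendering on plausibility: F(1.034, 7.235) = 34.08
eta_p2 = partial_eta_squared(34.08, 1.034, 7.235)
print(f"{eta_p2:.3f}")  # 0.830
```

The result agrees with the reported value to three decimals, which is consistent with, although not proof of, the partial eta squared interpretation.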
3.2. Speech Intelligibility
Figure 3 shows the mean SRT for the three environments obtained with headphone presentation. The color coding follows the convention used in
Figure 2. In addition, the orange plots represent the results for Anechoic. The circles represent individual results and the box plots show the median, interquartile range, and overall range. Asterisks indicate significant pairwise differences between Renderings within each Environment, obtained from post-hoc comparisons following a two-way rm-ANOVA [Environment (Living Room, Pub, Underground Station) × Rendering (Measured, RAZR, ISM, Anechoic)]. Across all environments, Measured (green), RAZR (red), and ISM (blue) yield comparable SRT distributions, whereas the Anechoic condition (orange) consistently shows markedly lower SRTs, visible as a clear downward shift of the orange box plots. In the Living Room, SRTs for Measured were significantly higher than for RAZR, while RAZR and ISM did not differ. In the Underground Station, Measured differed significantly from ISM, while RAZR and ISM did not differ.
Overall, the rm-ANOVA revealed a significant main effect of Rendering on SI [F(3, 21) = 146.48; p < 0.001; η² = 0.954], with post-hoc pairwise comparisons yielding the same significant differences as indicated for the Pub in
Figure 3: RAZR (red) and ISM (blue) did not significantly differ from Measured (green), indicating that both simulated renderings reproduce SI close to the real-room reference. As also visually apparent in
Figure 3, the Anechoic condition (orange), by contrast, yielded significantly lower SRTs than all other renderings (md = 5.31 dB vs. Measured,
p < 0.001; md = 4.55 dB vs. RAZR,
p < 0.001; md = 5.72 dB vs. ISM,
p < 0.001). Additionally, a significant difference between RAZR and ISM was found (md = 1.17 dB,
p = 0.005) with RAZR providing the lower SRTs.
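The mean differences relative to Anechoic also imply the pairwise differences between the three reverberant renderings. The following worked example assumes that the reported md values are differences of marginal means from the same model, so that they can be subtracted from each other; the 1 dB value is the approximate relevance threshold mentioned in the Introduction, and the variable and function names are hypothetical.

```python
# Mean SRT differences relative to Anechoic, in dB, as reported in the text.
md_vs_anechoic = {"Measured": 5.31, "RAZR": 4.55, "ISM": 5.72}

THRESHOLD_DB = 1.0  # approximate relevance threshold quoted in the Introduction


def pairwise_difference(a, b, md=md_vs_anechoic):
    """SRT of rendering `a` minus SRT of rendering `b`, in dB."""
    return round(md[a] - md[b], 2)


for a, b in [("ISM", "RAZR"), ("Measured", "RAZR"), ("ISM", "Measured")]:
    diff = pairwise_difference(a, b)
    print(f"{a} - {b}: {diff:+.2f} dB, exceeds 1 dB: {abs(diff) > THRESHOLD_DB}")
```

The ISM minus RAZR value reproduces the reported 1.17 dB, while both simulated renderings stay within about 1 dB of Measured (0.76 dB and 0.41 dB under the stated assumption).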
A main effect of Environment was also found [F(1.206, 8.442) = 54.08; p < 0.001; η² = 0.885], with generally lower SRTs in the Pub, followed by the Living Room and the Underground Station as visible in
Figure 3. The ANOVA also revealed a significant Rendering × Environment interaction [F(2.36, 16.54) = 12.71; p < 0.001; η² = 0.645], indicating that the Rendering differences varied somewhat across Environments as shown by the asterisks. However, this interaction does not reflect an effect of Environment, but rather differences in how the renderings performed within each Environment.
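The fractional degrees of freedom in some of these F tests suggest that a sphericity correction was applied. The text does not name the correction, so the sketch below only assumes a correction of the Greenhouse-Geisser or Huynh-Feldt type, in which the nominal degrees of freedom are multiplied by a factor ε. It recovers ε from the corrected effect degrees of freedom; the function name is hypothetical.

```python
def correction_epsilon(df_effect_corrected, n_levels_per_factor):
    """Sphericity-correction factor implied by a corrected effect df.

    The uncorrected effect df of a within-subject term is the product of
    (levels - 1) over the factors involved in that term.
    """
    df_nominal = 1
    for levels in n_levels_per_factor:
        df_nominal *= levels - 1
    return df_effect_corrected / df_nominal


# Environment (3 levels): F(1.206, 8.442)
print(round(correction_epsilon(1.206, [3]), 3))    # 0.603
# Rendering (4 levels) x Environment (3 levels): F(2.36, 16.54)
print(round(correction_epsilon(2.36, [4, 3]), 3))  # 0.393
```

An ε well below 1 indicates a marked departure from sphericity, which is why the corrected rather than the nominal degrees of freedom are reported for these terms.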
3.3. Overall Difference
Figure 4 shows the mean overall difference ratings for each Rendering condition, separated by Presentation Mode (upper three rows for Headphones, lower three for Loudspeakers) and by Stimulus (Speech, Pulse, Bass). The results for Speech and Pulse are a re-plot of the raw data from [
42]. The colors of the box plots follow the color coding of the previous figures with two new Rendering conditions added here (from left to right): green, Measured (headphone presentation only); red, RAZR (reference); blue, ISM; black, RAZR-Simple; teal, RAZR-1st-Order. A value of 0 on the
y-axis represents “no perceptible difference from the reference signal” and a value of 100 represents the rating “very large difference from the reference”. The circles represent individual results, and the box plots show the median, interquartile range, and overall range. The average test-retest correlation was 0.92, and results were pooled across repetitions. Black asterisks indicate significant pairwise differences from the reference Rendering (RAZR, red) within each Environment, Stimulus, and Presentation Mode, obtained from post-hoc comparisons following a four-way rm-ANOVA [Stimulus (Speech, Pulse, Bass) × Environment (Living Room, Pub, Underground Station) × Presentation Mode (Headphones, Loudspeaker) × Rendering (RAZR, ISM, RAZR-Simple, RAZR-1st-Order)]. Red asterisks additionally indicate significant differences between Measured and RAZR, obtained from post-hoc comparisons following a three-way rm-ANOVA [Stimulus (Speech, Pulse, Bass) × Environment (Living Room, Pub, Underground Station) × Rendering (Measured, RAZR, ISM, RAZR-Simple, RAZR-1st-Order)], covering the full factorial design for headphone presentation.
For the factor Rendering, post-hoc comparisons were conducted only relative to the reference condition (RAZR), as this rendering was available for both presentation modes and served as the baseline for all perceptual evaluations. In general, the hidden reference (RAZR) was consistently detected and rated with no difference, indicating reliability of the results. This is visible in
Figure 4 as near-zero median difference ratings for RAZR across all environments, stimuli, and presentation modes. RAZR and RAZR-1st-Order (teal) were rated similarly across all conditions, while ISM (blue) and Measured (green) produced the largest perceptual differences from the reference. This ordering is consistently reflected across rows and columns of
Figure 4, with ISM showing the highest median difference ratings in most conditions. In the Underground Station, RAZR-Simple (black) without the dual slope decay was also mostly rated with 0 difference to the reference. Confirmed by significant differences marked as black asterisks, ISM produced significantly higher difference ratings than RAZR in the Living Room and Underground Station, whereas Measured differed significantly from RAZR in all environments as indicated by the red asterisks.
The four-way rm-ANOVA revealed significant main effects of Stimulus, Environment, and Rendering, but no main effect of Presentation Mode. Specifically, Stimulus had a significant effect on the perceived overall difference [F(2, 14) = 60.63, p < 0.001, η² = 0.896], indicating that the rated differences depended on the type of target signal. Differences were generally rated higher for the Pulse than for Speech (md = 13.06;
p < 0.001) or Bass (md = 12.49;
p < 0.001) stimuli. This stimulus dependence is visible in
Figure 4 as consistently higher distributions for Pulse compared to Speech and Bass across environments and renderings. A significant main effect of Environment was also found [F(2, 14) = 237.90, p < 0.001, η² = 0.971]. Post-hoc tests for Environment (not shown) revealed that larger differences occurred in the Living Room than in the Pub and Underground Station. Accordingly,
Figure 4 shows generally higher difference ratings in the left column compared to the center and right columns.
The main effect of Rendering was highly significant [F(1.605, 11.235) = 470.88, p < 0.001, η² = 0.985] and post-hoc pairwise comparisons revealed that differences were significantly larger for ISM (md = 58.04,
p < 0.001) and RAZR-Simple (md = 19.55,
p < 0.001) than for the reference RAZR. In contrast, RAZR and RAZR-1st-Order differed only slightly (md = 0.85,
p = 0.039). This pattern is clearly visible in
Figure 4, where both Renderings show near-zero median differences across all stimuli. Across environments and stimuli, RAZR and RAZR-1st-Order were rated most similar, while ISM and RAZR-Simple produced substantially higher difference ratings.
Results from the additional three-way rm-ANOVA (headphone data only) showed a significant main effect of rendering and the related post-hoc tests confirmed that the Measured condition was consistently rated as significantly different from the RAZR reference across all environments as indicated by the red asterisks in
Figure 4. No significant main effect of Presentation Mode was found [F(1, 7) = 2.32, p = 0.172, η² = 0.249], indicating that overall difference ratings were similar between headphone and loudspeaker presentation. This similarity is reflected in
Figure 4 by comparable distributions across the upper (headphone) and lower (loudspeaker) panels. A Stimulus × Environment interaction was found [F(4, 28) = 9.99, p < 0.001, η² = 0.588]. Post-hoc tests showed that this interaction was driven by the Bass stimulus: ratings in the Living Room differed significantly from those in the Pub and the Underground Station. For Speech and Pulse, no such differences between environments occurred.
3.4. Spatial Audio Quality Items
Figure 5 shows the results for headphone (
Figure 5 left) and for loudspeaker (
Figure 5 right) presentation. Each row in both main panels represents one spatial audio quality item and each column an environment. The horizontal line in each plot marks a score of 50, corresponding to “no perceptual difference between the test signal and the reference.” Circles represent individual results and the box plots show the median, interquartile range, and overall range. The color coding follows the convention of
Figure 4: green = Measured (headphones only), red = RAZR (reference), blue = ISM, black = RAZR-Simple, teal = RAZR-1st-Order. Asterisks indicate significant pairwise differences from the reference (RAZR) within each environment, obtained from post-hoc comparisons following a three-way rm-ANOVA [Environment (Living Room, Pub, Underground Station) × Presentation Mode (Headphones, Loudspeaker) × Rendering (RAZR, ISM, RAZR-Simple, RAZR-1st-Order)] performed for each spatial audio quality attribute. For the factor Rendering, post-hoc comparisons were performed only relative to RAZR, as it served as the common reference across presentation modes. Additional two-way rm-ANOVAs [Environment (Living Room, Pub, Underground Station) × Rendering (Measured, RAZR, ISM, RAZR-Simple, RAZR-1st-Order)] for the headphone data (
Figure 5 left) revealed no significant differences between the Measured condition and the RAZR reference for any of the sound-quality items.
A first inspection of
Figure 5 shows that large perceptual differences between renderings occurred only occasionally. Specifically,
Figure 5 shows that ratings for RAZR and RAZR-1st-Order remain tightly clustered around the “no difference” line at 50 across most items and environments, whereas larger deviations are primarily associated with ISM (blue) and, to a lesser extent, RAZR-Simple (black). When present, these differences were most pronounced in the Living Room. This pattern is visible across multiple rows in the left column of
Figure 5, where ISM consistently shows larger deviations from the reference than in the Pub or Underground Station. The data also show considerable inter-individual variability, resulting in relatively few statistically significant differences overall, as reflected by the sparse asterisks in each panel. No major differences could be found between the two Presentation Modes, Headphones and Loudspeaker. This similarity is reflected in
Figure 5 by comparable distributions and medians between the left (headphones) and right (loudspeakers) panels for most items and renderings. For headphones (
Figure 5 left), Measured (green) showed small deviations from the RAZR reference (red); however, these differences often occurred in both positive and negative directions across participants and environments, resulting in mean ratings that remained centered around the reference, between 40 and 60. Accordingly,
Figure 5 left shows Measured ratings scattered symmetrically around the reference line, rather than a consistent shift in a single perceptual direction. Consequently, no significant differences between Measured and RAZR were observed, except for Source Width in the Underground Station. For tone color, a small but consistent (yet non-significant) difference is visible between Measured and RAZR for all environments. This can be observed in
Figure 5 left as a slight upward shift of the green distributions relative to the red reference across environments, without exceeding the variability observed across listeners.
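The observation that deviations in opposite directions cancel in the mean can be illustrated with a small numerical sketch. The ratings below are hypothetical and only mimic the described pattern on the 0 to 100 scale with 50 as the "no difference" midpoint; the function name is likewise an assumption.

```python
import numpy as np

NO_DIFFERENCE = 50.0  # scale midpoint: "no perceptual difference" from the reference


def deviation_summary(ratings):
    """Return (mean signed deviation, mean absolute deviation) from the midpoint."""
    dev = np.asarray(ratings, dtype=float) - NO_DIFFERENCE
    return dev.mean(), np.abs(dev).mean()


# Hypothetical ratings from eight listeners for one item and environment.
ratings = [38, 62, 45, 55, 41, 59, 47, 53]
signed, absolute = deviation_summary(ratings)
print(signed, absolute)
```

In this constructed case the signed deviations average to zero although every listener rated some difference, which is why a non-significant mean shift on a bipolar scale does not by itself imply that no difference was perceived.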
The following description summarizes the significant main effects, first-order interactions, and post-hoc results obtained from the three-way rm-ANOVAs:
Effect of Environment. The three-way rm-ANOVAs revealed significant main effects of Environment for the items Distance (η² = 0.69), Reverberation (η² = 0.81), Source Width (η² = 0.67), Tone Color (η² = 0.40), and Metallic Tone Color (η² = 0.40). In general, perceived differences from the reference in Distance and Reverberation were rated higher in the Living Room than in the Pub and Underground Station.
For Reverberation, significant interactions were observed with both Presentation Mode and Rendering. A significant Environment × Presentation Mode interaction showed that, for headphone presentation, perceived differences from the reference were significantly larger in the Living Room than in the Underground Station, whereas for loudspeaker presentation, the Living Room differed from both the Pub and the Underground Station. This indicates that environment-dependent deviations in perceived reverberation from the reference were more pronounced under loudspeaker playback.
Furthermore, a significant Environment × Rendering interaction indicated that these Environment-dependent variations occurred primarily for the ISM rendering, whereas the other renderings showed more consistent ratings across environments. This pattern suggests that the reverberant character reproduced by ISM was particularly sensitive to the acoustic context, resulting in stronger perceived deviations from the reference in the Living Room.
Visually, this is reflected in
Figure 5 by larger ISM deviations in the Living Room than in the other environments, especially for reverberation-related items. Several other higher-order interactions between Environment and other factors (e.g., Rendering) reached statistical significance; however, post-hoc comparisons revealed no systematic or interpretable trends beyond the main effects. Occasional differences appeared for individual renderings within single environments (e.g., ISM or RAZR-Simple in the Living Room), but these effects were inconsistent across items and did not alter the overall pattern of results. Therefore, only the main effects and interactions that revealed coherent, perceptually meaningful differences are reported in detail.
Effect of Rendering. The three-way rm-ANOVAs revealed significant main effects of Rendering for the items Distance (η² = 0.55), Metallic Tone Color (η² = 0.43), and Reverberation (η² = 0.43), indicating that the perceived deviation from the reference depended on the rendering method. In general, ISM and RAZR-Simple led to the largest perceptual deviations from the reference (i.e., ratings that were shifted furthest away from the “no difference” point at 50), whereas RAZR and RAZR-1st-Order produced ratings closest to 50 and were therefore most similar to the reference.
For Distance, post-hoc comparisons showed a significant difference only between RAZR-Simple and RAZR-1st-Order. No other pairwise comparisons between renderings reached significance. As seen in the first column, first row of
Figure 5 left, this effect was mainly driven by the Living Room condition, where RAZR-Simple produced larger deviations from the reference than RAZR-1st-Order. In the other environments, the two renderings were more similar. No consistent differences were observed between the remaining renderings.
For Metallic Tone Color, post-hoc tests revealed significant differences between RAZR and RAZR-Simple, and between RAZR-Simple and RAZR-1st-Order. No other rendering pairs differed significantly. In the Living Room (
Figure 5 left, fourth row, first column), RAZR-Simple produced a perceptual shift in metallic coloration relative to both the full RAZR and the RAZR-1st-Order conditions, suggesting that this attribute was particularly sensitive to the simplified coupled-room configuration used in RAZR-Simple.
For Reverberation, although the main effect of Rendering was statistically significant, none of the pairwise post-hoc comparisons between renderings reached significance. Thus, the overall effect of Rendering on perceived reverberation differences from the reference was small and not attributable to a specific pair of renderings. Overall,
Figure 5 reveals a clear hierarchy of rendering performance: RAZR and RAZR-1st-Order remain closest to the reference across most items, RAZR-Simple shows moderate deviations depending on environment, and ISM produces the largest and most systematic perceptual differences. Several significant interaction effects were observed between the experimental factors, most prominently involving Rendering and Environment.
For the items Distance, Metallic Tone Color, and Reverberation, the Rendering × Environment interactions followed a consistent pattern. Significant rendering differences occurred mainly in the Living Room and were driven by the ISM and RAZR-Simple renderings, both of which were rated as perceptually more different from the reference than the other renderings. The remaining renderings showed more stable ratings across environments and remained perceptually closer to the reference.
Two notable exceptions were observed. For Source Width, significant rendering differences appeared only in the Underground Station, where ISM again differed from all other renderings. For Envelopment by Reverberation, a Rendering × Presentation Mode interaction indicated that significant differences between RAZR and ISM occurred only under Loudspeaker presentation, but not with Headphones. Overall, these findings indicate that rendering-related perceptual deviations were most pronounced for ISM and, to a lesser extent, RAZR-Simple, particularly in the Living Room and Underground Station.
Effect of Presentation Mode. A significant main effect of Presentation Mode was found for Envelopment by Reverberation (η² = 0.46). Ratings were, on average, about seven points lower under Loudspeaker presentation than under Headphone presentation, indicating a reduced sense of envelopment when reproduced over loudspeakers.
A significant Rendering × Presentation Mode interaction was also observed for Envelopment by Reverberation. Post-hoc comparisons showed that this difference between presentation modes occurred only for the ISM rendering, whereas all other renderings yielded comparable ratings for headphones and loudspeakers. This interaction is visible in
Figure 5 as a downward shift of ISM ratings for Envelopment under loudspeaker presentation compared to headphones. In addition, a significant Environment × Presentation Mode interaction was found for the item Reverberation. Here, the Pub environment showed a presentation-dependent difference: under Headphone presentation, reverberation was rated slightly lower (−4.2 points relative to the reference), whereas under Loudspeaker presentation it was rated higher (+6.7 points). No such differences between presentation modes were observed in the other environments.
Overall, these findings suggest that the influence of presentation mode on the spatial audio quality ratings was limited and primarily affected perceived Envelopment, particularly for the ISM rendering and in the Pub environment.
3.5. Externalization
Figure 6 shows the mean externalization ratings (representing a re-plot of the raw data from [42]) for each Rendering condition, separated by Presentation Mode (upper two rows, Headphones; lower two rows, Loudspeakers) and by Stimulus (Speech, Pulse). The color coding follows the convention of the previous figures: green = Measured (headphones only), red = RAZR, blue = ISM, and yellow = Diotic. The circles represent individual results, and the box plots show the median, the interquartile range, and the overall range. The average test–retest correlation was 0.84, and results were pooled across repetitions. Asterisks indicate significant pairwise differences between renderings within each Environment, Stimulus, and Presentation Mode, obtained from post-hoc comparisons following a four-way rm-ANOVA [Stimulus (Speech, Pulse) × Environment (Living Room, Pub, Underground Station) × Presentation Mode (Headphones, Loudspeaker) × Rendering (RAZR, ISM, Diotic)]. An additional two-way rm-ANOVA [Environment (Living Room, Pub, Underground Station) × Rendering (Measured, RAZR, ISM, Diotic)] was conducted to analyze differences between Measured and the simulations for the headphone data.
Across both Stimuli, RAZR and Measured produced the highest externalization ratings, followed by ISM. This pattern is consistently visible in
Figure 6 as higher median ratings for RAZR and Measured across all environments. For headphone measurements, Diotic was the least externalized simulation, while it was among the most externalized simulations for the Loudspeaker presentations. This pronounced reversal between presentation modes is clearly visible in
Figure 6 for both stimuli and across all environments. For Speech, externalization was generally weak under headphone presentation across all renderings, suggesting that listeners found speech particularly prone to in-head localization in this playback condition. As expected, the Diotic rendering yielded the poorest externalization in this condition.
In the following, only main effects and first-order interactions from the four-way rm-ANOVA are reported. The additional rm-ANOVA on the headphone data, which included Measured, did not reveal any additional relevant findings. The four-way rm-ANOVA revealed significant main effects of Presentation Mode [F(1, 7) = 9.39; p = 0.018; η² = 0.57] and Rendering [F(2, 14) = 8.04; p = 0.005; η² = 0.54]. Overall, sounds were rated as more externalized under loudspeaker presentation than under headphone presentation, indicating that externalization benefited from the additional spatial cues available during loudspeaker playback.
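As a rough plausibility check on the effect sizes reported above, the following sketch recovers an effect size from an F statistic and its degrees of freedom. It assumes that the reported values are partial eta squared, which the text does not state explicitly; the function name is illustrative.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared computed from an F ratio and its degrees of freedom.

    Uses the identity eta_p^2 = (F * df_effect) / (F * df_effect + df_error).
    Assumption: the effect sizes in the text are partial eta squared.
    """
    if df_effect <= 0 or df_error <= 0:
        raise ValueError("degrees of freedom must be positive")
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F statistics as reported for the four-way rm-ANOVA on externalization.
presentation_mode = partial_eta_squared(9.39, 1, 7)
rendering = partial_eta_squared(8.04, 2, 14)

print(f"Presentation Mode: {presentation_mode:.2f}")
print(f"Rendering:         {rendering:.2f}")
```

Under this assumption, the Presentation Mode value matches the reported 0.57. The Rendering value comes out marginally below the reported 0.54, which could reflect rounding of the F statistic or a different effect-size variant.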
Post-hoc comparisons confirmed that RAZR was rated significantly more externalized than ISM (
p < 0.05), whereas the difference between RAZR and Diotic did not reach significance. This result is reflected in
Figure 6 by the consistently lower median externalization ratings for ISM compared to RAZR across environments.
A significant Environment × Rendering interaction in the four-way rm-ANOVA was observed. In the Living Room, ISM was rated significantly less externalized than RAZR, while in the Pub and Underground Station, no significant differences between renderings were found. Visually, this interaction manifests as larger separation between ISM and RAZR in the Living Room panels compared to the other environments. This pattern suggests that the rendering-dependent differences in perceived externalization were most pronounced in less reverberant acoustic contexts. A significant Presentation Mode × Rendering interaction was also found in the four-way rm-ANOVA. Post-hoc tests revealed that this interaction was driven solely by the Diotic condition: for headphone presentation, Diotic signals were rated strongly internalized (mean = 18.8), whereas for loudspeaker presentation, the same condition yielded substantially higher externalization ratings (mean = 63.1; md = 44.4,
p < 0.001). This strong interaction is directly observable in
Figure 6 as a marked upward shift of Diotic ratings from the headphone to the loudspeaker panels. For RAZR and ISM, no significant differences between headphone and loudspeaker presentation were observed.
Taken together, these results show that perceived externalization was strongly influenced by both presentation mode and rendering method. Externalization was highest for RAZR and moderately reduced for ISM, whereas Diotic was rated lowest under headphone presentation but among the most externalized renderings under loudspeaker presentation. The most pronounced rendering effects occurred under headphone presentation and in the Living Room environment.
4. Discussion
4.1. Role of ALOD Across Different Perceptual Measures and Stimuli
The perceptual influence of the ALOD varied considerably across perceptual measures, Environments, Stimulus types, and Rendering conditions. For the speech stimulus, plausibility judgments were relatively robust to reductions in ALOD as long as essential acoustic cues such as the overall reverberation time and the basic pattern of early reflections were maintained. In both the Living Room and Underground Station, even the simplest ISM rendering achieved plausibility ratings close to those obtained with the measured BRIRs. This suggests that for speech, which contains limited spectral content and temporal modulations lacking strong transients, perceptual plausibility does not rely on a highly detailed representation of the reverberant sound field as long as the general acoustic characteristics of the environment are preserved.
SI exhibited a similar pattern of robustness. Across all Environments, SRTs remained stable for most Renderings and showed no significant degradation compared to the measured reference. This aligns with established evidence that the coarse pattern of early-arriving reflections and the direct-to-reverberant energy ratio, both of which are maintained in all renderings, are dominant factors affecting speech perception in rooms [
52]. Only in the Underground Station did ISM lead to significantly higher SRTs than for Measured. This is likely related to an unrealistically sparse pattern of reflections obtained in the shoebox model of the room, in connection with an overly slow increase in echo density. This property of the ISM has been shown for a large room volume in Ewert et al. [
18] using the normalized echo density (NED; [
74]), demonstrating that a slow buildup of echo density reduces the temporal and spatial continuity of the reverberant field, which likely impaired SI in this acoustically complex space.
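To illustrate how echo density buildup can be quantified, the sketch below implements one common formulation of the normalized echo density: the weighted fraction of samples in a sliding window whose magnitude exceeds the window's standard deviation, divided by the fraction expected for Gaussian noise. The 20 ms Hann window and the function name are assumptions made for this example and are not taken from the study.

```python
import math

import numpy as np


def normalized_echo_density(h, fs, win_ms=20.0):
    """Sliding-window normalized echo density of an impulse response.

    Values near 1 indicate Gaussian-like (dense) reverberation; values near 0
    indicate sparse, isolated reflections. Window length and Hann weighting
    are assumptions of this sketch.
    """
    h = np.asarray(h, dtype=float)
    n = int(round(win_ms * 1e-3 * fs)) | 1        # force an odd window length
    half = n // 2
    w = np.hanning(n)
    w /= w.sum()                                  # weights sum to one
    expected = math.erfc(1.0 / math.sqrt(2.0))    # ~0.3173 for Gaussian noise
    ned = np.full(len(h), np.nan)                 # edges stay undefined
    for t in range(half, len(h) - half):
        seg = h[t - half : t + half + 1]
        sigma = math.sqrt(np.sum(w * seg**2))     # weighted standard deviation
        ned[t] = np.sum(w * (np.abs(seg) > sigma)) / expected
    return ned


# Synthetic examples (not measured data): dense noise versus sparse impulses.
fs = 8000
rng = np.random.default_rng(0)
dense = rng.standard_normal(fs // 2)
sparse = np.zeros(fs // 2)
sparse[::400] = 1.0

print("dense: ", np.nanmean(normalized_echo_density(dense, fs)))
print("sparse:", np.nanmean(normalized_echo_density(sparse, fs)))
```

A response whose profile approaches 1 only slowly corresponds to the slow buildup of echo density described above, whereas a diffuse tail reaches values close to 1 early on.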
The sensitivity to ALOD increased markedly for the Pulse and Bass stimuli, where transient onsets and broadband spectral content accentuated differences between Renderings. Plausibility ratings dropped sharply for ISM in all Environments, and Overall Difference ratings showed the largest deviations from the reference for ISM and RAZR-Simple. These findings demonstrate that renderings with reduced ALOD, such as fewer early reflections, missing scattering effects, or overly regular reflection patterns, fail to reproduce the fine temporal and spectral cues that are perceptually relevant for transient or broadband stimuli.
For the spatial audio quality items, which were evaluated using only the “most critical” Pulse stimulus, the effects of reduced ALOD were particularly evident in the Living Room: Both ISM and RAZR-Simple were perceived as significantly different from the reference, particularly for Distance, Reverberation, and Metallic Tone Color. These deviations point to inaccuracies in the representation of coupled-room effects and the spatial distribution of early reflections. In RAZR-Simple, the simplified coupled-room configuration did not yield a perceptually equivalent reproduction of the reference, underlining that modeling simplifications in geometrically complex Environments directly impair perceived spatial audio quality. Although the Overall Difference results in
Figure 4 revealed significant perceptual deviations between Measured and RAZR, these differences were not mirrored in the ratings of the individual spatial audio quality attributes. This suggests that Overall Difference may reflect a composite percept that integrates several of the tested spatial audio quality attributes, even when their individual contributions (e.g., spectral coloration) did not reach significance or were rated as different from the reference but in inconsistent directions. Alternatively, the underlying spatial audio quality differences may not have been fully captured by the seven attributes selected here. Future studies could address this by expanding the attribute set or by applying data-driven analyses, such as multidimensional scaling, to identify the dominant perceptual dimensions contributing to global judgments of simulation fidelity.
The influence of ALOD reductions in RAZR-Simple was particularly evident in the Pub, where Overall Difference ratings showed larger deviations from the reference. These deviations can be attributed to the omission of two nearby reflecting surfaces, a chalkboard and a table, resulting in a perceptible change in spatial impression. Listeners reported a missing lateral reflection from approximately 30–40° to the left of the source direction, where the chalkboard is positioned. This omission was particularly salient under loudspeaker presentation, where natural binaural cues and head movements enhanced spatial resolution. The higher Overall Difference ratings for RAZR-Simple under loudspeaker playback thus likely reflect the absence of these distinct early reflections.
In contrast, for the Underground Station, RAZR-Simple differed from RAZR through the omission of the dual-slope decay in the late reverberation tail. Despite this physical simplification, no perceptual differences were found between the two renderings. Analysis of the normalized energy decay curves shows that the secondary slope began only at a very low energy level (below roughly −40 dB), where its contribution to perceived reverberant character is minimal. Consequently, both versions produced nearly identical decay behavior within the time-intensity region most relevant to perception. This explains why the absence of the dual-slope tail did not lead to perceptual degradation, even though the underlying acoustic model was simplified.
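The decay analysis described above is commonly based on Schroeder backward integration of the squared impulse response. The sketch below computes a normalized energy decay curve for a synthetic dual-slope envelope; the decay times, the −45 dB offset of the second slope, and the function name are hypothetical values chosen for illustration, not parameters of the Underground Station.

```python
import numpy as np


def energy_decay_curve_db(h):
    """Normalized energy decay curve (Schroeder backward integration) in dB.

    Assumes the response does not end in exact zeros, since the logarithm
    of a zero remaining energy is undefined.
    """
    h = np.asarray(h, dtype=float)
    remaining = np.cumsum((h**2)[::-1])[::-1]   # energy remaining after each sample
    return 10.0 * np.log10(remaining / remaining[0])


# Synthetic dual-slope amplitude envelope (hypothetical parameters).
fs = 8000
t = np.arange(0, 3.0, 1.0 / fs)
early = 10.0 ** (-3.0 * t / 0.8)                # 60 dB decay within 0.8 s
late = 10.0 ** (-45.0 / 20.0) * 10.0 ** (-3.0 * t / 2.5)  # starts 45 dB lower
dual_slope = np.sqrt(early**2 + late**2)

edc_dual = energy_decay_curve_db(dual_slope)
edc_single = energy_decay_curve_db(early)

# Largest deviation between both curves while the decay is above -30 dB.
audible = edc_single > -30.0
print(np.max(np.abs(edc_dual[audible] - edc_single[audible])))
```

Comparing the two curves over the upper part of the decay gives a simple indication of how much a late second slope alters the portion of the decay that is most likely to be audible.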
Externalization results provided additional evidence that variations in the ALOD influence spatial aspects of auditory perception. RAZR produced the highest externalization ratings, followed by ISM, while Diotic signals were perceived as strongly internalized. The reduced externalization for ISM suggests that missing spatial detail and inaccurate room-related cues limit the perception of external auditory space. Similar tendencies have been reported in studies linking spatial externalization to the precision of binaural and room-related cues [
49]. Interestingly, under loudspeaker presentation, the Diotic condition was in most cases rated as more externalized than the simulated renderings across all three environments, with the exception of the Pub for the Pulse stimulus. This finding may be explained by the loudspeaker presentation setup. In the Diotic condition, a single loudspeaker positioned 2.4 m in front of the listener was used. For some participants, this clear frontal localization may have been perceived as more external than the more spatially diffuse sound fields generated by the multichannel renderings, which distribute acoustic energy across multiple loudspeakers from different directions. Such increased spatial diffuseness can reduce externalization by weakening the consistency between auditory and spatial cues [57,75].
One important difference between the current ALOD variations and earlier studies, such as Martin et al. [43] and Abd Jalil et al. [29], is that the current acoustic model was always based on a “proxy” shoebox simplification of each room, as outlined above. Thus, even with the highest ALOD, the pattern of early reflections was always approximate rather than geometrically exact. Despite this simplification, the RAZR rendering produced plausibility, SI, spatial audio quality, and externalization ratings that closely matched those obtained with the measured reference, confirming that perceptual agreement can be achieved even when using simplified geometries, provided that key temporal and spectral room characteristics are accurately represented (see also the results for RAZR in [
21]; simulation B). Nevertheless, the shoebox approximation becomes more critical for lower ALOD conditions, particularly for ISM, where the lack of geometric detail led to sparse and overly regular reflection patterns. Such patterns reduce echo density and temporal diffusion, impairing the continuity of the reverberant field, as discussed above and shown for large-volume spaces in [
18]. This highlights that while high-ALOD shoebox simulations can reproduce perceptual realism when properly parameterized, further simplifications risk perceptual degradation due to insufficient spatial and temporal reflection density.
Taken together, these findings indicate that the perceptual importance of ALOD depends strongly on the specific perceptual measure as well as on the stimulus type. Measures associated with speech as a stimulus and speech communication in general, such as plausibility and intelligibility, are comparatively tolerant to rendering simplifications as long as the overall reverberation time, the balance between direct and reverberant energy, and the general distribution of early reflections are preserved. In contrast, broadband and transient stimuli such as the Pulse emphasized differences in reflection timing, coloration, and spatial coherence, leading to stronger perceptual deviations when ALOD was reduced, as also found in [
67]. For the Bass stimulus, overall difference ratings were lower across environments and renderings. Across all stimuli, higher ALOD was particularly critical for perceptual dimensions relying on fine spatial or spectral cues, such as spatial audio quality attributes and externalization. The results for RAZR-Simple further demonstrate that reducing geometric detail, particularly in coupled spaces, can lead to noticeable perceptual deviations.
4.2. Role of Environment
The acoustic Environment had a pronounced influence on the sensitivity to changes in ALOD across most perceptual measures. The Living Room consistently produced the largest deviations from the reference, while the Underground Station and the Pub generally yielded smaller or more stable differences. The Living Room was particularly challenging to simulate due to its coupled-room configuration, which introduces complex reflection patterns and modal interactions that are difficult to reproduce accurately with simplified geometric models.
For plausibility and spatial audio quality, both ISM and RAZR-Simple produced perceptually larger deviations from the reference in the Living Room than in the other Environments. The results suggest that the critical coupled-volume geometry amplified perceptual differences between Renderings, particularly for attributes related to reverberation and tonal coloration. This is consistent with earlier studies demonstrating that the perception of spatial fidelity in coupled or irregularly shaped rooms depends strongly on the correct reproduction of reflection density and decay characteristics [
23,
49].
In contrast, the Pub and Underground Station exhibited greater perceptual tolerance to simplified modeling for plausibility, spatial audio quality, and externalization. In these Environments, the long or diffuse reverberation provided a masking effect that reduced sensitivity to errors in spatial detail. Interestingly, plausibility ratings for ISM in the Underground Station remained close to the reference, whereas SI was reduced due to the unrealistically sparse reflection pattern and slow buildup of echo density characteristic of the ISM, as discussed above. These deficiencies likely disrupted the temporal continuity of the reverberation and thereby impaired speech cue transmission. Similar masking effects of diffuse late energy on spatial perception have been observed in previous studies [
54,
76]. Our current SI results for the Underground Station align well with those reported by Hladek and Seeber [
64], who measured SRTs at the same position (denoted as R1, pos. 16 in their study). Likewise, Schütze et al. [
4] evaluated SI at Position S7 in the real Living Room with the same location for receiver and target source that was used in our study, and reported SRTs comparable to our measured reference. These findings support the validity of our measurements.
The Environment also influenced the results for individual spatial audio quality attributes. Perceived Distance and Reverberation were rated most different from the reference in the Living Room, whereas Source Width and Metallic Tone Color showed smaller variations across Environments. These findings correspond to the significant Environment × Rendering interactions, which were mainly driven by the ISM rendering. The ISM produced particularly large deviations in the Living Room, indicating that its reverberation was more affected by the acoustic complexity of the environment. In contrast, the other renderings yielded more consistent ratings across Environments, suggesting that the perceptual impact of ALOD reductions depends on both the simulated environment and the rendering method.
For externalization, the influence of the Environment was limited overall but became apparent in the Living Room, where ISM was perceived as significantly less externalized than RAZR, as the Environment × Rendering interaction showed. This indicates that in acoustically complex but less diffuse spaces such as the Living Room, listeners were more sensitive to rendering inaccuracies that disrupt binaural or room-related cues, which are known to support externalization [
56,
75]. In contrast, in the more reverberant Pub and Underground Station, late reflections masked these differences, leading to comparable externalization ratings across renderings.
A closer examination of the Underground Station data revealed that externalization ratings for speech were generally low across all renderings under headphone presentation. This finding can be explained by the room’s specific acoustic characteristics: although it exhibits a long reverberation time, the overall reverberation level is comparatively low. As a result, much of the reverberant energy remains masked by the speech and therefore contributes little to externalization [
28,
55]. In contrast, for the Pulse stimulus, the reverberant decay was always audible, allowing late reflections to enhance spatial impression and perceived externalization. In this condition, the Measured rendering produced slightly higher externalization ratings than RAZR, likely reflecting the additional fine spatial and spectral details contained in the measured room responses.
Taken together, these findings demonstrate that the acoustic properties of the Environment shape how listeners evaluate rendering fidelity. In acoustically simple or diffuse spaces, errors in simulation detail are perceptually less relevant due to reverberant masking. However, in Environments with coupled volumes, strong directional reflections, or complex geometries, perceptual sensitivity to rendering accuracy increases considerably. Consequently, the required level of ALOD depends not only on the perceptual measure but also on the acoustic structure of the Environment being reproduced.
4.3. Role of Presentation Mode
Presentation mode affected spatial perception differently across the evaluated measures. For Overall Difference, headphone and loudspeaker presentations yielded statistically similar ratings, indicating that the perceived global similarity between renderings was largely independent of presentation mode. This suggests that participants were able to form consistent judgments about rendering fidelity even when the spatial information was conveyed through binaural headphone playback rather than through free-field loudspeaker reproduction. For measures directly related to spatial hearing, such as externalization, however, presentation mode had a pronounced effect, with loudspeaker playback generally enhancing spatial impression.
In contrast, several spatial audio quality attributes were modestly affected by presentation mode. One notable difference appeared for Envelopment by Reverberation, where ratings for ISM were, on average, lower than those for the other Renderings under loudspeaker presentation, a pattern that was not observed for headphones. This reduction likely reflects the additional spatial cues available in the other Renderings within the physical playback environment, which may have altered listeners’ perception of the simulated reverberant field [
28,
55]. For the ISM rendering, this effect was particularly pronounced, suggesting that limited spatial coherence in the simulation was perceptually masked under headphones but revealed more clearly when compared to the real acoustic context of loudspeaker playback. A smaller, environment-specific presentation mode effect also occurred in the Pub, where reverberation was rated slightly weaker under headphones and stronger under loudspeakers, possibly due to differences in the integration of simulated and real spatial cues.
The influence of presentation mode was most apparent for Externalization, which showed a strong and systematic dependence on playback condition. Externalization ratings were generally higher under loudspeaker presentation, reflecting the contribution of naturally occurring head movements and dynamic spatial cues that reinforce external perception [
56,
77]. Under headphone playback, where such cues were absent, listeners were more susceptible to in-head localization, especially for the Diotic rendering. This rendering produced strongly internalized images during headphone presentation but was perceived as substantially more externalized when played over loudspeakers, confirming that the availability of congruent spatial cues from the playback setup can compensate for missing binaural information. In contrast, RAZR and ISM maintained stable externalization ratings across presentation modes, indicating that their spatial renderings provided sufficient directional consistency to support externalization even under headphone reproduction. Beyond head-movement-related dynamic cues, externalization under headphone reproduction is influenced by several additional factors. While a spectrally smoothed headphone equalization filter was applied in the present study to reduce spectral coloration differences between the renderings, externalization remains sensitive to mismatches between a listener’s individual HRTFs and the non-individualized HRTFs used for binaural rendering. Such mismatches can lead to spectral inconsistencies that weaken distance and externalization cues, particularly under static listening conditions. Previous work has shown that even when headphone transfer functions are equalized, deviations from individual binaural cue patterns can substantially reduce perceived externalization [
56,
77]. In contrast, loudspeaker playback naturally provides individualized spectral cues and dynamic head-movement information, which likely contributed to the higher externalization ratings observed in this condition.
Taken together, these findings demonstrate that presentation mode primarily affects perceptual dimensions that depend on dynamic or spatially extended cues, such as externalization and envelopment. When accurate room simulation is available, headphone reproduction can yield perceptually equivalent results for most spatial audio quality attributes, but for the perception of spatial extent and realism, loudspeaker playback continues to offer a distinct advantage.
4.4. Comparison to Geometry Decimation and Outlook
Starting from the current highest ALOD, which already includes substantial geometric simplifications by using a proxy rectangular room, the results showed that, particularly for speech, further reductions of ALOD, such as reducing the number of early reflections or omitting scattering effects as in the ISM, were perceptually tolerable. It should be noted that this lowest tested ALOD (ISM) would actually be more computationally demanding than the highest ALOD (RAZR), again underlining that the present definition of ALOD is not based on geometric or computational complexity but rather on perceptual properties of the room simulation. The inclusion of nearby reflecting surfaces in the Pub environment specifically added geometric detail to the room acoustics simulation, improving over the basic shoebox simplification for early reflections. In the current results, this addition proved perceptually important, as the omission of these reflectors in RAZR-Simple led to clearly detectable spatial differences under loudspeaker presentation.
Bridging the gap to studies that have investigated geometry decimation, e.g., [
29,
43], future research should extend the current approach by increasing geometric detail stepwise from the shoebox approximation to assess potential benefits in conditions where the real environment cannot be meaningfully modeled by connected rectangular rooms, such as connected corridors, obstructing structures, or columns. The role of edge diffraction has already been examined in recent studies, e.g., [
78], and our current acoustic modeling framework has been extended to cover arbitrary geometry and edge diffraction using computationally efficient filter-based approximations [
47,
48,
79,
80].
4.5. Methodological Limitations and Generalizability
A potential methodological limitation of the present study is that all participants had prior experience with psychoacoustic listening tests. While this may restrict direct generalizability to the broader population, listener expertise was considered advantageous for the attribute-based perceptual tasks employed here. The evaluation of sound quality attributes such as spectral coloration, envelopment, or source width requires a consistent interpretation and use of perceptual rating scales, which has been shown to benefit from prior training and listening-test experience [
27,
28].
Previous research has demonstrated that naïve listeners are often able to detect perceptual differences, but may exhibit increased variability and reduced consistency when assigning ratings to specific auditory attributes or when interpreting scale directions, particularly for abstract spatial or timbral descriptors [
81]. Standardized listening-test methodologies therefore commonly recommend the use of trained listeners for analytical quality assessments to improve reliability and reduce response variance [
73].
However, the limited number of statistically significant effects observed for sound quality attributes in the present study, despite clearly detectable overall differences, may cast some doubt on the level of experience of the listeners. Alternatively, it may indicate that these attributes are only weakly affected by the tested reductions in ALOD. This interpretation is consistent with a sensitivity power analysis of the repeated-measures design, which indicated that the available sample size (N = 8) was sufficient to detect medium to large within-subject effects, but not small effects. Accordingly, subtle perceptual differences and higher-order interactions may remain undetected, particularly for attribute-based sound quality ratings that are known to exhibit substantial inter-individual variability. Nevertheless, future work could complement the present findings by including naïve or more heterogeneous listener groups, particularly for preference-based judgments or evaluations of ecological validity in everyday listening contexts.
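The idea behind a sensitivity analysis can be sketched with a deliberately simplified calculation: for a fixed sample size, significance level, and target power, solve for the smallest standardized effect that can be detected. The example below uses a normal approximation for a single two-sided paired comparison. The function name, the significance level of 0.05, and the target power of 0.80 are assumptions for illustration; this is not the repeated-measures analysis reported in the study, which additionally benefits from multiple conditions and correlated measurements.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_d(n: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate smallest standardized within-subject effect (Cohen's d_z)
    detectable with n listeners in a two-sided paired comparison.

    Uses the normal approximation d = (z_{1-alpha/2} + z_{power}) / sqrt(n),
    which is optimistic for small n compared with an exact t-based result.
    """
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / sqrt(n)


# Hypothetical sample sizes; N = 8 matches the present listener panel.
for n in (8, 16, 32):
    print(n, round(min_detectable_d(n), 2))
```

Under these simplifying assumptions, the detectable effect shrinks only with the square root of the number of listeners, which illustrates why small effects require considerably larger panels than the one available here.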
Participants were informed during the familiarization phase whether example stimuli were based on measured or simulated BRIRs. Importantly, familiarization stimuli were generated within the same environments but from source and receiver positions different from those used in the actual measurements, such that the familiarization stimuli were acoustically distinct from the test stimuli. While this procedure may have introduced expectations during familiarization, the use of different spatial configurations limits the possibility that participants learned fixed perceptual cues associated with the measurement condition.
A further limitation of the present study is that all simulations were generated using a single room-acoustic simulation framework, namely RAZR. While this allowed for a controlled and systematic manipulation of ALOD within a consistent modeling environment, the extent to which the present findings generalize to other room-acoustic simulation approaches remains an open question. Other algorithms, such as ray-tracing, hybrid, or wave-based models, may differ in how specific modeling simplifications affect perceptual outcomes [
29,
36,
46]. Nevertheless, the present results may be considered representative of image source methods. The comparison to the “straightforward” higher-order ISM and the systematic reduction of image source order in RAZR underline the importance of correctly accounting for the late reverberant tail, an observation that should generalize beyond the specific implementation used here.
In addition, it should be noted that even the highest ALOD condition investigated in the present study was based on a simplified proxy shoebox geometry. This approximation has been shown to provide perceptually plausible results for a range of indoor environments: the underlying RAZR framework has previously been evaluated in a broader context, including geometrically complex real-room scenarios, for example in the room-acoustical simulation and auralization round robin reported by Brinkmann et al. [
21]. Nevertheless, the applicability of the specific ALOD manipulations investigated here to geometrically more complex spaces, such as environments with irregular layouts, obstructing structures, columns, or strongly non-rectangular geometries, remains an open question. Particularly for curved surfaces and vaulted spaces, e.g., [
82,
83], the underlying proxy shoebox approach is not able to cover sound focusing and spreading. In such cases, additional geometric detail or alternative modeling approaches, e.g., wave-based modeling, may be required to preserve spatial cues related to early reflections, diffraction, and directional energy flow. For low frequencies, wave-based approaches are expected to show better agreement with measurements, and hybrid wave/ray-based approaches [
82,
84] have shown a good agreement between measurement and simulation for the entire audio frequency range, with modeling detail depending on simulation paradigm and frequency range. Future work should replicate the present experimental paradigm using different simulation engines to assess the robustness of the observed trends across modeling approaches and to identify algorithm-specific sensitivities.
Finally, head tracking was not employed during the headphone-based experiments, and all binaural renderings were presented under static listening conditions. As a result, the present findings primarily apply to scenarios without dynamic listener–environment interaction, such as conventional headphone listening or VR/AR applications with limited or unavailable head tracking. Dynamic head movements are known to provide additional spatial cues that enhance externalization and spatial perception. Consequently, the perceptual effects of ALOD reductions observed in the present study may be altered in systems employing accurate real-time head tracking.