2. Materials and Methods
To obtain subjective evaluations from participants, listening tests were conducted in which driving scenes were simulated in laboratories using a recording-based simulation approach. To prepare stimuli for the listening test, driving-by-scenes were recorded in two simulation environments using a dummy-head setup (details below). Binaural audio recording allows realistic playback in later listening experiments. To create an experimental setup that allows for the comparison of different laboratories, headphones were used for playback.
2.1. Laboratories
As described in
Section 1.1, playback environments such as HOA and WFS systems are able to reproduce distinct sound spatialization as well as complex sound fields from real environments [4]. With smaller speaker spacing, the aliasing frequency of a WFS system increases, so that the unbiased reproduction extends into a higher frequency range. For this reason, the distances between the speakers in the WFS at TUD are set as small as possible. A higher order of an HOA system yields more spherical harmonic functions; in such a system, the sound field is approximated over a broader area around the listener’s position. With the 15th-order system at JGU Mainz, a detailed sound field is reproduced in the middle of the loudspeaker array.
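As a rough illustration of this relationship (our own sketch, not part of the study's tooling), the common rule of thumb for the spatial aliasing frequency of a loudspeaker array can be written as:

```python
# Sketch (not the authors' code): relation between loudspeaker spacing and
# the spatial aliasing frequency of a WFS array, f_alias ~ c / (2 * d).
SPEED_OF_SOUND = 343.0  # m/s, at roughly 20 °C

def wfs_aliasing_frequency(spacing_m: float) -> float:
    """Upper frequency limit below which the array reproduces the
    sound field without spatial aliasing (common rule of thumb)."""
    return SPEED_OF_SOUND / (2.0 * spacing_m)

# Halving the spacing doubles the aliasing frequency, so unbiased
# reproduction extends further up in frequency.
f_6cm = wfs_aliasing_frequency(0.06)   # tweeter spacing at TUD
f_12cm = wfs_aliasing_frequency(0.12)
print(round(f_6cm))  # 2858, close to the ~3000 Hz stated in the text
```

This is why the small tweeter spacing at TUD pushes the unbiased reproduction range up to approximately 3 kHz.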
2.1.1. Wave Field Synthesis (WFS) at TUD Dresden University of Technology
Perceived sound not only activates our sensations; it also acts as a carrier of environmental information. Virtual-reality technology offers real-time interaction with computer-generated environments, facilitating the flexible and cost-effective presentation of complex experimental scenarios. The Multi Modal Measurement Laboratory of TUD Dresden University of Technology [
17] spans 24 square meters, and is characterized by a rectangular shape, with walls inclined by 3° to minimize fluttering echo. Acoustical treatments aligned with standards such as ITU-R BS1116.1 and DIN 15996 [
18,
19] have been attached to the walls and ceiling. Perforated metal sheets, covering 20% of the area, along with strategically placed Helmholtz resonators in the corners, provide an acoustically enhanced treatment over a wide frequency range. For audio reproduction, a WFS system from IOSONO was installed. The setup comprises 464 loudspeakers and 4 subwoofers, with each loudspeaker panel housing 6 tweeters and 2 mid-range speakers. To mitigate the spatial aliasing effect, the tweeters were placed 6 cm apart, raising the aliasing frequency to approximately 3000 Hz, i.e., into the higher frequency range. Each loudspeaker is driven individually, so in total, 468 separate channels are handled. The control unit oversees signal processing, communication with audio servers, and routing systems handling internal and external connections. Eight rendering PCs, executing the WFS algorithm in real time, are housed in an acoustically isolated external server room, ensuring optimal performance. Calibration is a crucial step which ensures that scenes are played back at their original level, reproducing real traffic conditions accurately. To calibrate the system, pink noise from 125 Hz to 8000 Hz with a sound pressure level (SPL) of 80 dB is used. This pink noise is projected from a specific point positioned eight meters in front of the listener and captured at the listener’s position using a free-field microphone (capsule type 4188; preamp type 2671, Brüel & Kjaer, Copenhagen, Denmark). By comparing the recorded SPL with the reference calculated from the source distance and the principles of sound propagation, the level difference is identified. This facilitates adjustments within the IOSONO control unit to align the overall reproduced volume with the calculated reference. To fine-tune the frequency spectrum of the system, an equalizer application developed by Beyer et al. [
20] is utilized. This application allows for the creation of individual filters to balance sound characteristics that could arise from room acoustics and treatments such as metal sheets. By selecting pink noise as a reference sound and analyzing both the original stimuli and recorded audio files, the filter magnitudes can be adjusted manually for each third-octave band by visually comparing Fast Fourier Transform (FFT) representations. These filter coefficients are stored for subsequent use, ensuring the signal is appropriately filtered before playback via the WFS system.
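The calibration logic described above can be sketched as follows (a minimal illustration assuming spherical spreading from a point source; the actual IOSONO calibration procedure is not reproduced here, and the numeric values are hypothetical):

```python
import math

# Sketch of the calibration logic (assumed, not the IOSONO implementation):
# predict the SPL at the listener for a source 8 m away using spherical
# spreading (-6 dB per distance doubling), then derive the gain correction
# from the SPL actually recorded at the listener position.

def predicted_spl(source_spl_at_1m: float, distance_m: float) -> float:
    """Free-field SPL at `distance_m` for a point source, 1 m reference."""
    return source_spl_at_1m - 20.0 * math.log10(distance_m / 1.0)

def gain_correction_db(reference_spl: float, measured_spl: float) -> float:
    """Level adjustment to apply in the playback system."""
    return reference_spl - measured_spl

# Hypothetical numbers: 80 dB pink noise defined at 1 m, source 8 m in front.
ref = predicted_spl(80.0, 8.0)  # about 61.9 dB expected at the listener
print(round(gain_correction_db(ref, 60.0), 1))  # 1.9 dB too quiet here
```

The measured-versus-predicted difference is exactly the offset that the control unit must compensate.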
2.1.2. Combination of Higher Order Ambisonics (HOA) and 3D VBAP at Johannes Gutenberg University Mainz
In Mainz, a combination of HOA [
3] and VBAP [
21] was used. The direct sound and the reflections from the ground surface and house fronts were rendered separately. For the direct sound, we used 15th order 2D Ambisonics with
maxRe decoding [
22,
23], played back via 32 Genelec 8020 DPM loudspeakers arranged in a circle with a 4.6 m diameter (spaced 11.25 degrees apart); these were positioned approximately at the participants’ ear height (speaker height 163 cm above the floor). Additionally, a subwoofer was used (Genelec 7360 APM; crossover frequency 62.5 Hz). To simulate the sound reflections from the ground surface and other acoustically reflective surfaces such as house fronts, we used 3D VBAP [
21] with the full loudspeaker array, containing both the aforementioned 32 loudspeakers at ear height and an additional lower ring of 8 Genelec 8020DPM speakers (spaced 45 degrees apart and angled towards the listener’s head; ring diameter 4.6 m, speaker height 87 cm above the floor) and the subwoofer. The array was driven by daisy-chained Ferrofish A32 Pro (24-bit audio resolution,
fs = 44.1 kHz) and Ferrofish Pulse 16 (24-bit audio resolution,
fs = 44.1 kHz) audio converters. The Ferrofish A32 Pro received audio signals via 64-channel MADI from an RME HDSPe MADI audio interface on a computer running TASCAR on Linux. The 32 loudspeakers of the ear-height ring were driven by the Ferrofish A32 Pro. The 8 loudspeakers of the lower ring and the subwoofer were driven by the Ferrofish Pulse 16. For the present experiment, additional dummy-head recordings were made in a configuration in which both the direct and reflected sound was rendered on a subset of 16 equally spaced loudspeakers in the circular array at ear height. A similar 16-speaker setup had been used in earlier experiments in Mainz [
24,
25,
26,
27].
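For reference, the loudspeaker azimuths implied by the description above can be computed as follows (our own geometry sketch; the coordinate convention is an assumption):

```python
import math

# Geometry sketch (assumed from the setup description): azimuths and
# positions of the 32-speaker ear-height ring (11.25 deg spacing, 4.6 m
# diameter) and the 16-speaker subset (every second speaker, 22.5 deg apart).

RING_DIAMETER_M = 4.6

def ring_positions(n_speakers: int):
    """Return (azimuth_deg, x, y) for n equally spaced speakers on the ring."""
    r = RING_DIAMETER_M / 2.0
    out = []
    for k in range(n_speakers):
        az = 360.0 * k / n_speakers
        out.append((az, r * math.cos(math.radians(az)),
                        r * math.sin(math.radians(az))))
    return out

full_ring = ring_positions(32)
subset = ring_positions(16)
print(full_ring[1][0])  # 11.25 (degrees between adjacent speakers)
print(subset[1][0])     # 22.5 (degrees in the reduced 16-speaker setup)
```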
The loudspeaker array was located on one side of a large lab space (15.00 m × 7.05 m). To reduce interference from acoustic reflections, the laboratory area containing the loudspeaker array (8.44 m × 7.05 m) was sound-treated. It was separated from the other side of the lab space by sound-absorbing acoustic curtains (Gerriets Bühnenvelours Ascona 570; 570 g/m²; absorption coefficient of 0.95 at frequencies above 400 Hz). A 20 cm thick layer of Basotect G+ (BASF; absorption coefficient of 0.95 at frequencies above 125 Hz) was attached to the walls and ceiling. To reduce reflections from the floor, a carpet (high-pile, IKEA Stoense) was placed inside the array, on top of a 7 mm layer of felt. In addition, 10 cm thick Basotect G+ panels (BASF; absorption coefficient of 0.95 at frequencies above 400 Hz) were added on top of the carpet.
The TASCAR Speaker Calibration Tool [
28], alongside a sound level meter (Nor131, Norsonic, Oelde, Germany) and a free-field microphone (MP40, Roga, Gotha, Germany), positioned at the center of the loudspeaker array at a height of 165 cm above the floor, was used for the calibration procedure. This ensured compensation for level and spectral differences between each loudspeaker and calibrated sound pressure levels for both point-source and diffuse sound field scenarios.
2.2. Scene Creation
2.2.1. Sub-Source Recordings in a Real Environment
For the recording-based simulations, acoustic recordings were made on an asphalt test track at the Technical University of Darmstadt by Oberfeld et al. [
24]. These recordings, which were used in the present experiment, had been made using two small Kia passenger cars: an internal-combustion-engine vehicle (ICEV) represented by a gasoline-powered Kia Rio 1.0 T-GDI (2019, 1.0 L, 88 kW, 3 cylinders) with manual transmission, equipped with Continental summer tires (ContiSportContact 5, 205/45 R17), and a Kia e-Niro (2019, 150 kW) electric vehicle (EV) with Michelin summer tires (Primacy 3, 215/55 R17) and an Acoustic Vehicle Alerting System (AVAS), which was active for speeds of up to 28 km/h. To capture tire–road noise and powertrain noise, four free-field microphones (MI-17, Roga, Gotha, Germany) were mounted individually on both front tires, the right back tire, and centrally on the engine hood. A GPS antenna (Trimble AG25) was installed centrally on the vehicle’s roof and was connected to a high-performance GPS receiver (JAVAD Triumph LS, recording rate 10 Hz) inside the vehicle. Using the Real Time Kinematic (RTK) method, the GPS position of the vehicle on the test track was recorded with a precision of a few centimeters [
29]. The high precision of the method is achieved by evaluating the carrier phase of the satellite signals, processing signals from at least 5 satellites, and matching the data from the mobile receiver with data from a geostationary reference station. We used a reference station provided by the Hessian State Office for Land Management and Geoinformation within the framework of SAPOS-GPPS (
https://sapos.hvbg.hessen.de/), located about 6 km from the test track. A dummy head (4100D, Brüel & Kjaer, Copenhagen, Denmark) was placed 0.5 m from the side of the road at a height of one meter to record the real driving-by sounds in the real test-track environment. Recordings from this dummy head, corresponding to exactly the same pass-by trials as the source recordings used as input for the simulation systems, were presented as an additional condition in this experiment.
2.2.2. Processing of the Scenes for the VR Approach
With the sub-source recordings, it is possible to transfer traffic scenarios to a virtual environment (VE). The TASCAR (v0.230) software application [
28] enables dynamic processing of the acoustic scene geometry, allowing for the placement of sound sources in a time-dependent manner. At each time step, TASCAR models the sound propagation between sources and receivers, and thus provides physically plausible simulations of various acoustic phenomena, including the directional characteristics of sound sources, distance-dependent changes in sound level due to spherical spreading and air absorption, and the time-varying sound travel time (which may induce Doppler effects). Moreover, TASCAR simulates sound reflections on surfaces such as the ground, utilizing the image sound source method [
30]. It models time-variant comb-filter effects resulting from acoustical interference between reflected and direct sound. The acoustical effects detailed above, with a focus on the test-track reflections, were configured in TASCAR. The recorded sound sources from the test track were positioned in the VE based on the distances between microphones in the real-world setting [
24]. At the JGU Mainz, TASCAR was used for the dynamic vehicle simulations and renderings.
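The first-order image-source construction mentioned above can be sketched as follows (a generic illustration of the method with our own variable names; not TASCAR's internal code):

```python
import math

# Sketch of the first-order image-source construction used for a ground
# reflection: the source is mirrored across the reflecting plane, and the
# reflected path length is the distance from the listener to that image.

def ground_image_source(src, ground_z=0.0):
    """Mirror a source (x, y, z) across the horizontal plane z = ground_z."""
    x, y, z = src
    return (x, y, 2.0 * ground_z - z)

def path_length(a, b):
    """Euclidean distance between two 3D points."""
    return math.dist(a, b)

src = (10.0, 0.5, 1.0)       # sub-source 1 m above the ground
listener = (0.0, 0.5, 1.0)
direct = path_length(src, listener)                       # 10 m
reflected = path_length(ground_image_source(src), listener)
print(round(reflected, 3))  # 10.198, i.e., sqrt(10^2 + 2^2) m
```

The slightly longer reflected path, interfering with the direct path, is exactly what produces the time-variant comb-filter effects described above.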
At TUD, the IOSONO system cannot be controlled by TASCAR; this necessitated the development of a MATLAB toolbox [
20]. This toolbox processes TASCAR scene definition files and, using the same calculations as TASCAR, which are provided in [
28], runs the same vehicle simulations, but with an output that can be used by the IOSONO system.
For all vehicle recordings, the time-series GPS position data are stored in a .csv file, thereby capturing dynamic information about the position of the vehicle. Furthermore, four .wav files containing audio recordings of discrete sub-sources, in this case, tire and engine-hood sounds, are available in the same directory. Each sub-source of the .tsc file (a special file format used for TASCAR) contains key parameters, such as calibration level, positional offset along the vehicle, and microphone characteristics. In MATLAB, the .wav files for each sub-source were loaded and stored as discrete audio objects. In addition, parameters for environmental properties, such as reflections, were configured by mirroring the sub-sources. To this end, a rectangular reflector was defined with a specific width, height, damping value, and reflectivity. In order to emulate realistic sound propagation, each sub-source was virtually mirrored across the reflective surface, and a first-order low-pass filter was applied to approximate acoustic reflection effects, in accordance with the methodology outlined by Grimm [
28]. Subsequently, GPS coordinates and time vectors from the .csv file were transferred into a structured MATLAB format, facilitating precise temporal alignment within the simulation. These elements, along with the combined sounds from the reflector and sub-sources, were compiled into a vehicle object. In the next phase, individual start and stop times were established, and trajectory and sound data were cut accordingly. Following this, room-specific acoustic adjustments were simulated by applying a predefined room equalization filter curve from the WFS laboratory, as described earlier. Calculations were also performed to determine distance delays for each sub-source sound, considering the exact spatial distance between each source and the listener, with the total distance being computed as the vector sum of the longitudinal and diagonal components. Finally, to model air damping, equations from Grimm [
28] were applied to the audio signal, simulating natural attenuation over distance. The processed audio frames, each connected with their exact trajectory coordinates, were stored in a MATLAB struct, preparing them for playback on the IOSONO system at TUD, which facilitated the spatialized auditory experience of the simulation. The single sources were played back as point sources. In general, this toolbox allows for the adjustment of parameters for acoustic effects, position data, and sound sources within vehicle scenes. Its functionality is modeled on that of the TASCAR toolbox.
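The distance-delay and spreading-loss steps of this pipeline can be sketched as follows (our own simplified Python illustration of the kind of calculation the toolbox performs; the frequency-dependent air-damping equations from Grimm [28] are not reproduced here):

```python
import math

# Sketch (our own, following the processing steps described in the text):
# per-sub-source propagation distance from the vector sum of longitudinal
# and lateral offsets, the corresponding delay in samples, and a simple
# 1/r spherical spreading loss.
SPEED_OF_SOUND = 343.0  # m/s

def source_distance(longitudinal_m: float, lateral_m: float) -> float:
    """Total source-listener distance as the vector sum of components."""
    return math.hypot(longitudinal_m, lateral_m)

def distance_delay_samples(distance_m: float, fs: int = 44100) -> int:
    """Propagation delay applied to a sub-source frame, in samples."""
    return round(distance_m / SPEED_OF_SOUND * fs)

def spreading_gain(distance_m: float, ref_m: float = 1.0) -> float:
    """Amplitude factor for spherical spreading relative to `ref_m`."""
    return ref_m / distance_m

# Hypothetical geometry: vehicle 20 m along the road, listener 2.5 m lateral.
d = source_distance(20.0, 2.5)
print(distance_delay_samples(d))  # delay in samples for this sub-source
```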
2.3. Recording Setup in Laboratories
In the listening experiment, participants evaluated the quality of the various playback methods in a single experimental session, with recordings being played back via headphones. For this, recordings were made of the acoustic simulations of the driving scenes. Binaural recordings were made using a dummy head (HSU III.3, Head acoustics, Herzogenrath, Germany) equipped with a shoulder unit. It was positioned at the height of the loudspeaker arrays, with the shoulder unit aligned perpendicular to the direction of travel of the road in the virtual scene. The dummy head, characterized by two ICP condenser microphones and featuring an individual pinna design, was connected to the Head acoustics labHSU interface. This setup allowed for the application of various equalization curves to process the binaural signal, reducing effects such as reflections from the shoulder section. The resulting filtered signal can then be treated as a measurement signal for later analysis. Different equalization filter curves, including diffuse field (DF), free field (FF), and independent of direction (ID), are available. As the laboratories are not treated as anechoic chambers and do not represent free-field conditions, the DF filter was employed. Signals were recorded using the ArtemiS Suite. It is crucial to precisely determine when the vehicle passes by the reference point at the listener’s position. At the JGU Mainz, a sinusoidal marker signal was recorded to indicate this moment through an impulse. At TUD, the scenes were recorded in a simulated environment, with a 15 s interval before and a 5 s interval after the reference point; the scenes were recorded one after another in a single cumulative recording. Subsequently, the individual scenes were extracted from the recording, using time windows of 20 s. This process may have resulted in minor deviations. With this setup and these editing techniques, the driving-by stimuli were recorded in both laboratories.
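The scene-extraction step at TUD can be sketched as follows (our own illustration, not the ArtemiS workflow; parameter names and the toy sampling rate are assumptions):

```python
# Sketch of the scene-extraction step: cut 20 s windows (15 s before, 5 s
# after the reference point) out of one cumulative recording, given the
# sample indices at which the vehicle passes the reference point.

def extract_scenes(samples, marker_samples, fs=44100, pre_s=15.0, post_s=5.0):
    """Return one list slice per marker, pre_s before to post_s after it."""
    scenes = []
    for m in marker_samples:
        start = m - int(pre_s * fs)
        stop = m + int(post_s * fs)
        scenes.append(samples[max(start, 0):stop])
    return scenes

toy_fs = 100                               # toy sampling rate for illustration
recording = list(range(40 * toy_fs))       # 40 s dummy cumulative recording
scenes = extract_scenes(recording, [20 * toy_fs], fs=toy_fs)
print(len(scenes[0]) / toy_fs)  # 20.0 (seconds per extracted scene)
```

Small offsets in the marker indices translate directly into the "minor deviations" mentioned above.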
In addition to the recordings inside the speaker arrays made with the Head Acoustics dummy head, the experiment incorporated recordings of a virtual dummy-head receiver rendered via TASCAR. This receiver simulates a binaural recording and is based on the Spherical Head Model (SHM): the head is modeled as a rigid sphere, capturing effects such as head shadowing and interaural time differences. The height of the receiver was set to one meter, to match the position of the dummy head placed on the test track. The listener was placed 0.5 m from the side of the road, with their shoulders orthogonal to the road’s direction. In this study, a reference point at the listener’s position is relevant: the vehicle passes this reference point, marked as a red line in
Figure 1, at the height of the listener.
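The kind of interaural time difference a rigid-sphere head model produces can be illustrated with the classic Woodworth approximation (a generic sketch; TASCAR's exact SHM implementation may differ):

```python
import math

# Interaural time difference (ITD) under a rigid-sphere head model, using
# the classic Woodworth approximation for a distant source. This is a
# generic sketch of the cue a Spherical Head Model produces, not TASCAR's
# exact formula; the head radius is a typical assumed value.
SPEED_OF_SOUND = 343.0  # m/s
HEAD_RADIUS = 0.0875    # m

def woodworth_itd(azimuth_deg: float) -> float:
    """ITD in seconds for a far source at the given azimuth (0 deg = front)."""
    theta = math.radians(azimuth_deg)
    return HEAD_RADIUS / SPEED_OF_SOUND * (theta + math.sin(theta))

print(round(woodworth_itd(90.0) * 1e6))  # 656 microseconds at the side
```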
2.4. Preparing Listening Test Scenes
For the following listening test, dummy-head recordings were used. They represent the played-back driving-by-scenes associated with five different recording conditions: (1) dummy-head recordings made using the 41-channel HOA and VBAP rendering at JGU Mainz, (2) dummy-head recordings made using the reduced 16-speaker HOA rendering at JGU Mainz, (3) dummy-head recordings made using the Wave Field Synthesis environment at TUD, (4) virtual recordings made of a binaural receiver in TASCAR (i.e., purely virtual), and (5) the original dummy-head recordings which were made on the test track using real vehicles. These original dummy-head recordings were made synchronously with the microphone-on-chassis recordings that were used for the virtual vehicle renderings. During the recording process of the original recordings, various background noises, such as airplanes and loud bird calls, were present. Subsequently, only the scenes that were suitable, i.e., devoid of disruptive noises, were selected. Also, during most recording sessions, there was a significant amount of wind, which could have affected the overall level of the vehicle sound arriving at the dummy head on the test track, and might also have induced level fluctuations due to changing wind speed or wind direction during an approach of the vehicle. To cover a wide range of different vehicle-driving-by sounds and dynamics, scenes of an electric vehicle and a combustion-engine vehicle at different constant speeds were selected. Due to this initial scene selection, necessitated by the disturbing noises in the recording sessions, only a limited number of scenes could be used for the subsequent listening test. As a result, the velocity conditions were not identical for the two vehicle types. For the later quality comparisons between the playback setups, however, this selection of scenes is convenient, because it covers different speeds and vehicle types.
Since the original recordings were made using a Brüel & Kjaer 4100D dummy head, we applied a DF filter to account for differences between the B&K and the Head Acoustics dummy head. In the optimized parametric HRTF model introduced by Schwark et al. [
31], a DF equalization was already applied.
Table 1 specifies the scene parameters that were present for all five playback systems.
A time window spanning 5 s before the car reaches the listener was set, as indicated by the red line in
Figure 1. The scenes were captured using the setup described above and edited using ArtemiS Suite (v13.6). An analysis of the level diagrams of the recorded signals showed that some differences in the reproduced SPL persisted despite system calibration. In light of the numerous studies that have explored the significance of maximum SPL in scene perception, the normalization process focused on matching maximum levels. To account for perception and frequency-dependent hearing, A-weighted levels were employed, and the scenes were adjusted to maintain consistent maximum SPL(A) values. The mean values of the recorded scenes for each vehicle type and velocity were computed, and a defined maximum value was established. This approach ensured uniform stimulus presentation, supporting the reliability of the stimuli and the validity of subsequent analyses.
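The level-matching step can be sketched as follows (our own implementation of the standard IEC 61672 A-weighting curve plus a simple gain computation; this is not the ArtemiS processing chain):

```python
import math

# Sketch of the level-matching step (our own implementation, not ArtemiS):
# the IEC 61672 A-weighting in dB at frequency f, and the gain needed to
# bring a scene's measured maximum A-weighted level onto a target value.

def a_weighting_db(f: float) -> float:
    """A-weighting curve R_A(f) in dB (approximately 0 dB at 1 kHz)."""
    f2 = f * f
    ra = (12194.0**2 * f2 * f2) / (
        (f2 + 20.6**2)
        * math.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194.0**2)
    )
    return 20.0 * math.log10(ra) + 2.0  # +2.0 dB normalizes to ~0 dB at 1 kHz

def normalization_gain_db(measured_max_dba: float, target_max_dba: float) -> float:
    """Gain to apply so the scene's max A-weighted level hits the target."""
    return target_max_dba - measured_max_dba

print(round(a_weighting_db(1000.0), 1))  # approximately 0 dB at 1 kHz
print(round(a_weighting_db(50.0), 1))    # roughly -30 dB: strong LF cut
```

The strong low-frequency attenuation of the A-weighting is also why max SPL and max A-weighted SPL can diverge substantially for scenes with much low-frequency content.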
2.5. Signal Analysis
To provide a concrete example, a comparison of the adjusted scenes demonstrates the implications of aligning A-weighted levels. In the alignment process, the maximum mean level values of both dummy-head ear recordings were matched across the different reproduced scenes. However, even with these adjustments, we observed variations of up to 5 dB in the quieter sections of the scenes.
Figure 2 shows the A-weighted levels of the two dummy-head microphones for each played-back vehicle scene.
The two different simulation toolboxes (TASCAR versus TUD MATLAB toolbox), different recording setups in the laboratories and on the test track, and the transfer functions of the three different binaural receivers may have led to differences in the levels that were reproduced. Furthermore, the different room treatments of the laboratories and the semi-free-field condition on the test track played important roles. In addition, a significant proportion of the recordings on the test track were made at higher wind speeds, which likely caused relatively strong wind-related level fluctuations in the vehicle sound recorded by the dummy head on the test track. In contrast, due to the close proximity of the microphones attached to the vehicle’s chassis to the vehicle sound sources, these recordings were virtually unaffected by wind speed, and we excluded recordings by the microphones mounted on the vehicle that contained wind noise.
To obtain a first impression of the reproduction-based characteristics of the recorded scenes, defined psychoacoustical values were calculated, as can be seen in
Table 2.
For the analysis, sharpness was calculated according to Aures and loudness according to ISO 532-1 [
32]. When comparing the max SPL and max A-weighted SPL, the variation in the maximum values across the playback systems becomes evident. See, for instance, the variation in the SPL of the ICEV_10kmh scenes and the large difference between max SPL and max A-weighted SPL. This indicates that (a) there is considerable low-frequency content in this scene (since A-weighting strongly reduces levels in this frequency range, resulting in large differences between SPL and A-weighted SPL) and (b) the playback methods differ in their ability to reproduce this low-frequency content (as visible from the differences in SPL). Additionally, sharpness values differ among the scenes. Sharpness arises from a high proportion of high-frequency energy relative to the total energy. The highest sharpness values were found for the test-track recordings. When comparing the 41-channel HOA-VBAP and WFS systems, differences are apparent, with WFS exhibiting slightly higher sharpness values; the WFS system thus tends to reproduce sound with a different ratio of high- to low-frequency content. The loudness values are generally similar across the playback environments. Combined, these values highlight differences in audio reproduction across the playback systems and the real-world recordings. The variations in sharpness and loudness suggest that laboratory simulations and real-world scenarios can yield different perceptual experiences. In the following listening test, we investigated whether different level or frequency characteristics influence the perception of the driving-by-scenes.
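As a crude intuition for the sharpness differences (this is deliberately not the Aures model used in the analysis, which involves critical-band loudness weighting), sharpness increases with the share of signal energy at high frequencies:

```python
# Crude illustration only (not the Aures sharpness model): sharpness grows
# with the share of signal energy in high-frequency bands, here approximated
# by the energy ratio above a 2 kHz cutoff for toy two-tone signals.

def high_freq_energy_ratio(components, cutoff_hz=2000.0):
    """components: list of (frequency_hz, amplitude) pairs."""
    total = sum(a * a for _, a in components)
    high = sum(a * a for f, a in components if f >= cutoff_hz)
    return high / total

dull = [(200.0, 1.0), (4000.0, 0.2)]    # mostly low-frequency energy
sharp = [(200.0, 0.2), (4000.0, 1.0)]   # mostly high-frequency energy
print(high_freq_energy_ratio(dull) < high_freq_energy_ratio(sharp))  # True
```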
2.6. Listening Test
2.6.1. Attributes
While existing studies often focus on aspects such as detectability or general unpleasantness, the objective of this paper is to explore a broad perceptual spectrum of vehicle scenes. The literature offers a structural framework for investigating the characteristics of 3D playback systems, which guided the selection of attributes in this study [
7]. Using different perceptual dimensions, the chosen attribute groups included source location, timbral balance, loudness, room-related attributes, and naturalness (or a realistic feeling). These attributes were selected based on their intuitive rating potential when evaluating driving-by-scenes. Additionally, an extra attribute related to velocity perception was incorporated, considering its critical role in safety-relevant investigations. Table 3 lists the attribute groups and the specific attributes.
Participants were prompted to assess these attributes, utilizing a rating scale ranging from 0 to 100, as shown in
Figure 3.
2.6.2. Playback System
To ensure that differences in the environments would not affect the results, we played back the processed recordings via headphones (HD650, Sennheiser, Wedemark, Germany).
2.6.3. Procedure
Participants were instructed to assume a comfortable seating position before participating in the experiment. Prior to the actual rating tasks, they received general information about the experimental procedures and were given the opportunity to familiarize themselves with the perceptual attributes. During this phase, participants were encouraged to ask questions and clarify any uncertainties regarding the attributes presented. Once ready, participants put on their headphones, and the experiment unfolded in two distinct parts, each comprising 30 stimuli (the 30 unique scenes) and focusing on three perceptual attributes. In the first part, participants evaluated the attributes dark–bright, realistic–unrealistic, and soft–loud. The second part encompassed the assessment of all remaining attributes, as detailed in
Table 3. Before each experimental segment, participants engaged in a training session featuring eight scenes from the thirty possible scenes. The training sessions incorporated scenes from each playback system, including both the quietest and loudest stimuli, to expose participants to a wide range of scene attributes. While the results of these training trials did not contribute to the final analysis, they served the purpose of acquainting participants with the rating scale and the diverse characteristics of the scenes. Following the training phase, the main experiment commenced and spanned approximately 35 min. The experimental structure is shown in
Figure 4.
2.6.4. Participants
A total of 24 participants (18 male, 6 female) voluntarily participated in the listening study. The participants’ ages ranged from 19 to 61 years (M = 30 years; SD = 10 years) and all were self-reported normal-hearing listeners. This experiment was approved by the ethics committee of the TU Dresden, with the number SR-EK-68022020.
4. Discussion
Given that the majority of safety-relevant experiments in the field of automotive development are conducted in simulation environments, a variety of laboratory approaches have emerged as the prevailing methodologies. The present study compares and contrasts the characteristics of two simulation toolboxes (TASCAR and the TUD MATLAB toolbox) and two playback systems, namely, HOA-VBAP and WFS. In addition, the test-track and virtual dummy-head recordings were considered as well. The ideal scenario would be to see no differences in the perception of audio reproduction between the different laboratory environments. In fact, analyses of model-based psychoacoustic parameters such as sharpness and listeners’ ratings of various sound attributes showed a high degree of similarity between the reproduction approaches. However, we also found some significant differences between the reproduction systems for certain attribute ratings. Before playing the test-track scenes back over the Head labO2 interface with the DF playback equalizer, an individual DF filter for the Brüel & Kjaer 4100D dummy head was used to process the signal, as had already been done for the signals recorded with the Head HSU III.3 dummy head. Despite the use of analogous processing techniques to minimize discrepancies between the recordings of the two dummy heads, it was not possible to eliminate them entirely. Notably, the sharpness of the test-track recordings was higher compared to that of the recordings in the laboratories (see Table 2). This underscores the fact that high-frequency content is considerably more dominant in the test-track scenes. The 16ch HOA system showed results similar to the WFS system, whereas the 41ch HOA-VBAP system was perceived as somewhat darker, presumably due to the additional subwoofer. The ICEV scenes were generally perceived as darker with decreasing velocity. One potential explanation for this phenomenon is the engine orders of the combustion engine: as the rotational speed decreases, the frequencies of the engine orders decrease as well. This is particularly noticeable in the 41ch HOA-VBAP system, which reproduces this low-frequency content with greater prominence. Psychoacoustic metrics, such as sharpness, underline the differences in the timbral balance of the different reproduction methods and the test-track recordings. This suggests that in future recording-based simulations, it could be useful to compare the frequency spectrum recorded at the listener’s position in the laboratories with the spectrum of actual test-track dummy-head recordings. Such a step would ensure that the simulation closely mirrors the real traffic scenario. However, precise comparisons are only possible when the ambient noise level during the test-track dummy-head recordings is very low and the days are windless; both criteria are quite difficult to meet. In fact, in the present analyses, the test-track dummy-head recordings contained significant ambient noise and were affected by wind, which likely affected all psychoacoustic metrics as well as the listening test results. Thus, it is important to treat these results with caution and view them as indicative rather than definitive. The use of two different dummy-head models for the real test-track recordings and the laboratory recordings introduced additional deviations.
In addition to that, the observed differences can probably be attributed to the differing loudspeaker configurations, circular setups, and reproduction methods employed in each laboratory. These variations can lead to slight differences in the transfer functions of the sounds, subsequently influencing perception, as already investigated in previous playback method comparisons [
6,
9]. Additionally, the WFS system utilizes a unique filtering method developed to balance its frequency response against an ideal pink noise reference. This method, based on third-octave band filter magnitudes, produces a filter curve with limited accuracy, potentially causing minor differences in the reproduced signal. When examining the dynamic perception of stimuli, these findings align with known issues from previous research. A prior study revealed that TTCs were underestimated in the WFS system compared to the 16-channel HOA system, despite the use of identical stimuli [
34]. An erroneous level calibration was suggested as a possible explanation. However, in the present study, special attention was given to the detailed calibration of the systems. By matching the maximum A-weighted levels and examining small deviations, efforts were made to ensure reproduction accuracy. Nevertheless, the velocity perception results for the ICEV scenes indicated a similar trend: participants tended to perceive approaching vehicles as faster in the WFS than in the 16ch HOA approach. Why this discrepancy occurs cannot be conclusively explained. Previous investigations have already revealed that there are differences between sound reproductions within different playback methods, especially in coloration [
6,
9]. As such differences also occur between the laboratories in Mainz and Dresden, future studies should explore whether they have a general impact on safety-relevant decisions made by pedestrians.
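The level calibration described above, matching maximum A-weighted levels between systems, can be sketched as follows. This is a minimal illustration under stated assumptions: frequency-domain A-weighting via the standard IEC 61672 magnitude formula and a sliding-RMS maximum as a stand-in for "fast" time weighting; it is not the exact calibration procedure used in the study, and the function names are hypothetical.

```python
import numpy as np

def a_weight_db(f):
    """A-weighting magnitude in dB for frequencies f (Hz), per IEC 61672."""
    f2 = np.asarray(f, dtype=float) ** 2
    ra = (12194.0**2 * f2**2) / (
        (f2 + 20.6**2)
        * np.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194.0**2)
    )
    with np.errstate(divide="ignore"):
        # +2.0 dB normalizes the curve to 0 dB at 1 kHz
        return 20 * np.log10(ra) + 2.0

def max_a_level(x, fs, win=0.125):
    """Maximum A-weighted level (dB re full scale) using a sliding RMS
    window of `win` seconds, roughly mimicking 'fast' time weighting."""
    X = np.fft.rfft(x)
    w = 10 ** (a_weight_db(np.fft.rfftfreq(len(x), 1.0 / fs)) / 20)
    xa = np.fft.irfft(X * w, n=len(x))  # A-weighted time signal
    n = max(1, int(win * fs))
    p = np.convolve(xa**2, np.ones(n) / n, mode="valid")  # sliding mean power
    return 10 * np.log10(p.max() + 1e-20)

def match_gain(stimulus, fs, target_db):
    """Scale factor that brings the stimulus' maximum A-weighted level to
    target_db -- a simple playback-level calibration step."""
    return 10 ** ((target_db - max_a_level(stimulus, fs)) / 20)
```

Applying the returned gain to a stimulus equalizes its maximum A-weighted level across playback systems, while leaving spectral differences between the systems untouched.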
Noticeably, at higher velocities, test-track scenes were rated as more realistic and slightly more open compared to their simulated counterparts. The most likely explanation is that ambient background noise (birds, aircraft noise, distant traffic, etc.) was present in the test-track recordings but not in the simulated vehicle approaches. One potential improvement would be to record the ambient noise of the real environment and play it back alongside the reproduced vehicle scenes to better approximate real-world conditions.
While all of these attributes could be rated satisfactorily, participants tended to have problems rating the room-related attributes: their rating values mostly clustered around 50%, indicating that they did not distinctly choose either extreme. The attribute “limited–open” at least showed some tendency, e.g., the test-track recordings were rated as slightly more open in the vehicle scenes, whereas the “sounds bigger” attribute had larger 95% confidence intervals and was mainly rated in the 40 to 60 percent range. Neither of these attribute ratings was significantly influenced by the different playback methods, so the results are more neutral rather than showing actual tendencies in localization ability or room impression. The laboratories do differ in their reproduction setups, and having listened carefully to the drive-by scenes, it was expected that these attributes would vary between the playback systems. As this was not the case, future work should focus on selecting better attributes for investigating spatial perception in audio reproduction methods.
Beyond the comparison of the general attributes, the analysis of loudness ratings also revealed interesting tendencies. Despite equal A-weighted levels of the vehicle scenes, participants perceived the EV scenes at a constant velocity of 20 km/h as quieter than the corresponding ICEV scenes. Rating the loudness of electric-vehicle scenes may have been more challenging for participants, as these vehicles do not emit the typical engine sounds that pedestrians are accustomed to in traffic dominated by combustion-engine vehicles. This finding is consistent across playback systems, emphasizing the challenge pedestrians face in perceiving electric vehicles compared to conventional combustion-engine vehicles. The lower perceived loudness of the EVs observed in the present study might imply an impaired perception of electric vehicles. Other studies, such as [
35,
36], underline this assumption. However, although current regulations and guidelines for AVAS warning sounds focus on specific noise levels [
37,
38], this study demonstrates that the participants rated the EV scenes as being quieter. Therefore, when developing AVAS sounds, their detectability must be ensured. As Bazilinskyy et al. [
39] suggest, one potential method for making a vehicle easier to recognize is to increase the loudness of its sound. This would be feasible and would ensure perceptually comparable sound levels of EVs relative to combustion-engine vehicles. Furthermore, it could enhance pedestrian safety and address the auditory challenges posed by the increasing prevalence of electric vehicles.
Finally, it is important to note that despite the moderate differences in perceived sound attributes between the playback systems observed in the present study, this does not automatically imply that, e.g., time-to-collision estimation [
24] or street-crossing decisions [
25] can be expected to differ between the systems. For instance, even when the vehicle sounds are perceived as somewhat brighter in one system than in another, participants could be equally able to gather information about the motion of the approaching vehicle from the dynamic changes in loudness or azimuthal angle provided by the auditory stimulus [
40]. Thus, additional research is required to investigate whether behavioral tasks such as TTC estimation or street-crossing decisions also show differences between the playback systems.
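The point that motion information survives in the dynamic loudness cue can be illustrated with a simple “acoustic tau” model: for a point source approaching at constant speed on a direct course, intensity grows as 1/r(t)², with r(t) = v·(TTC − t), so the remaining time to collision equals 2·I/(dI/dt) independently of the source level. The following sketch is an illustrative model only, not the TTC-estimation task investigated in the cited studies.

```python
import numpy as np

def acoustic_ttc(intensity, fs):
    """Estimate remaining time-to-collision from rising sound intensity.

    For a point source on a direct collision course at constant speed,
    I(t) ~ 1/(v*(TTC - t))**2, hence TTC - t = 2 * I / (dI/dt).
    """
    di = np.gradient(intensity, 1.0 / fs)  # numerical time derivative
    return 2.0 * intensity / di

# Synthetic approach: collision at t = 5 s, observed for the first 4 s.
fs = 100
t = np.arange(0, 4.0, 1.0 / fs)
ttc_true = 5.0
intensity = 1.0 / (ttc_true - t) ** 2  # source level and speed cancel out
est = acoustic_ttc(intensity, fs)      # est[i] tracks ttc_true - t[i]
```

Because the source level cancels in the ratio I/(dI/dt), such a cue is available even when two playback systems differ in coloration, which motivates testing behavioral tasks directly rather than inferring them from timbre ratings.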