Article

Quality Comparison of Dynamic Auditory Virtual-Reality Simulation Approaches of Approaching Vehicles Regarding Perceptual Behavior and Psychoacoustic Values

by Jonas Krautwurm 1,*, Daniel Oberfeld-Twistel 2, Thirsa Huisman 2, Maria Mareen Maravich 1 and Ercan Altinsoy 1

1 Chair of Acoustics and Haptics, Faculty of Electrical and Computer Engineering, TUD Dresden University of Technology, Helmholtzstraße 18, 01069 Dresden, Germany
2 Section Experimental Psychology, Institute of Psychology, Johannes Gutenberg-Universität Mainz, Wallstrasse 3, 55122 Mainz, Germany
* Author to whom correspondence should be addressed.
Acoustics 2025, 7(1), 7; https://doi.org/10.3390/acoustics7010007
Submission received: 18 December 2024 / Revised: 6 January 2025 / Accepted: 24 January 2025 / Published: 8 February 2025

Abstract
Traffic safety experiments are often conducted in virtual environments in order to avoid dangerous situations and to reduce costs. This makes the fidelity of the traffic scenario reproduction important, because the pedestrians’ judgments have to be close to reality. To better understand behavior in relation to the prevailing audio rendering systems, a listening test was conducted which focused on perceptual differences between simulation and playback methods. Six vehicle driving-by scenes were presented using two different simulation methods and three different playback methods; binaural recordings from the test track, acquired during the recording of the vehicle sound sources for the simulation, were additionally incorporated. Each scene was characterized by a different vehicle type and speed. Participants rated six attributes from the perceptual dimensions “timbral balance”, “naturalness”, “room-related”, “source localization”, “loudness”, and “speed perception”. While the sound-attribute ratings were highly similar across the reproduction systems, there were minor differences in the speed and loudness estimates, and differences in perceived brightness stood out. A comparison of the loudness ratings for scenes featuring electric versus combustion-engine vehicles highlights the reduced detectability of the former.

1. Introduction

For safe navigation through our environment, the human ability to perceive auditory cues is essential, especially in situations where potential collisions with objects, such as vehicles, must be avoided. For instance, pedestrians rely on their sense of hearing to detect approaching vehicles, particularly when the road and traffic are not clearly visible. This reliance on auditory information is even more important for visually impaired persons and is also of note in the context of electric-car development [1], since the increasingly unconventional sounds of electric vehicles pose a unique challenge to pedestrian safety. Conducting safety-relevant studies in a naturalistic setting would pose safety risks, and field experiments are costly. Moreover, parameters such as vehicle type or environmental conditions cannot easily be controlled. For these reasons, many traffic studies have been transferred to complex virtual-reality laboratories incorporating different auditory playback setups.

1.1. Overview of Modern Audio Playback Systems

Stereophony is widely used and well established as a playback method because of its simple setup [2]. Its disadvantages include the limited spatial listening area (sweet spot) and localization blur. To create a more complex reproduction, loudspeaker arrays with different sound synthesis techniques, such as Wave Field Synthesis (WFS) and Higher Order Ambisonics (HOA), can be used [3]. These two playback systems have long been seen as distinct sound spatialization methods, but recent insights [4] suggest that they share several characteristics: both rely on loudspeaker arrays that aim to physically recreate the primary sound field, and both offer exact solutions to the acoustic wave equation. However, the different reproduction techniques (spherical harmonic expansion for HOA; the Kirchhoff–Helmholtz integral for WFS) also lead to several differences. The frequency above which spatial aliasing occurs, a key issue in sound field encoding, is calculated from the speed of sound $c_{\mathrm{s}}$ and the spacing $d_{\mathrm{speaker}}$ between adjacent speakers:
$$f_{\mathrm{sp}} = \frac{c_{\mathrm{s}}}{d_{\mathrm{speaker}}}$$
Especially in WFS, spatial aliasing can occur if the wavelengths of the sampled waves are smaller than the speaker spacing. Above this cut-off frequency, inaccuracies may occur in the audio reproduction, so certain spatial details may be misrepresented or lost during the encoding process. WFS systems aim to reproduce the sound accurately over the entire listening area within the array, at least up to the aliasing frequency. In contrast, HOA systems use spherical harmonic functions to encode the sound field, which allows for a more accurate representation of spatial information, including in higher frequency ranges. Due to the reliance on spherical harmonics, however, the sound field is reproduced in a spatially correct manner only within a limited area, which shrinks at higher frequencies, so the listener has to be located in the center of the loudspeaker array.
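As a numerical illustration of this relation, the following minimal sketch evaluates the aliasing frequency for a few generic speaker spacings (example values, not a description of the specific arrays used later in this study):

```python
# Minimal sketch: spatial aliasing frequency f_sp = c_s / d_speaker
# for a few generic loudspeaker spacings.
C_S = 343.0  # speed of sound in air at ~20 degrees C, in m/s

def aliasing_frequency(spacing_m: float) -> float:
    """Frequency (Hz) above which spatial aliasing can occur."""
    return C_S / spacing_m

for d in (0.06, 0.10, 0.20):  # speaker spacings in meters
    print(f"d = {d:.2f} m -> f_sp ~ {aliasing_frequency(d):.0f} Hz")
```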

1.2. Related Work on Quality Perception in Selected Playback Environments

To ensure similar results when conducting safety-relevant experiments in different playback environments, the vehicle scenes depicted should be perceived in a similar manner. Therefore, it is important to evaluate the playback quality of 3D audio rendering systems. Several studies have investigated the subjective differences and highlighted the importance of further studies on perceptions associated with both audio rendering approaches [5,6]. Nicol et al. [7] highlighted the existence of perceptual differences between playback setups and real-world environments. It is therefore essential to comprehend how participants perceive the sounds in these different setups. Singh et al. [8] compared the auditory perception of real vehicles to the perception of vehicles simulated in a virtual-reality environment. The participants were first exposed to the vehicles in a real-world setting and completed a paper-based questionnaire in which they evaluated various attributes. These subjective attributes were categorized as “detection distance”, “recognizability”, and “detectability”, in addition to general impressions of the sound of the vehicle. Two months after the experiment in the real environment, the participants experienced the same driving-by stimuli in a simulated scene in the laboratory. The audio simulation used vector-based amplitude panning (VBAP) to reproduce the vehicle sound over eight speakers. The participants again completed the same questionnaire, rating the various attributes of the sound. They evaluated the VR sounds as being similar to the real-world scenarios, but tended to recognize the EV later in the laboratory. As further studies [9,10] have demonstrated using a number of different sound stimuli and various playback methods (e.g., headphones or loudspeaker arrays of various dimensionality, including 1D, 2D, and 3D), the perception of sound differs between playback methods. These differences were most pronounced in a spatial context, but they also occurred in terms of coloration and timbre; e.g., the sounds in the 3D configuration were perceived as more muffled. In addition to assessments of the timbral attributes of artificial sounds such as noise and speech, the perceived localization blur was also evaluated by listeners in the investigations of Wierstorf [11]. Kaplanis et al. [12] underline the importance of room treatment in sound reproduction environments: even subtle differences in reverberation time influenced the timbral and spatial perceptions of the participants.

1.3. Aims of the Present Study

As demonstrated by the studies described in the preceding paragraph, preliminary investigations of perceived quality attributes in selected playback environments have been conducted, yielding perception-based results for a limited number of quality attributes. To explore quality comparisons of auditory playback systems in greater depth, the present manuscript addresses not only perceptual attributes but also physical and psychoacoustic values, for the purpose of a subsequent comparison. As a first step, the research concentrated mainly on these values to provide an overview of the similarities and differences. The present investigation focused on an HOA-VBAP setting and a modern WFS system with a small speaker spacing, as they represent modern playback environments; the relevant question is whether the different playback systems reproduce the sounds similarly, or whether there are differences in the physical and psychoacoustic parameters. Several relevant psychoacoustic parameters, calculated from recordings made of the different simulation and playback methods and based on hearing models, were selected. In the next step, it was crucial to determine whether the actual perception in the different laboratories differed. To understand the perceptual behavior of subjects in the playback environments, a listening test was conducted. Attributes from different perceptual dimensions were chosen: loudness-, timbre-, and room-related attributes were taken from studies and regulations such as those described in [13,14]. Besides these, parameters that influence safety-relevant decisions when crossing a street were also taken into account, since previous studies have shown that pedestrians’ speed estimates are connected to street-crossing decisions and deliver relevant information [15,16]. As explained above, ideal playback environments should present the driving-by scenes without affecting the perception of these scenes. We expected no significant differences between (1) the psychoacoustic and physical values associated with the recorded scenes in the different playback environments and (2) the ratings of the chosen perceptual attributes. This study offers a more comprehensive understanding of the effectiveness of these advanced audio playback systems.

2. Materials and Methods

To obtain subjective evaluations from participants, listening tests were conducted. This involved simulating driving scenes in laboratories using a recording-based simulation approach. To prepare stimuli for the listening test, driving-by scenes were recorded in two simulation environments using a dummy-head setup (details below). Binaural audio recording allows for realistic playback in later listening experiments. To create an experimental setup allowing for the comparison of different laboratories, headphones were used for playback.

2.1. Laboratories

As described in Section 1.1, playback environments such as HOA and WFS systems are able to reproduce distinct sound spatialization, including complex sound fields from real environments [4]. With smaller speaker spacing, the aliasing frequency of a WFS system increases, extending unbiased reproduction to a higher frequency range. For this reason, the distances between the speakers in the WFS system at TUD are set as small as possible. A higher-order HOA system uses more spherical harmonic functions; the reproduced sound field then approximates the target sound field over a broader area around the listener’s position. With the 15th-order system at JGU Mainz, a detailed sound field is reproduced in the middle of the loudspeaker array.
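A common rule of thumb, stated here as an assumption rather than a claim from the cited references, is that an N-th-order Ambisonics system reproduces the sound field accurately within a radius r of the array center for which kr ≤ N, where k = 2πf/c_s is the wavenumber. A minimal sketch of what this implies for a 15th-order system:

```python
import math

C_S = 343.0  # speed of sound, m/s

def sweet_spot_radius(order: int, freq_hz: float) -> float:
    """Approximate radius (m) of accurate reproduction for an N-th-order
    Ambisonics system, using the rule of thumb k * r <= N."""
    k = 2.0 * math.pi * freq_hz / C_S  # wavenumber
    return order / k

# The radius shrinks with frequency, matching the observation that the
# spatially correct area is limited at higher frequencies.
for f in (500.0, 1000.0, 4000.0):
    print(f"order 15, {f:5.0f} Hz -> r ~ {sweet_spot_radius(15, f):.2f} m")
```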

2.1.1. Wave Field Synthesis (WFS) at TUD Dresden University of Technology

Perceived sound not only activates our sensations; it also acts as a carrier of environmental information. Virtual-reality technology offers real-time interaction with computer-generated environments, facilitating the flexible and cost-effective presentation of complex experimental scenarios. The Multi Modal Measurement Laboratory of TUD Dresden University of Technology [17] spans 24 square meters and is characterized by a rectangular shape, with walls inclined by 3° to minimize flutter echo. Acoustic treatments aligned with standards such as ITU-R BS.1116-1 and DIN 15996 [18,19] have been attached to the walls and ceiling. Perforated metal sheets, covering 20% of the area, along with strategically placed Helmholtz resonators in the corners, provide acoustic treatment over a wide frequency range. For audio reproduction, a WFS system from IOSONO was installed. The setup comprises 464 loudspeakers and 4 subwoofers, with each loudspeaker panel housing 6 tweeters and 2 mid-range speakers. To combat the spatial aliasing effect, the tweeters were placed 6 cm apart, raising the aliasing frequency to approximately 3000 Hz, i.e., into the higher frequency range. Each loudspeaker is driven individually, so in total 468 separate channels are handled. The control unit oversees signal processing, communication with audio servers, and routing systems handling internal and external connections. Eight rendering PCs, executing the WFS algorithm in real time, are housed in an acoustically isolated external server room, ensuring optimal performance. Calibration is a crucial step which ensures that scenes are played back at their original level, reproducing real traffic conditions accurately. To calibrate the system, pink noise from 125 Hz to 8000 Hz with a sound pressure level (SPL) of 80 dB is used. This pink noise is projected from a specific point positioned eight meters in front of the listener and captured at the listener’s position using a free-field microphone (capsule type 4188; preamp type 2671, Brüel & Kjaer, Copenhagen, Denmark). By comparing the recorded SPL with a reference calculated from the source distance and the principles of sound propagation, the level difference is identified. This facilitates adjustments within the IOSONO control unit to align the overall reproduced volume with the calculated reference. To fine-tune the frequency spectrum of the system, an equalizer application developed by Beyer et al. [20] is utilized. This application allows for the creation of individual filters to balance sound colorations that could arise from room acoustics and treatments such as the metal sheets. With pink noise as a reference sound, the filter magnitudes for each third-octave band can be adjusted manually by visually comparing Fast Fourier Transform (FFT) representations of the original stimuli and the recorded audio files. These filter coefficients are stored for subsequent use, ensuring the signal is appropriately filtered before playback via the WFS system.
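The level comparison in this calibration reduces to the distance law for a point source. A minimal sketch under a free-field assumption (the actual in-room propagation and the exact reference convention of the IOSONO unit may differ; the numbers are hypothetical):

```python
import math

def expected_spl(spl_ref_db: float, ref_dist_m: float, dist_m: float) -> float:
    """SPL expected at dist_m for a point source, given the SPL at
    ref_dist_m, assuming spherical spreading (-6 dB per distance doubling)."""
    return spl_ref_db - 20.0 * math.log10(dist_m / ref_dist_m)

# Hypothetical: pink noise at 80 dB SPL referenced to 1 m, virtual source
# 8 m in front of the listener, microphone reading 63.5 dB at the listener.
reference = expected_spl(80.0, 1.0, 8.0)  # ~61.9 dB
measured = 63.5
correction_db = reference - measured      # gain offset for the control unit
print(f"reference {reference:.1f} dB, measured {measured:.1f} dB, "
      f"correction {correction_db:+.1f} dB")
```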

2.1.2. Combination of Higher Order Ambisonics (HOA) and 3D VBAP at Johannes Gutenberg University Mainz

In Mainz, a combination of HOA [3] and VBAP [21] was used. The direct sound and the reflections from the ground surface and house fronts were rendered separately. For the direct sound, we used 15th-order 2D Ambisonics with max-rE decoding [22,23], played back via 32 Genelec 8020 DPM loudspeakers arranged in a circle with a 4.6 m diameter (spaced 11.25 degrees apart); these were positioned approximately at the participants’ ear height (speaker height 163 cm above the floor). Additionally, a subwoofer was used (Genelec 7360 APM; crossover frequency 62.5 Hz). To simulate the sound reflections from the ground surface and other acoustically reflective surfaces such as house fronts, we used 3D VBAP [21] with the full loudspeaker array, containing the aforementioned 32 loudspeakers at ear height, an additional lower ring of 8 Genelec 8020 DPM speakers (spaced 45 degrees apart and angled towards the listener’s head; ring diameter 4.6 m, speaker height 87 cm above the floor), and the subwoofer. The array was driven by daisy-chained Ferrofish A32 Pro (24-bit audio resolution, fs = 44.1 kHz) and Ferrofish Pulse 16 (24-bit audio resolution, fs = 44.1 kHz) audio converters. The Ferrofish A32 Pro received audio signals via 64-channel MADI from an RME HDSPe MADI audio interface on a computer running TASCAR on Linux. The 32 loudspeakers of the ear-height ring were driven by the Ferrofish A32 Pro; the 8 loudspeakers of the lower ring and the subwoofer were driven by the Ferrofish Pulse 16. For the present experiment, additional dummy-head recordings were made in a configuration in which both the direct and reflected sound were rendered on a subset of 16 equally spaced loudspeakers in the circular array at ear height. A similar 16-speaker setup had been used in earlier experiments in Mainz [24,25,26,27].
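The principle behind the VBAP rendering of the reflections is easiest to see in the two-dimensional pairwise case from Pulkki [21]: the gains of the two loudspeakers enclosing the source direction are obtained by inverting the matrix of their direction vectors. The following is an illustrative sketch of that 2D case only; a real 3D implementation additionally searches for the enclosing loudspeaker triplet.

```python
import numpy as np

def vbap_2d_gains(source_az_deg: float, spk1_az_deg: float,
                  spk2_az_deg: float) -> np.ndarray:
    """Pairwise 2D VBAP gains following Pulkki's formulation
    g = p @ inv(L), normalized to unit energy."""
    def unit(az_deg: float) -> np.ndarray:
        az = np.radians(az_deg)
        return np.array([np.cos(az), np.sin(az)])

    L = np.vstack([unit(spk1_az_deg), unit(spk2_az_deg)])  # speaker vectors as rows
    p = unit(source_az_deg)                                # source direction
    g = p @ np.linalg.inv(L)
    return g / np.linalg.norm(g)                           # energy normalization

# Example: a source at 20 degrees, panned between two neighbors of the
# ear-height ring (11.25-degree spacing), here the speakers at 11.25/22.5 deg.
print(vbap_2d_gains(20.0, 11.25, 22.5))
```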
The loudspeaker array was located on one side of a large lab space (15.00 m × 7.05 m). To reduce interference from acoustic reflections, the laboratory area containing the loudspeaker array (8.44 m × 7.05 m) was sound-treated. It was separated from the other side of the lab space by sound-absorbing acoustic curtains (Gerriets Bühnenvelours Ascona 570; 570 g/m²; absorption coefficient of 0.95 at frequencies above 400 Hz). A 20 cm thick layer of Basotect G+ (BASF; absorption coefficient of 0.95 at frequencies above 125 Hz) was attached to the walls and ceiling. To reduce reflections from the floor, a high-pile carpet (IKEA Stoense) was placed inside the array, on top of a 7 mm layer of felt. In addition, 10 cm thick Basotect G+ panels (BASF; absorption coefficient of 0.95 at frequencies above 400 Hz) were added on top of the carpet.
The TASCAR Speaker Calibration Tool [28], together with a sound level meter (Nor131, Norsonic, Oelde, Germany) and a free-field microphone (MP40, Roga, Gotha, Germany) positioned at the center of the loudspeaker array at a height of 165 cm above the floor, was used for the calibration procedure. This compensated for level and spectral differences between the loudspeakers and ensured calibrated sound pressure levels for both point-source and diffuse sound field scenarios.

2.2. Scene Creation

2.2.1. Sub-Source Recordings in a Real Environment

For the recording-based simulations, acoustic recordings were made on an asphalt test track at the Technical University of Darmstadt by Oberfeld et al. [24]; these recordings were used in the experiment. Recordings had been made using two small Kia passenger cars: an internal-combustion-engine vehicle (ICEV), a gasoline-powered Kia Rio 1.0 T-GDI (2019, 1.0 L, 88 kW, 3 cylinders) with manual transmission and Continental summer tires (ContiSportContact 5, 205/45 R17), and an electric vehicle (EV), a Kia e-Niro (2019, 150 kW) with Michelin summer tires (Primacy 3, 215/55 R17) and an Acoustic Vehicle Alerting System (AVAS), which was active at speeds of up to 28 km/h. To capture tire–road noise and powertrain noise, four free-field microphones (MI-17, Roga, Gotha, Germany) were mounted on the two front tires, the right rear tire, and centrally on the engine hood. A GPS antenna (Trimble AG25) was installed centrally on the vehicle’s roof and connected to a high-performance GPS receiver (JAVAD Triumph LS, recording rate 10 Hz) inside the vehicle. Using the Real-Time Kinematic (RTK) method, the GPS position of the vehicle on the test track was recorded with a precision of a few centimeters [29]. This high precision is achieved by evaluating the carrier phase of the satellite signals, processing signals from at least 5 satellites, and matching the data from the mobile receiver with data from a stationary reference station. We used a reference station provided by the Hessian State Office for Land Management and Geoinformation within the framework of SAPOS-GPPS (https://sapos.hvbg.hessen.de/), located about 6 km from the test track. A dummy head (4100D, Brüel & Kjaer, Copenhagen, Denmark) was placed 0.5 m from the side of the road at a height of one meter to record the real driving-by sounds in the real test-track environment. Recordings from this dummy head, corresponding to exactly the same pass-by trials as the source recordings used as input for the simulation systems, were presented as an additional condition in this experiment.

2.2.2. Processing of the Scenes for the VR Approach

With the sub-source recordings, it is possible to transfer traffic scenarios to a virtual environment (VE). The TASCAR (v0.230) software application [28] enables dynamic processing of the acoustic scene geometry, allowing for the placement of sound sources in a time-dependent manner. At each time step, TASCAR models the sound propagation between sources and receivers and thus provides physically plausible simulations of various acoustic phenomena, including the directional characteristics of sound sources, distance-dependent changes in sound level due to spherical spreading and air absorption, and the time-varying sound travel time (which may induce Doppler effects). Moreover, TASCAR simulates sound reflections from surfaces such as the ground, utilizing the image source method [30]; it models the time-variant comb-filter effects resulting from acoustical interference between reflected and direct sound. The acoustic effects detailed above, including the test-track reflections, were configured in TASCAR. The recorded sound sources from the test track were positioned in the VE based on the distances between the microphones in the real-world setting [24]. At JGU Mainz, TASCAR was used for the dynamic vehicle simulations and renderings.
At TUD, the IOSONO system cannot be controlled by TASCAR; this necessitated the development of a MATLAB toolbox [20]. This toolbox processes TASCAR scene definition files and, using the same calculations as TASCAR (documented in [28]), runs the same vehicle simulations, but with an output that can be used by the IOSONO system.
For all vehicle recordings, the GPS time series and position data are stored in a .csv file, capturing dynamic information about the position of the vehicle. Furthermore, four .wav files containing audio recordings of the discrete sub-sources, in this case the tire and engine-hood sounds, are available in the same directory. Each sub-source in the .tsc file (a special file format used by TASCAR) contains key parameters such as calibration level, positional offset along the vehicle, and microphone characteristics. In MATLAB, the .wav files for each sub-source were loaded and stored as discrete audio objects. Parameters for environmental properties, such as reflections, were configured by mirroring the sub-sources: a rectangular reflector was defined with a specific width, height, damping value, and reflectivity. To emulate realistic sound propagation, each sub-source was virtually mirrored across the reflective surface, and a first-order low-pass filter was applied to approximate acoustic reflection effects, in accordance with the methodology outlined by Grimm et al. [28]. Subsequently, GPS coordinates and time vectors from the .csv file were transferred into a structured MATLAB format, facilitating precise temporal alignment within the simulation. These elements, along with the combined sounds from the reflector and sub-sources, were compiled into a vehicle object. In the next phase, individual start and stop times were established, and trajectory and sound data were cut accordingly. Following this, room-specific acoustic adjustments were simulated by applying the predefined room equalization filter curve from the WFS laboratory, as described earlier. Calculations were also performed to determine distance delays for each sub-source sound, considering the exact spatial distance between each source and the listener, with the total distance computed as the vector sum of the longitudinal and diagonal components. Finally, to model air damping, equations from Grimm et al. [28] were applied to the audio signal, simulating natural attenuation over distance. The processed audio frames, each associated with their exact trajectory coordinates, were stored in a MATLAB struct, ready for playback on the IOSONO system at TUD, which provided the spatialized auditory experience of the simulation. The individual sources were played back as point sources. In general, this toolbox allows for the adjustment of parameters for acoustic effects, position data, and sound sources within vehicle scenes; its functionality is modeled on that of the TASCAR toolbox.
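A greatly simplified sketch of the geometry handling that this description implies is shown below: a ground-plane image source, a per-source distance delay with 1/r spreading, and a stand-in first-order low-pass for the reflection. All coordinates and filter values are hypothetical; the actual toolbox follows the TASCAR calculations in [28].

```python
import numpy as np

C_S = 343.0  # speed of sound, m/s

def mirror_across_ground(pos_xyz: np.ndarray) -> np.ndarray:
    """Image source for a reflection at the ground plane z = 0:
    the height coordinate is negated."""
    return pos_xyz * np.array([1.0, 1.0, -1.0])

def propagation(src: np.ndarray, rcv: np.ndarray, fs: float):
    """Distance-dependent delay (samples) and 1/r amplitude factor
    for one source-receiver pair."""
    dist = float(np.linalg.norm(rcv - src))
    delay_samples = int(round(dist / C_S * fs))
    gain = 1.0 / max(dist, 1.0)  # spherical spreading, clipped near the source
    return delay_samples, gain

def first_order_lowpass(x: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """One-pole low-pass standing in for the reflection filter."""
    y = np.empty_like(x)
    acc = 0.0
    for i, sample in enumerate(x):
        acc += alpha * (sample - acc)
        y[i] = acc
    return y

# Hypothetical geometry: sub-source 3 m away on the road at 0.3 m height,
# listener 0.5 m from the roadside at 1 m height, fs = 44.1 kHz.
fs = 44100.0
src = np.array([0.0, 3.0, 0.3])
rcv = np.array([0.0, 0.5, 1.0])
img = mirror_across_ground(src)   # reflection path via the ground
print(propagation(src, rcv, fs))  # direct path
print(propagation(img, rcv, fs))  # reflected path (longer, quieter)
```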

2.3. Recording Setup in Laboratories

In the listening experiment, participants evaluated the quality of the various playback methods in a single experimental session, with recordings played back via headphones. For this, recordings were made of the acoustic simulations of the driving scenes. Binaural recordings were made using a dummy head (HSU III.3, HEAD acoustics, Herzogenrath, Germany) equipped with a shoulder unit. It was positioned at the height of the loudspeaker arrays, with the shoulder unit aligned perpendicular to the direction of travel of the road in the virtual scene. The dummy head, equipped with two ICP condenser microphones and featuring an individual pinna design, was connected to the labHSU interface from HEAD acoustics. This setup allowed for the application of various equalization curves to process the binaural signal, reducing effects such as reflections from the shoulder section; the resulting filtered signal can then be treated as a measurement signal for later analysis. Different equalization filter curves, including diffuse field (DF), free field (FF), and independent of direction (ID), are available. As the laboratories are not anechoic chambers and do not represent free-field conditions, the DF filter was employed. Recorded signals were captured using the ArtemiS Suite. It is crucial to precisely determine when the vehicle passes the reference point at the listener’s position. At JGU Mainz, a sinusoidal marker signal was recorded to indicate this moment through an impulse. At TUD, the scenes were recorded in a simulated environment, with a 15 s interval before and a 5 s interval after the reference point; the scenes were recorded one after another in a single cumulative recording, and the individual scenes were subsequently extracted using time windows of 20 s. This process may have introduced minor deviations. With this setup and these editing techniques, the driving-by stimuli were recorded in both laboratories.
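The extraction of individual scenes from the cumulative TUD recording amounts to simple window cutting around each reference time. A minimal sketch (the actual editing was done in ArtemiS; the reference times here are hypothetical):

```python
import numpy as np

def extract_scene(recording: np.ndarray, fs: float, t_ref_s: float,
                  pre_s: float = 15.0, post_s: float = 5.0) -> np.ndarray:
    """Cut one 20 s drive-by window (15 s before / 5 s after the moment
    the vehicle passes the listener position) out of a long recording
    with shape (channels, samples)."""
    start = int(round((t_ref_s - pre_s) * fs))
    stop = int(round((t_ref_s + post_s) * fs))
    return recording[:, start:stop]

fs = 44100.0
cumulative = np.zeros((2, int(300 * fs)))            # placeholder 5 min recording
scene = extract_scene(cumulative, fs, t_ref_s=42.0)  # hypothetical pass-by time
print(scene.shape)  # (2, 882000) -> 20 s per scene
```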
In addition to the recordings made inside the speaker arrays with the HEAD acoustics dummy head, the experiment incorporated recordings from a virtual dummy-head receiver rendered in TASCAR. This receiver simulates a binaural recording and is based on the Spherical Head Model (SHM): the head is modeled as a rigid sphere, which captures effects such as head shadow and interaural time differences. The height of the receiver was set to one meter to match the position of the dummy head placed on the test track. The listener was placed 0.5 m from the side of the road, with the shoulders oriented orthogonally to the road’s direction. Throughout this study, a reference point at the listener’s position is used: the vehicle passes this reference point, marked as a red line in Figure 1, at the height of the listener.
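Under the rigid-sphere assumption, the interaural time difference has a convenient closed-form approximation. The sketch below uses the classic Woodworth formula, chosen here for illustration; TASCAR’s actual SHM receiver is documented in [28] and may differ in detail.

```python
import math

def woodworth_itd(azimuth_deg: float, head_radius_m: float = 0.0875) -> float:
    """Interaural time difference (s) for a rigid spherical head, using the
    Woodworth approximation ITD = (a / c) * (sin(theta) + theta) for source
    azimuths between -90 and +90 degrees."""
    c = 343.0  # speed of sound, m/s
    theta = math.radians(azimuth_deg)
    return head_radius_m / c * (math.sin(theta) + theta)

# Example: a vehicle 45 degrees to the side of the listener.
print(f"ITD at 45 deg: {woodworth_itd(45.0) * 1e6:.0f} microseconds")  # ~381 us
```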

2.4. Preparing Listening Test Scenes

For the listening test, dummy-head recordings were used. They represent the played-back driving-by scenes under five different recording conditions: (1) dummy-head recordings made using the 41-channel HOA-VBAP rendering at JGU Mainz, (2) dummy-head recordings made using the reduced 16-speaker HOA rendering at JGU Mainz, (3) dummy-head recordings made in the Wave Field Synthesis environment at TUD, (4) virtual recordings from the binaural receiver in TASCAR (i.e., purely virtual), and (5) the original dummy-head recordings made on the test track with real vehicles. These original dummy-head recordings were made synchronously with the microphone-on-chassis recordings that were used for the virtual vehicle renderings. During the original recording sessions, various background noises, such as airplanes and loud bird calls, were present, so only scenes devoid of disruptive noises were selected. Also, during most recording sessions there was a significant amount of wind, which likely affected the overall level of the vehicle sound arriving at the dummy head on the test track and might also have induced level fluctuations due to changing wind speed or wind direction during an approach. Covering a wide range of driving-by sounds and dynamics, scenes of an electric vehicle and a combustion-engine vehicle at different constant speeds were selected. Because the disruptive noises forced this pre-selection, only a limited number of scenes could be used for the subsequent listening test; as a consequence, the velocity parameter was not identical for the two vehicle types. For the later quality comparisons between the playback setups, however, this selection of scenes is convenient, because it covers different speeds and vehicle types. Since the original recordings were made using a Brüel & Kjaer 4100D dummy head, we applied a DF filter to account for differences between the B&K and the HEAD acoustics dummy heads. In the optimized parametric HRTF model introduced by Schwark et al. [31], a DF equalization was already applied. Table 1 specifies the scene parameters, which were present for all five playback conditions.
A time window spanning the 5 s before the car reaches the listener was set, as indicated by the red line in Figure 1. The scenes were captured using the setup described above and edited using ArtemiS Suite (v13.6). An analysis of the level diagrams of the recorded signals showed that some differences in the reproduced SPL persisted despite system calibration. In light of the numerous studies that have explored the significance of maximum SPL in scene perception, the normalization process focused on the maximum level. To account for both perception and frequency-dependent hearing, A-weighted levels were employed, and the scenes were adjusted to maintain consistent maximum SPL(A) values: the mean values of the recorded scenes for each vehicle type and velocity were computed, and a defined maximum value was established. This ensured uniformity of the presented stimuli and thus supported the validity of the subsequent analyses and the overall study findings.
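In signal-processing terms, this normalization reduces to one broadband gain per scene, chosen so that the maximum A-weighted level meets the defined target. A minimal sketch (the level analysis itself was performed in ArtemiS; the levels here are hypothetical):

```python
import numpy as np

def match_max_level(signal: np.ndarray, current_max_dba: float,
                    target_max_dba: float) -> np.ndarray:
    """Scale a stereo dummy-head recording so that its maximum A-weighted
    level reaches the common target. current_max_dba is assumed to come
    from an external level analysis; only a broadband gain is applied."""
    gain_db = target_max_dba - current_max_dba
    return signal * 10.0 ** (gain_db / 20.0)

# Hypothetical example: a scene peaking at 62.3 dB(A) raised to a defined
# common maximum of 65.0 dB(A), i.e., a +2.7 dB broadband gain.
scene = np.random.randn(2, 5 * 44100) * 0.01  # placeholder audio, 5 s stereo
adjusted = match_max_level(scene, 62.3, 65.0)
```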

2.5. Signal Analysis

To provide a concrete example, a comparison of the adjusted scenes demonstrates the implications of aligning A-weighted levels. In the alignment process, the maximum mean level values of both dummy-head ear recordings were matched across the different reproduced scenes. However, even with these adjustments, we observed variations of up to 5 dB in the quieter sections of the scenes. Figure 2 shows the A-weighted levels of the two dummy-head microphones for each played-back vehicle scene.
The two different simulation toolboxes (TASCAR versus TUD MATLAB toolbox), different recording setups in the laboratories and on the test track, and the transfer functions of the three different binaural receivers may have led to differences in the levels that were reproduced. Furthermore, the different room treatments of the laboratories and the semi-free-field condition on the test track played important roles. In addition, a significant proportion of the recordings on the test track were made at higher wind speeds, which likely caused relatively strong wind-related level fluctuations in the vehicle sound recorded by the dummy head on the test track. In contrast, due to the close proximity of the microphones attached to the vehicle’s chassis to the vehicle sound sources, these recordings were virtually unaffected by wind speed, and we excluded recordings by the microphones mounted on the vehicle that contained wind noise.
To obtain a first impression of the reproduction-based characteristics of the recorded scenes, selected psychoacoustic values were calculated, as shown in Table 2.
For the analysis, sharpness was calculated according to Aures and loudness according to ISO 532-1 [32]. Comparing the max SPL and max A-weighted SPL makes the variation in the SPL maxima evident; see, for instance, the variation in the SPL of the ICEV_10kmh scenes and the large difference between max SPL and max A-weighted SPL. This indicates that (a) there is considerable low-frequency content in this scene (since A-weighting particularly reduces the level in this frequency range, resulting in large differences between SPL and A-weighted SPL) and (b) the playback methods differ in their ability to reproduce this low-frequency content (as visible from the differences in SPL). Additionally, sharpness values differ among the scenes. Sharpness arises from a high amount of high-frequency energy relative to the total; the highest sharpness values were found for the test-track recordings. Comparing the 41-channel HOA-VBAP and WFS systems, differences are also apparent: the WFS system exhibits slightly higher sharpness values and thus tends to reproduce sound with a different ratio of high- to low-frequency energy. The loudness values are generally similar across the playback environments. Combined, these values highlight differences in audio reproduction across the playback systems and the real-world recordings. The variations in sharpness and loudness suggest that laboratory simulations and real-world scenarios can yield different perceptual experiences. In the following listening test, we investigated whether different level or frequency characteristics influence the perception of the driving-by scenes.
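The gap between max SPL and max A-weighted SPL follows directly from the shape of the A-weighting curve, which strongly attenuates low frequencies. A short sketch of the standard IEC 61672 weighting function illustrates the size of the effect:

```python
import numpy as np

def a_weighting_db(f: np.ndarray) -> np.ndarray:
    """A-weighting gain in dB (IEC 61672) as a function of frequency in Hz."""
    f2 = f ** 2
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20.0 * np.log10(ra) + 2.0  # +2.0 dB normalizes to 0 dB at 1 kHz

# Low frequencies are attenuated by tens of dB, so a scene rich in
# low-frequency energy shows a large SPL vs. SPL(A) difference.
for freq in (31.5, 63.0, 125.0, 1000.0):
    print(f"{freq:7.1f} Hz -> {a_weighting_db(np.array([freq]))[0]:+6.1f} dB")
```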

2.6. Listening Test

2.6.1. Attributes

While existing studies often focus on aspects such as detectability or general unpleasantness, the objective of this paper is to explore a broad perceptual spectrum of vehicle scenes. The literature offers a structural framework for investigating the characteristics of 3D playback systems, which guided the selection of attributes in this study [7]. Drawing on different perceptual dimensions, the chosen attribute groups included source localization, timbral balance, loudness, room-related attributes, and naturalness (or a realistic feeling). These attributes were selected because they can be rated intuitively when evaluating driving-by scenes. Additionally, an extra attribute related to velocity perception was incorporated, considering its critical role in safety-relevant investigations. Table 3 delineates the attribute groups and the specific attributes.
Participants were prompted to assess these attributes, utilizing a rating scale ranging from 0 to 100, as shown in Figure 3.

2.6.2. Playback System

To ensure that differences in the environments would not affect the results, we played back the processed recordings via headphones (HD650, Sennheiser, Wedemark, Germany).

2.6.3. Procedure

Participants were instructed to assume a comfortable seating position before participating in the experiment. Prior to the actual rating tasks, they received general information about the experimental procedures and were given the opportunity to familiarize themselves with the perceptual attributes. During this phase, participants were encouraged to ask questions and clarify any uncertainties regarding the attributes presented. Once ready, participants put on their headphones, and the experiment unfolded in two distinct parts, each comprising 30 stimuli (the 30 unique scenes) and focusing on three perceptual attributes. In the first part, participants evaluated the attributes dark–bright, realistic–unrealistic, and soft–loud. The second part encompassed the assessment of all remaining attributes, as detailed in Table 3. Before each experimental segment, participants completed a training session featuring eight of the thirty scenes. The training sessions incorporated scenes from each playback system, including both the quietest and loudest stimuli, to expose participants to a wide range of scene attributes. While the results of these training trials did not contribute to the final analysis, they served to acquaint participants with the rating scale and the diverse characteristics of the scenes. Following the training phase, the main experiment commenced and spanned approximately 35 min. The experimental structure is shown in Figure 4.

2.6.4. Participants

A total of 24 participants (18 male, 6 female) voluntarily participated in the listening study. The participants’ ages ranged from 19 to 61 years (M = 30 years; SD = 10 years) and all were self-reported normal-hearing listeners. This experiment was approved by the ethics committee of the TU Dresden, with the number SR-EK-68022020.

3. Results

3.1. Subjective Ratings of ICEV Driving-by Scenes

As shown in Figure 5, the patterns of ratings for the six sound attributes generally showed a high degree of similarity between the playback systems, although there were differences. The first noticeable point concerns the interaction effects between the scene and playback factors. For the attributes “soft–loud” and “dark–bright”, the interaction effects were significant, with a medium effect size according to Cohen [33]. In contrast, the significant interaction effects for the attributes “slow–fast” and “realistic–unrealistic” showed a medium-to-strong effect size. The most important influences of the playback systems on the attribute ratings are analyzed first. For the “soft–loud” attribute, there was no significant influence of the playback configurations on the perception. This was expected, because the scenes were matched in A-weighted level. Additionally, the different playback setups did not have a significant influence on the ratings of the room-related attributes. Nevertheless, the ratings of the “limited–open” attribute showed some tendencies: every test-track recording, especially at higher velocities, was rated as slightly more open compared to the recordings of the other playback systems. For the “dark–bright” attribute, the effect of the playback system on the attribute rating was highly significant. The 41-channel HOA-VBAP configuration and the virtual binaural receiver simulating this setup were perceived as very dark. The test-track recordings were rated as the brightest, followed by the 16-channel HOA and WFS systems. The speed perception of the vehicle scenes, represented by the attribute “slow–fast”, was also significantly influenced by the different playback systems. Comparing speed perception between the 16-channel HOA and WFS systems reveals that vehicles in the WFS system were perceived as slightly faster. It remains to be investigated whether this was due to differences in sound playback (HOA vs. WFS) or differences in the simulation (TASCAR vs. TUD MATLAB toolbox).
The speed of the virtual dummy-head receiver recordings tended to be perceived as lower than that of the other playback systems. For speeds above 10 km/h, speed perception in the 41-channel HOA-VBAP and WFS systems was nearly the same. Regarding the “realistic–unrealistic” attribute, the playback system had a significant influence: for speeds above 10 km/h, the test-track recording was perceived as most realistic. Different speeds were expected to affect the ratings of attributes like dark–bright because of the associated differences in engine speed and engine sound. Further investigation is needed into how these scene parameters affect perception.

3.2. Subjective Ratings of EV Driving-by Scenes

The interaction effects for the “soft–loud”, “dark–bright”, and “realistic–unrealistic” attributes were significant. For the EV scenes, the influence of the playback systems on the attribute ratings was again considered (Figure 6). The rating of the attribute “soft–loud” merits attention first, as it showed clear tendencies in the participants’ perceptual behavior. Notably, EV scenes were perceived as generally quieter than ICEV scenes. Specifically, the electric vehicle with the AVAS switched on was perceived as 10 to 20% quieter than the combustion-engine vehicle when comparing scenes at the same constant velocity of 20 km/h, although the scenes containing different vehicle types were matched in A-weighted level. For the attribute “dark–bright”, the EV scene with the AVAS switched on was perceived as significantly brighter; the AVAS sound is perceived to have distinct timbre characteristics compared to traditional engine sounds. When examining scenes with varying speed parameters, the test-track recording was perceived as brighter compared to the playback methods, a characteristic present in both ICEV and EV scenes. Furthermore, the “realistic–unrealistic” attribute revealed a consistent tendency for the test-track recordings to be perceived as more realistic, as was the case for the ICEV scenes. Comparing the ratings for the attribute “slow–fast” shows that the vehicles in the WFS and test-track conditions were generally perceived as faster than those in the other playback environments.

3.3. Tables with the Results of the Statistical Analysis

The results were analyzed using repeated-measures ANOVAs (rmANOVA), with an α-level of 0.05 for all analyses. Table 4 and Table 5 contain the rmANOVA results for every attribute; the preceding subsections discussed the distributions of the ratings in more detail.
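For readers who want to reproduce this style of analysis, the design maps onto a two-way repeated-measures ANOVA with scene and playback system as within-subject factors. A hedged sketch using Python’s statsmodels (the paper does not state which software was used; the file and column names are hypothetical):

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# One rating per participant, scene, and playback system for a given
# attribute (long format), e.g., the "soft-loud" ratings.
df = pd.read_csv("ratings_soft_loud.csv")  # hypothetical file

res = AnovaRM(
    data=df,
    depvar="rating",                      # attribute rating, 0-100
    subject="participant",
    within=["scene", "playback_system"],  # within-subject factors
).fit()
print(res.anova_table)  # F values, degrees of freedom, and p-values
```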

4. Discussion

Given that the majority of safety-relevant experiments in the field of automotive development are conducted in simulation environments, a variety of laboratory approaches have emerged as the prevailing methodologies. The present study compares the characteristics of two simulation toolboxes (TASCAR and the TUD MATLAB toolbox) and two playback systems, namely HOA-VBAP and WFS; in addition, the test-track and virtual dummy-head recordings were considered. The ideal scenario would be to see no differences in the perception of the audio reproduction between the different laboratory environments. In fact, analyses of model-based psychoacoustic parameters such as sharpness, and listeners’ ratings of various sound attributes, showed a high degree of similarity between the reproduction approaches. However, we also found some significant differences between the reproduction systems for certain attribute ratings. Before the test-track scenes were played back over the HEAD labO2 interface with the DF playback equalizer, an individual DF filter for the Brüel & Kjaer 4100D dummy head was used to process the signal, as had already been done for the signals recorded with the HEAD HSU III.3 dummy head. Despite the use of analogous processing techniques to minimize discrepancies between the recordings of the two dummy heads, it was not possible to eliminate them entirely. The sharpness of the test-track recordings was higher compared to the recordings in the laboratories (see Table 2), underscoring that higher-frequency content is remarkably dominant in the test-track scenes. The 16-channel HOA system showed results similar to the WFS system, whereas the 41-channel HOA-VBAP system was perceived as somewhat darker, presumably due to the additional subwoofer. The ICEV scenes were perceived as generally darker with decreasing velocity. One potential explanation for this phenomenon is the engine orders of the combustion engine: as the rotational speed decreases, the engine-order frequencies also decline. This is particularly noticeable in the 41-channel HOA-VBAP system, which reproduces this low-frequency content with greater prominence. Psychoacoustic metrics such as sharpness underline the differences in the timbral balance of the different reproduction methods and the test-track recordings. This suggests that in future recording-based simulations, it could be useful to compare the frequency spectrum recorded at the listener’s position in the laboratories with the spectrum of actual test-track dummy-head recordings, to ensure that the simulation closely mirrors the real traffic scenario. However, precise comparisons are only possible when the ambient noise level during the test-track dummy-head recordings is very low and the days are windless; both criteria are quite difficult to meet. In fact, in the present analyses, the test-track dummy-head recordings contained significant ambient noise and were affected by wind, which likely affected all psychoacoustic metrics as well as the listening test results. It is therefore important to treat these results with caution and view them as indicative rather than definitive. The use of two different dummy-head models for the real test-track recordings and the laboratory recordings introduced additional deviations.
In addition, the observed differences can probably be attributed to the differing loudspeaker configurations, circular setups, and reproduction methods employed in each laboratory. These variations can lead to slight differences in the transfer functions of the sounds, subsequently influencing perception, as already investigated in previous playback-method comparisons [6,9]. Additionally, the WFS system utilizes a unique filtering method developed to balance its frequency response against an ideal pink-noise reference. This method, based on third-octave-band filter magnitudes, produces a filter curve with limited accuracy, potentially causing minor differences in the reproduced signal. When examining the dynamic perception of the stimuli, these findings align with known issues from previous research. A prior study revealed that TTCs were underestimated in the WFS system compared to the 16-channel HOA system, despite the use of identical stimuli [34]; an erroneous level calibration was suggested as a possible explanation. In the present study, by contrast, special attention was given to the detailed calibration of the systems: by matching the maximum A-weighted levels and examining small deviations, efforts were made to ensure reproduction accuracy. Nevertheless, the velocity perception results for the ICEV scenes indicated a similar trend: participants tended to perceive approaching vehicles as faster in the WFS than in the 16-channel HOA approach. Why this occurs cannot be conclusively explained. Previous investigations have already revealed differences between sound reproductions of different playback methods, especially in coloration [6,9]. As such differences also occur between the laboratories in Mainz and Dresden, future studies should explore whether they have an impact on safety-relevant decisions made by pedestrians in general.
Notably, at higher velocities, test-track scenes were rated as more realistic and slightly more open compared to their simulated counterparts. The most likely explanation for this result is that ambient background noise (birds, aircraft noise, distant traffic, etc.) was present in the test-track recordings but not in the simulated vehicle approaches. One potential improvement could be to record the ambient noise of the real environment and play it back alongside the sound of the reproduced vehicle scenes to better approximate real-world conditions.
While all of these attributes could be rated in a satisfactory manner, the participants tended to have problems rating the room-related attributes, as their rating values are mostly around 50%, indicating that they did not distinctly choose one extreme or the other. The attribute “limited–open” showed at least some tendency; e.g., the test-track recordings were rated as slightly more open. The “sounds bigger” attribute, in particular, had larger 95% confidence intervals and was mainly rated in the 40 to 60 percent range. Neither of these attribute ratings was significantly influenced by the different playback methods, so the results are neutral rather than showing actual tendencies in localization ability or room impression. The labs were treated differently, and having listened carefully to the drive-by scenes, we expected these attributes to vary between the different playback systems. As this was not the case, future work should focus on the selection of better attributes for investigating spatial perception in audio reproduction methods.
Beyond the comparison of the general attributes, the analysis of the loudness ratings also revealed interesting tendencies. Despite equal A-weighted levels of the vehicle scenes, participants perceived the EV scenes at a constant velocity of 20 km/h as quieter than the corresponding ICEV scenes. Rating the loudness of electric-vehicle scenes may have been more challenging for participants, as these vehicles do not emit the typical engine sounds that pedestrians are accustomed to in traffic dominated by combustion-engine vehicles. This finding is consistent across playback systems, emphasizing the challenge pedestrians face in perceiving electric vehicles compared to conventional combustion-engine vehicles. The lower perceived loudness of the EVs observed in the present study might imply an impaired perception of electric vehicles; other studies, such as [35,36], support this assumption. Although current regulations and guidelines for AVAS warning sounds focus on specific noise levels [37,38], this study demonstrates that the participants nevertheless rated the EV scenes as quieter. Therefore, when developing AVAS sounds, their detectability must be ensured. As Bazilinskyy et al. [39] suggest, one potential method for enhancing the recognizability of a vehicle is to increase the loudness of the vehicle’s sound. That could be feasible and would ensure perceptually comparable sound levels for EVs relative to combustion-engine vehicles; furthermore, it could enhance pedestrian safety and address the auditory challenges posed by the increasing prevalence of electric vehicles.
Finally, it is important to note that despite the moderate differences in perceived sound attributes between the playback systems observed in the present study, this does not automatically imply that, e.g., time-to-collision estimation [24] or street-crossing decisions [25] can be expected to differ between the systems. For instance, even when the vehicle sounds are perceived as somewhat brighter in one system than in another, participants could be equally able to gather information about the motion of the approaching vehicle from the dynamic changes in loudness or azimuthal angle provided by the auditory stimulus [40]. Thus, additional research is required to investigate whether behavioral tasks such as TTC estimation or street-crossing decisions also show differences between the playback systems.

5. Conclusions

This investigation revealed differences between current sound reproduction methods using auditory-only stimuli. On the one hand, there are timbral and minor loudness-perception differences between the playback methods; on the other hand, the stronger brightness perception associated with the real test-track scenes stood out. In addition, the velocity perceptions of the driving-by scenes show plausible results, although some minor deviations exist, some of which can be related to the methods of previous studies. It remains to be investigated whether coloration differences in reproduced driving-by scenes actually influence important decisions of pedestrians in traffic scenarios. Beyond the comparison of the reproduction methods, one general trend in the loudness-perception ratings underlined the reduced detectability of electric vehicles. Especially when developing AVAS sounds, special attention should be given to the question of comparable sound levels between EVs and ICEVs.

Author Contributions

Conceptualization, J.K.; methodology, J.K., D.O.-T. and T.H.; investigation, J.K.; resources, D.O.-T. and T.H.; data curation, D.O.-T.; writing—original draft preparation, J.K.; formal analysis, J.K.; visualization, J.K.; supervision, E.A.; project administration, J.K.; funding acquisition, E.A. and D.O.-T.; writing—review and editing, T.H., D.O.-T., M.M.M. and E.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft) AL 1473/13-1 (Ercan Altinsoy)/OB 346/8-1 (Daniel Oberfeld)—Project ID 444809588. The project is part of the priority program AUDICTIVE—SPP2236.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of Technische Universität Dresden (protocol code SR-EK-68022020 and date of approval 4 December 2020).

Informed Consent Statement

Informed consent was obtained from all participants involved in the study.

Data Availability Statement

The evaluation results of the participants can be obtained from the corresponding author in anonymized form.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wall Emerson, R.; Kim, D.S.; Naghshineh, K.; Pliskow, J.; Myers, K. Detection of Quiet Vehicles by Blind Pedestrians. J. Transp. Eng. 2013, 139, 50–56. [Google Scholar] [CrossRef]
  2. Franco, A.F.; Merchel, S.; Pesqueux, L.; Rouau, M.; Sørensen, M.O. Sound Reproduction By Wave Field Synthesis; Faculty of Engineering and Science, Aalborg University: Aalborg, Denmark, 2004. [Google Scholar]
  3. Blauert, J. (Ed.) 3-D-Lautsprecher-Wiedergabemethoden; DAGA: Dresden, Germany, 2008. [Google Scholar]
  4. Daniel, J.; Moreau, S.; Nicol, R. (Eds.) Further Investigations of High-Order Ambisonics and Wavefield Synthesis for Holophonic Sound Imaging; Audio Engineering Society: Amsterdam, The Netherlands, 2003. [Google Scholar]
  5. Spors, S.; Ahrens, J. (Eds.) A comparison of Wave Field Synthesis and Higher-Order Ambisonics with Respect to Physical Properties and Spatial Sampling; Audio Engineering Society: San Francisco, CA, USA, 2008. [Google Scholar]
  6. Spors, S.; Wierstorf, H.; Raake, A.; Melchior, F.; Frank, M.; Zotter, F. Spatial Sound With Loudspeakers and Its Perception: A Review of the Current State. Proc. IEEE 2013, 101, 1920–1938. [Google Scholar] [CrossRef]
  7. Nicol, R.; Gros, L.; Colomes, C.; Warusfel, O.; Noisternig, M.; Bahu, H.; Katz, B.F.G.; Simon, L.L.S. (Eds.) A Roadmap for Assessing the Quality of Experience of 3D Audio Binaural Rendering; Universitätsverlag der TU Berlin: Berlin, Germany, 2014. [Google Scholar]
  8. Singh, S.; Payne, S.R.; Mackrill, J.B.; Jennings, P.A. Do experiments in the virtual world effectively predict how pedestrians evaluate electric vehicle sounds in the real world? Transp. Res. Part F Traffic Psychol. Behav. 2015, 35, 119–131. [Google Scholar] [CrossRef]
  9. Guastavino, C.; Katz, B.F.G. Perceptual evaluation of multi-dimensional spatial audio reproduction. J. Acoust. Soc. Am. 2004, 116, 1105–1115. [Google Scholar] [CrossRef] [PubMed]
  10. Moulin, S.; Nicol, R.; Gros, L. (Eds.) Spatial Audio Quality in Regard to 3D Video. In Acoustics; HAL: Nantes, France, 2012. [Google Scholar]
  11. Wierstorf, H. Perceptual Assessment of Sound Field Synthesis. Dissertation; Technische Universität: Berlin, Germany, 2014. [Google Scholar]
  12. Kaplanis, N.; Bech, S.; Lokki, T.; van Waterschoot, T.; Holdt Jensen, S. Perception and preference of reverberation in small listening rooms for multi-loudspeaker reproduction. J. Acoust. Soc. Am. 2019, 146, 3562. [Google Scholar] [CrossRef]
  13. Berg, J.; Rumsey, F. Spatial Attribute Identification and Scaling by Repertory Grid Technique and Other Methods. In Proceedings of the AES International Conference: Spatial Sound Reproduction, Rovaniemi, Finland, 10–12 April 1999. [Google Scholar]
  14. International Telecommunication Union. Methods for Selecting and Describing Attributes and Terms in the Preparation of Subjective Tests; ITU: Geneva, Switzerland, 2017; BS.2399-0. [Google Scholar]
  15. Störig, C.; Pörschmann, C. Investigations into Velocity and Distance Perception Based on Different Types of Moving Sound Sources with Respect to Auditory Virtual Environments. J. Virtual Real. Broadcast. 2014, 10, 1–22. [Google Scholar]
  16. Pörschmann, C.; Störig, C. Investigations Into the Velocity and Distance Perception of Moving Sound Sources. Acta Acust. United Acust. 2009, 95, 696–706. [Google Scholar] [CrossRef]
  17. Altinsoy, E.M.; Jekosch, U.; Merchel, S.; Landgraf, J. (Eds.) Progress of Auditory Perception Laboratories- Multimodal Measurement Laboratory of Dresden University of Technology; Audio Engineering Society: San Francisco, CA, USA, 2010. [Google Scholar]
  18. Deutsches Institut für Normung. DIN 15996: Bild- und Tonbearbeitung in Film-, Video- und Rundfunkbetrieben—Grundsätze und Festlegungen für den Arbeitsplatz; Beuth Verlag GmbH: Berlin, Germany, 2020. [Google Scholar]
  19. BS.1116-1; Methods for the Subjective Assessment of Small Impairments in Audio Systems Including Multichannel Sound Systems. ITU Radiocommunication Bureau: Geneva, Switzerland, 2003.
  20. Beyer, F.; Fischer, S.; Steinbach, L.; Altinsoy, M.E. (Eds.) Comparison of Recorded and Synthesized Stimuli of Traffic Scenarios in an Auditory Virtual Reality Environment Using Wave Field Synthesis; DAGA: Hamburg, Germany, 2023. [Google Scholar]
  21. Pulkki, V. Virtual sound source positioning using vector base amplitude panning. J. Audio Eng. Soc. 1997, 45, 456–466. [Google Scholar]
  22. Daniel, J. Représentation de Champs Acoustiques, Application à la Transmission et à la Reproduction de Scènes Sonores Complexes Dans un Contexte Multimédia. Ph.D. Thesis, Université Pierre et Marie Curie, Paris, France, 2000. [Google Scholar]
  23. Gerzon, M.A. Ambisonics in multichannel broadcasting and video. J. Audio Eng. Soc. 1985, 33, 859–871. [Google Scholar]
  24. Oberfeld, D.; Wessels, M.; Büttner, D. Overestimated time-to-collision for quiet vehicles: Evidence from a study using a novel audiovisual virtual-reality system for traffic scenarios. Accid. Anal. Prev. 2022, 175, 106778. [Google Scholar] [CrossRef] [PubMed]
  25. Oberfeld-Twistel, D.; Wessels, M.; Kröling, S. Risiko hohe Beschleunigung: Straßenquerungsverhalten von Fußgänger:innen in Interaktion mit E-Fahrzeugen (mit und ohne AVAS) im Vergleich zu Verbrennern; 1. Auflage; Gesamtverband der Deutschen Versicherungswirtschaft: Berlin, Germany, 2021; ISBN 9783948917074. [Google Scholar]
  26. Wessels, M.; Kröling, S.; Oberfeld, D. Audiovisual time-to-collision estimation for accelerating vehicles: The acoustic signature of electric vehicles impairs pedestrians’ judgments. Transp. Res. Part F Traffic Psychol. Behav. 2022, 91, 191–212. [Google Scholar] [CrossRef]
  27. Wessels, M.; Zähme, C.; Oberfeld, D. Auditory Information Improves Time-to-collision Estimation for Accelerating Vehicles. Curr. Psychol. 2023, 42, 23195–23205. [Google Scholar] [CrossRef]
  28. Grimm, G.; Luberadzka, J.; Hohmann, V. A Toolbox for Rendering Virtual Acoustic Environments in the Context of Audiology. Acta Acust. United Acust. 2019, 105, 566–578. [Google Scholar] [CrossRef]
  29. El-Rabbany, A. Introduction to GPS: The Global Positioning System; Artech House: London, UK, 2002. [Google Scholar]
  30. Allen, J.B.; Berkley, D.A. Image Method for Efficiently Simulating Small-Room Acoustics. J. Acoust. Soc. Am. 1979, 65, 943–950. [Google Scholar] [CrossRef]
  31. Schwark, F.; Schädler, M.R.; Grimm, G. (Eds.) Data-Driven Optimization of Parametric Filters for Simulating Head-Related Transfer Functions in Real-Time Rendering Systems. In Proceedings of the Euroregio BNAM 2022, Aalborg, Denmark, 9–14 May 2022. [Google Scholar]
  32. ISO 532-1; Acoustics – Methods for Calculating Loudness – Part 1: Zwicker Method (ISO 532-1:2017, Corrected Version 2017-11). International Organization for Standardization: Geneva, Switzerland, 2022.
  33. Cohen, J. A power primer. Psychol. Bull. 1992, 112, 155–159. [Google Scholar] [CrossRef]
  34. Steinbach, L.; Beyer, F.; Altinsoy, M.E.; Oberfeld-Twistel, D.; Wessels, M. (Eds.) Safety Investigation on Traffic Scenarios Using Virtual Environments in a Wave Field Synthesis Laboratory; DAGA: Stuttgart, Germany, 2022. [Google Scholar]
  35. Steinbach, L.; Altinsoy, M.E. Influence of an artificially produced stationary sound of electrically powered vehicles on the safety of visually impaired pedestrians. Appl. Acoust. 2020, 165, 107290. [Google Scholar] [CrossRef]
  36. Altinsoy, E. (Ed.) The Detectability of Conventional, Hybrid and Electric Vehicle Sounds by Sighted, Visually Impaired and Blind Pedestrians; Internoise: Innsbruck, Austria, 2013. [Google Scholar]
  37. Amtsblatt der Europäischen Union. Regelung Nr. 138 der Wirtschaftskommission für Europa der Vereinten Nationen (UNECE)—Einheitliche Bestimmungen für die Genehmigung Geräuscharmer Straßenfahrzeuge Hinsichtlich Ihrer Verringerten Hörbarkeit [2017/71]. 2017. Available online: http://data.europa.eu/eli/reg/2017/71(1)/oj (accessed on 23 June 2024).
  38. Amtsblatt der Europäischen Union. Verordnung (EU) Nr. 540/2014 des Europäischen Parlaments und des Rates—Vom 16. April 2014—Über den Geräuschpegel von Kraftfahrzeugen und von Austauschschalldämpferanlagen Sowie zur Änderung der Richtlinie 2007/46/EG und zur Aufhebung der Richtlinie 70/157/EWG. 2014. Available online: https://eur-lex.europa.eu/legal-content/DE/LSU/?uri=planjo:20140414-003 (accessed on 11 July 2024).
  39. Bazilinskyy, P.; Merino-Martínez, R.; Özcan, E.; Dodou, D.; de Winter, J. Exterior sounds for electric and automated vehicles: Loud is effective. Appl. Acoust. 2023, 214, 109673. [Google Scholar] [CrossRef]
  40. Jenison, R.L. On Acoustic Information for Motion. Ecol. Psychol. 1997, 9, 131–151. [Google Scholar] [CrossRef]
Figure 1. Schematic representation of the MATLAB (vR2022b) function for processing .tsc files into a playable file [28].
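For illustration, a minimal batch-rendering sketch in the spirit of Figure 1 is given below. It is not the authors' function: it assumes that the TASCAR toolbox [28] provides a command-line offline renderer (invoked here as tascar_renderfile, whose name and flags should be verified against the installed TASCAR version), and the folder layout is hypothetical.

```matlab
% Hedged sketch (not the authors' code): batch-convert TASCAR .tsc scene
% definitions into playable multichannel audio files from MATLAB.
% 'tascar_renderfile' and its arguments are assumptions; check the
% TASCAR documentation [28] for the actual offline-rendering interface.
scenes = dir(fullfile('scenes', '*.tsc'));   % hypothetical folder layout
for k = 1:numel(scenes)
    tscFile = fullfile(scenes(k).folder, scenes(k).name);
    wavFile = replace(tscFile, '.tsc', '.wav');
    % Call the (assumed) offline renderer; system() returns 0 on success.
    status = system(sprintf('tascar_renderfile -o "%s" "%s"', wavFile, tscFile));
    assert(status == 0, 'Rendering failed for %s', tscFile);
end
```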
Figure 2. Level versus time (time weighting: fast) diagrams for the mean A-weighted level values of both dummy-head ear recordings for each vehicle scene.
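The curves in Figure 2 combine A-frequency weighting with the "fast" exponential time weighting (125 ms time constant). The following sketch shows one way to compute such a level-versus-time curve in MATLAB, assuming a calibrated single-channel recording and the Audio Toolbox function weightingFilter; the file name and calibration factor are placeholders, not the authors' data.

```matlab
% Hedged sketch: A-weighted level vs. time with 'fast' time weighting,
% as plotted in Figure 2. File name and calibration are hypothetical.
[x, fs] = audioread('ICEV_50kmh_left_ear.wav');  % placeholder recording
x  = x(:, 1) * 1.0;               % scale to pascals with your calibration
wf = weightingFilter('A-weighting', fs);         % Audio Toolbox
xA = wf(x);                       % A-weighted sound pressure
tau   = 0.125;                    % 'fast' exponential time constant [s]
alpha = 1 - exp(-1 / (fs * tau));
p2  = filter(alpha, [1, alpha - 1], xA.^2);      % exponential mean square
LAF = 10 * log10(max(p2, eps) / (20e-6)^2);      % level re 20 uPa
plot((0:numel(LAF) - 1) / fs, LAF);
xlabel('Time [s]'); ylabel('L_{AF} [dB]');
```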
Figure 3. GUI for the rating experiment.
Figure 4. General structure of the listening experiment and the simulation/playback methods.
Figure 5. Average rating of the perceptual attributes of the ICEV driving-by-scenes. The numbers from one to one hundred represent the rating results as a percentage. The first word of the net legend (e.g., "soft" from the attribute pair "soft–loud") is related to zero percent. The vehicle was traveling at a constant speed of (a) 10 km/h, (b) 20 km/h and (c) 50 km/h.
Figure 6. Average rating of the perceptual attributes of the EV driving-by-scenes. The numbers from one to one hundred represent the rating results as a percentage. The first word of the net legend (e.g., "dark" from the attribute pair "dark–bright") is related to zero percent. The vehicle was traveling at a constant speed of (a) 20 km/h, (b) 30 km/h and (c) 40 km/h.
Table 1. The driving-by-scene conditions listed below were used in every playback environment.

Vehicle Type    Constant Velocity [km/h]
ICEV            10
ICEV            20
ICEV            50
EV              20 (AVAS)
EV              30 (no AVAS)
EV              40 (no AVAS)
Table 2. Maximum values of psychoacoustic and physical parameters for the recorded stimuli.

Scene                                 Max SPL [dB]   Max SPL(A) ¹ [dB]   Sharpness S [acum]   Loudness N [sone]
ICEV_10km/h_16chHOA                   78.8           69                  2.2                  25.8
ICEV_10km/h_41chHOA                   97.6           69                  2.0                  29.7
ICEV_10km/h_testtrack_recording       86.2           69                  2.7                  26.3
ICEV_10km/h_binaural_receiver         93.9           69                  2.1                  27.5
ICEV_10km/h_WFS                       92.2           69                  2.3                  27.9
ICEV_20km/h_16chHOA                   81.2           73                  2.0                  33.9
ICEV_20km/h_41chHOA                   93.6           73                  2.1                  34.2
ICEV_20km/h_testtrack_recording       84.5           73                  2.7                  32.6
ICEV_20km/h_binaural_receiver         93.0           73                  2.1                  35.5
ICEV_20km/h_WFS                       89.7           73                  2.4                  34.1
ICEV_50km/h_16chHOA                   90.9           86                  2.6                  66.5
ICEV_50km/h_41chHOA                   93.6           86                  2.7                  66.8
ICEV_50km/h_testtrack_recording       89.4           86                  3.0                  62.5
ICEV_50km/h_binaural_receiver         96.2           86                  2.5                  69.4
ICEV_50km/h_WFS                       95.9           86                  2.8                  70
EV_20km/h_AVAS_16chHOA                76.8           72                  1.6                  27.5
EV_20km/h_AVAS_41chHOA                77.9           72                  1.6                  27.2
EV_20km/h_AVAS_testtrack_recording    75.7           72                  2.2                  29
EV_20km/h_AVAS_binaural_receiver      80.2           72                  1.7                  28.8
EV_20km/h_AVAS_WFS                    78.8           72                  1.7                  26.3
EV_30km/h_16chHOA                     81.0           73                  1.8                  32.1
EV_30km/h_41chHOA                     80.9           73                  1.7                  29.1
EV_30km/h_testtrack_recording         76.4           73                  2.3                  29.2
EV_30km/h_binaural_receiver           81.6           73                  1.7                  30.9
EV_30km/h_WFS                         81.5           73                  1.9                  31.3
EV_40km/h_16chHOA                     85.5           78                  2.0                  44.7
EV_40km/h_41chHOA                     85.1           78                  2.0                  42.9
EV_40km/h_testtrack_recording         80.2           78                  2.5                  39.4
EV_40km/h_binaural_receiver           89.3           78                  2.1                  45.5
EV_40km/h_WFS                         87.3           78                  2.3                  43.2

¹ The playback systems were equalized according to their maximum A-weighted level.
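As a rough guide to reproducing the maxima in Table 2, the sketch below computes time-varying loudness and sharpness with the Audio Toolbox functions acousticLoudness and acousticSharpness, which implement ISO 532-1 [32]; the file name, calibration handling, and option usage are assumptions to be checked against the toolbox documentation. The last lines illustrate the single-gain equalization described in footnote 1, with example level values.

```matlab
% Hedged sketch for the Table 2 psychoacoustic maxima (not the authors'
% analysis code). The input must be calibrated; check the toolbox docs
% for the expected calibration convention.
[x, fs] = audioread('EV_40kmh_WFS.wav');            % placeholder stimulus
x = x(:, 1);
N = acousticLoudness(x, fs, 'TimeVarying', true);   % loudness [sone], ISO 532-1
S = acousticSharpness(x, fs, 'TimeVarying', true);  % sharpness [acum]
fprintf('max N = %.1f sone, max S = %.1f acum\n', max(N), max(S));

% Footnote 1: equalizing a playback system to a reference by its maximum
% A-weighted level reduces to one gain factor (example dB values):
LAFmax_ref = 78.0;  LAFmax_sys = 75.5;              % placeholder levels
gain = 10^((LAFmax_ref - LAFmax_sys) / 20);         % apply to system signal
```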
Table 3. The perceptual attributes used and their corresponding dimensions.

Dimension              Attributes
Timbral balance        Dark–bright
Naturalness            Realistic–unrealistic
Room-related           Limited–open
Source localization    Sounds from one point–sounds bigger
Loudness               Soft–loud
Velocity perception    Slow–fast
Table 4. The table contains the rmANOVA results, including the significance of the different factors for the ICEV driving-by-scenes. For every attribute, an rmANOVA was conducted.

Attribute               Factor           F        df (Numerator)   df (Denominator)   p        Part. η²
Soft–loud               Scene            27.97    1.42             32.76              <0.001   0.549
                        Playback         1.22     3.00             68.97              0.310    0.050
                        Scene*Playback   2.15     8.00             184.00             0.033    0.085
Dark–bright             Scene            19.47    1.48             33.94              <0.001   0.458
                        Playback         20.87    3.27             75.14              <0.001   0.476
                        Scene*Playback   2.02     8.00             184.00             0.046    0.081
Slow–fast               Scene            123.04   1.69             38.80              <0.001   0.843
                        Playback         5.70     3.11             71.58              0.001    0.198
                        Scene*Playback   2.70     7.04             161.98             0.011    0.105
Realistic–unrealistic   Scene            3.97     2.00             46.00              0.026    0.147
                        Playback         4.56     3.55             81.54              0.003    0.165
                        Scene*Playback   3.09     8.00             184.00             0.003    0.119
Limited–open            Scene            0.10     2.00             46.00              0.908    0.004
                        Playback         1.97     4.00             92.00              0.105    0.079
                        Scene*Playback   1.29     8.00             184.00             0.250    0.053
Sounds bigger           Scene            1.15     2.00             46.00              0.326    0.048
                        Playback         1.24     4.00             92.00              0.299    0.051
                        Scene*Playback   0.42     6.79             156.09             0.883    0.018
Table 5. This table contains the rmANOVA results, including the significance of the different factors for the EV driving-by-scenes.

Attribute               Factor           F        df (Numerator)   df (Denominator)   p        Part. η²
Soft–loud               Scene            20.34    2.00             46.00              <0.001   0.469
                        Playback         3.32     4.00             92.00              0.014    0.126
                        Scene*Playback   3.57     8.00             184.00             <0.001   0.134
Dark–bright             Scene            23.24    1.71             39.44              <0.001   0.503
                        Playback         17.85    3.04             69.94              <0.001   0.437
                        Scene*Playback   5.25     8.00             184.00             <0.001   0.186
Slow–fast               Scene            16.81    1.25             28.77              <0.001   0.422
                        Playback         8.28     4.00             92.00              <0.001   0.265
                        Scene*Playback   1.15     8.00             184.00             0.334    0.048
Realistic–unrealistic   Scene            4.92     1.13             26.07              0.031    0.176
                        Playback         3.44     3.09             71.04              0.020    0.130
                        Scene*Playback   2.62     8.00             184.00             0.010    0.102
Limited–open            Scene            1.85     2.00             46.00              0.168    0.075
                        Playback         2.44     3.40             78.29              0.063    0.096
                        Scene*Playback   1.46     8.00             184.00             0.173    0.060
Sounds bigger           Scene            4.60     1.19             27.46              0.035    0.167
                        Playback         2.58     3.46             79.52              0.052    0.101
                        Scene*Playback   0.74     6.59             151.68             0.627    0.031
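The non-integer degrees of freedom in Tables 4 and 5 indicate a sphericity correction of the Greenhouse-Geisser type. The sketch below shows how a comparable two-factor repeated-measures ANOVA could be set up in MATLAB (Statistics and Machine Learning Toolbox); the data layout, participant count, and variable names are assumptions, not the authors' analysis pipeline.

```matlab
% Hedged sketch of a two-factor rmANOVA of the kind reported in
% Tables 4 and 5. 'ratings' is assumed to hold one attribute's data:
% participants x 15 conditions (3 scenes x 5 playback methods).
ratings  = rand(24, 15);                       % placeholder data
scene    = categorical(repelem((1:3)', 5));    % within-subject factors
playback = categorical(repmat((1:5)', 3, 1));
within   = table(scene, playback, 'VariableNames', {'Scene', 'Playback'});
T  = array2table(ratings, 'VariableNames', "y" + (1:15));
rm = fitrm(T, 'y1-y15 ~ 1', 'WithinDesign', within);
ra = ranova(rm, 'WithinModel', 'Scene*Playback');
disp(ra)  % includes Greenhouse-Geisser-corrected p-values (pValueGG)
% Partial eta squared for an effect: SS_effect / (SS_effect + SS_error),
% taken from the SumSq column of the returned ANOVA table.
```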
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
