Ecological Validity of Immersive Virtual Reality (IVR) Techniques for the Perception of Urban Sound Environments

Immersive Virtual Reality (IVR) is a simulation technology used to deliver multisensory information to people under different environmental conditions. Applied in urban planning and soundscape research, IVR offers attractive possibilities for assessing urban sound environments with greater immersion for participants. In virtual sound environments, various topics and measures are designed to collect subjective responses from participants under simulated laboratory conditions. Soundscape or noise assessment studies during virtual experiences adopt an evaluation approach similar to in situ methods. This paper aims to review the approaches that are utilized to assess the ecological validity of IVR for the perception of urban sound environments and the necessary technologies during audio–visual reproduction to establish a dynamic IVR experience that ensures ecological validity. The review shows that, through the use of laboratory tests including subjective response surveys, cognitive performance tests and physiological responses, the ecological validity of IVR can be assessed for the perception of urban sound environments. The reproduction system with head-tracking functions synchronizing spatial audio and visual stimuli (e.g., head-mounted displays (HMDs) with first-order Ambisonics (FOA)-tracked binaural playback) represents the prevailing trend to achieve high ecological validity. These studies could contribute to a standardized evaluation framework for subjective soundscape and noise assessments in virtual environments.


Introduction
Ecological validity was introduced in the 1980s to evaluate the outcomes of a laboratory experiment focused on visual perception [1]. Ecological validity describes the degree to which results obtained in a controlled laboratory experiment are related to those obtained in the real world [2]. The discussion of the ecological approach regarding its internal validity and experimental control began in the 1980s with cognitive and behavioral psychology research [1,3], and these two factors remain significant in the design and undertaking of an ecological approach study. Under laboratory conditions, researchers should give participants corresponding environmental cues and instructions to enable the reactivation of the cognitive processes that participants elaborated in actual situations [4]. When ecological validity is high, the findings in the laboratory can be generalized to real-life settings [2]. As a simulation technology, Immersive Virtual Reality (IVR) places the user inside an experience, which allows the impact on participants of a new environment with complex social interactions and contexts to be assessed [5][6][7]. In 2001, Bishop et al. [8] reported their non-IVR assessments of path choices on a country walk, and they agreed that faster computers and better display systems make the virtual environment experience more credible. Thus, low ecological validity resulting from insufficient immersiveness could be a limiting factor for the generalizability of data collected from laboratory experiments.
The need for more research that explores applications of perceptual simulations in general and related questions of validity and reliability has been stressed ever since the emergence of environmental simulation as a research paradigm.
Ecological validity has been conceptualized into two approaches: verisimilitude and veridicality. Verisimilitude refers to the extent of similarity of a virtual experience to relevant environmental behaviors [9]; it reflects the similarity of the task demands between the test in the laboratory and the real world [10]. This approach attempts to create new evaluation assessments with ecological goals [11]. Veridicality refers to the degree of accuracy in predicting some environmental behaviors [12,13]; establishing veridicality requires comparing the results from the laboratory test with the measures in the real world. There are some limitations for both approaches. One limitation of the veridicality approach is that, for those conditions which are not likely to be reproduced in the real world or that have a high cost, the outcomes from real-world measures cannot be correlated with experiment results. When using the verisimilitude approach alone, no empirical data are needed to claim that the evaluation is similar to real-life settings [11].
Virtual reality has revealed a functional rapprochement that fuses the boundary between the laboratory and real life [5]. Through multisensory stimuli with experimental control, participants tend to respond realistically to virtual situations as if they were in a real environment [14][15][16][17]. The responses to a virtual environment are generated when place illusion (PI) and plausibility illusion (Psi) occur at the same time [14][15][16]. The ecological approach studies based on virtual reality provide controlled dynamic presentations of background narratives to enhance the affective experience and social interactions [3,18]. From a methodological viewpoint, environmental conditions and test results can be ecologically validated through virtual reality technologies according to a subjective evaluation framework. Numerous researchers have examined ecological validity in different topics and fields with the comparison of a virtual environment and real life [19][20][21][22][23].
Spatial audio is a technique for creating sound in 3D space so that a listener can hear the sound from any direction on a sphere [24]. Because of this feature, it is often combined with virtual reality to render auditory stimuli. For the auralization of spatial audio, head-tracking requires the reproduction of a dynamic sound field based on the real-time position of the head within Euclidean space. Binaural recordings only reproduce the sound field at both ears at the time of recording, which makes binaural recordings incompatible with head-tracking during auralization. Ambisonics is a sound reproduction technique used for recording and playing back spatial audio, and it is based on the spherical harmonic decomposition of the sound field [25]. Ambisonics enables a listener to experience a spatially accurate perception of the sound field [26], and this reproduction technique was originally introduced by Gerzon [27]. In the case of first-order Ambisonics (FOA), currently the most widespread Ambisonics recording technique, the signals are recorded as four audio channels, most usually in the so-called A-format. The audio files needed to reproduce such recordings are known as B-format audio files, which are converted from the A-format. The B-format can be decoded into any speaker array matching the needs of dynamic auralization under Immersive Virtual Reality (IVR), including higher-order Ambisonics (HOA). HOA has a higher spatial resolution based on higher-order spherical harmonics [25]. The head-related transfer function (HRTF) is a frequency response describing the sound pressure transformation from a free-field point source to the eardrum [28]. When this filtering is not applied with the listener's own HRTF (acknowledging individual head size, auricle size and shape, etc.) [29][30][31], front-back and elevation confusions in localization typically occur [32].
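The A-format to B-format conversion and the head-tracked rendering described above can be illustrated with a minimal sketch. It is not taken from any of the reviewed studies. It assumes a tetrahedral microphone whose capsules are ordered front-left-up, front-right-down, back-left-down and back-right-up, uses the textbook sum-and-difference matrix without capsule equalization or channel normalization, and handles head rotation about the vertical axis only. The function names are hypothetical.

```python
import math


def a_to_b_format(flu, frd, bld, bru):
    """Convert one A-format sample frame to first-order B-format (W, X, Y, Z).

    Assumed capsule order: front-left-up, front-right-down,
    back-left-down, back-right-up. Equalization filters and gain
    conventions used by real microphones are omitted.
    """
    w = flu + frd + bld + bru  # omnidirectional component
    x = flu + frd - bld - bru  # front-back component
    y = flu - frd + bld - bru  # left-right component
    z = flu - frd - bld + bru  # up-down component
    return w, x, y, z


def rotate_yaw(w, x, y, z, angle_rad):
    """Rotate the sound field about the vertical axis.

    Positive angles turn the field counter-clockwise when seen from above
    (X points to the front, Y to the left). W and Z are unaffected.
    """
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return w, c * x - s * y, s * x + c * y, z


def compensate_head_yaw(bformat, head_yaw_rad):
    """Counter-rotate the field so sources stay fixed in the world
    while the listener's head turns by head_yaw_rad (positive = left)."""
    return rotate_yaw(*bformat, -head_yaw_rad)


# A source straight ahead, heard after the head turns 90 degrees to the left,
# should end up on the listener's right (negative Y).
frame = a_to_b_format(1.0, 1.0, 0.0, 0.0)
turned = compensate_head_yaw(frame, math.pi / 2)
```

The rotated B-format frame would then be decoded to headphones or loudspeakers. A full head-tracked renderer would also handle pitch and roll and would apply the rotation per audio block using the tracker's latest orientation.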
For many urban sound assessment studies, in situ surveys have been widely applied as a conventional method to evaluate certain sound environments [5,33,34]. In soundscape or noise assessment studies, researchers expect the presentation of controlled experimental conditions to participants; e.g., recorded audio and reconstructed visual stimuli in a listening room. Therefore, researchers introduced laboratory tests to validate their research questions with human participation. All simulations under laboratory conditions attempt to represent some aspects of the environment as accurately as possible to assess human responses. In urban noise prediction and soundscape assessment research, an audio-visual system is a conventional and valid approach to render essential information or cues during human participation. The audio-visual interaction influences the perception of the soundscape and global environment, as shown in previous studies [35][36][37][38][39][40][41]. For interior spaces with VR techniques, several studies have assessed the evaluation of indoor noise protection with head-mounted displays (HMDs) [42], the main uses of auralization [43], the influence of visual distance [44], the use of water features [45] and the spatial representations of visually impaired participants [46,47]. The urban sound environment in this review refers primarily to sound sources originating outdoors or in urban public spaces, and it reflects, to some extent, the mobility of people and the multifunctionality of urban spaces.
The evaluated multisensory method shows enormous significance in helping participants to perceive environments holistically. The reproduction system of listening tests needed to be adapted to the purpose of the study to allow the subjects to treat the test samples as potentially familiar experiences through cognitive processes elaborated in actual situations. With the aid of immersive virtual reality, the installation of laboratory conditions was performed with the aim of reproducing urban sound environments and presenting a multisensory experience to participants. A subjective test of immersive virtual reality reproduction in urban sound environment assessments would show high veridicality if it correlated well with measures of perceptual responses in the real world.
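As an illustration of the veridicality criterion just stated, the sketch below computes a Pearson correlation between mean in situ ratings and mean laboratory ratings for the same sites. The rating values are invented placeholders, not data from any reviewed study, and the choice of Pearson's r is an assumption; the reviewed studies may rely on other statistics, such as tests for significant differences.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


# Hypothetical mean pleasantness ratings for five sites (placeholder values).
in_situ_means = [3.9, 2.1, 4.4, 3.0, 1.8]
laboratory_means = [3.6, 2.4, 4.1, 3.2, 2.0]

r = pearson_r(in_situ_means, laboratory_means)
print(f"in situ vs. laboratory correlation: r = {r:.2f}")
```

A high correlation alone does not rule out a systematic offset between the two settings, so a comparison of this kind would usually be paired with a test of mean differences.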
The concept of ecological validity has been extended from psychological experiments to the domain of complex sound environment perception. It is not only related to the evaluation methods during laboratory tests, but also closely associated with the developing IVR technologies. Attempting to establish a standardized soundscape evaluation protocol with high veridicality under an immersive virtual environment has a broader impact on the practice of soundscape planning and design. The research on soundscape standardization has discussed the definitions, variety of contexts, evaluation methods and reporting requirements [48,49].
The International Organization for Standardization Technical Standard (ISO TS) 12913-2 [34] introduced two common recording techniques in soundscape research: binaural and Ambisonics. The standard states that if some environmental factors are not present or differ during playback, the outcomes could possibly result in different impressions to those received in the original context. In terms of the statement of ISO TS 12913-2 [34], the validity of these auralization techniques combined with other environmental factors still presents some uncertainty. The ISO TS 12913-3 [50] stated that the key factors to consider when conducting ecologically valid laboratory studies are the effect of memory, the duration of exposure to each of the stimuli and the auditory immersiveness. As a multisensory tool, IVR could deliver more environmental stimuli than conventional 2D rendering methods. A comparison of the ecological validity using IVR for urban sound environments with different reproduction techniques and research topics was made. In this review, we aim to investigate (1) which kinds of approaches can be utilized and integrated to assess the ecological validity of IVR when humans perceive urban sound environments and (2) which technologies are necessary during audio-visual reproduction to establish a dynamic IVR system to assess the perception of urban sound environments.

Search Strategy and Eligibility Criteria
There were no pre-defined protocol registrations for this review. The basic process and data extraction forms were agreed upon at the beginning of the review work. The study was performed under the guidance of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [51].
Given the exploratory nature of this study, as many studies do not directly mention "ecological validity", and they may not include the terms "ecologically valid", "ecologically validate" and similar expressions, the studies were selected manually according to the following inclusion criteria: (1) original participatory studies using virtual reality techniques conducted in a laboratory, and (2) studies collecting subjective responses under virtual environments. The subject areas included "ecological validity", "acoustics" and "virtual reality". Some studies did not directly mention "ecological validity", but the workflow was under the framework of the virtual sound environment evaluation described above. These studies were selected in the review with full-text scanning.
Studies were identified by searching the electronic database, scanning reference lists of articles and in consultation with experts in the field. At least one search database should be used in systematic reviews [51], and a literature search was conducted using the Web of Science. Only peer-reviewed journal articles published in English were considered. This search was applied to the Web of Science (1980-2020). The last search was run on 01 July 2020. We used the following search terms to search the databases: "sound", "perception", "participant" and "virtual reality". An eligibility assessment of the review was performed independently in an un-blinded standardized manner by one researcher.

Data Extraction
Information was extracted from each included document regarding (1) the research focus of the studies, (2) participant numbers, (3) in situ responses vs. laboratory experiment data and (4) the main parameters selected in the studies. Considering ecological validity across the selected studies with various topics and different outcomes, a qualitative approach was adopted to answer the review questions.

Results
The initial results showed 65 documents. Fifty-three items were excluded because they failed to meet the eligibility criteria; the reasons for exclusion included (1) not using a virtual reality head-mounted device and (2) not involving sound-related perception. The full texts of the remaining 12 papers were accessed, and these 12 papers were included in the review. Table 1 shows the research focus, the participant numbers, in situ responses vs. experiment data, the main parameters and variables in these studies. These studies with IVR had different emphases on their subjective evaluation and research focuses, and they assessed ecological validity with subjective responses varying from environmental preferences/quality, audio/visual indicators, coupled interactions and reproduction quality.

Ecological Validity with Subjective Responses
Generally, these studies were not limited to one topic, and several topics were often integrated together. The audio-visual interaction was also one of the sub-topics of these studies. Most of these works addressed the importance of audio-visual interaction in IVR-based soundscape or noise assessments. The audio-visual interaction in these studies was discussed in an attempt to interpret how participants perceived the virtual environments, and the ecological validity was also tested with their research questions. Global environmental evaluation, visual and acoustic coherence and familiarity and visual and acoustic congruence were compared, respectively, through the field survey and the laboratory experiment to jointly validate the acoustic and visual congruence between the simulated and real world [5]. Two groups of 16 participants each were recruited, one for the in situ session and one for the laboratory session. Both comparison groups showed robust similarity in visuo-acoustic coherence and familiarity and the visuo-acoustic salience of urban, human and natural activities.
Related to audio-visual interactions, in 2020, Jeon and Jo [52] examined the contribution of audio and visual stimuli in the evaluation of urban environment satisfaction under an immersive virtual environment. Three conditions were considered: (1) audio only, (2) vision only and (3) audio-visual interaction. The contributions of audio and visual information on overall satisfaction were 24% and 76%, respectively. The study by Ruotolo et al. in 2013 [15] asked the participants to answer questions about auditory and visual annoyance, respectively. The results presented in their study showed both auditory information and visual information in a close interaction, supporting participants perceiving the virtual environment holistically. Aletta et al. in 2016 [53] carried out a study to investigate chiller noise in relation to the distance to a source and the visibility of a source, as well as the presence or absence of a visual reference context and of a cognitive task. They found that the visibility of a source is not a significant influencing factor for noise perception for the kinds of chillers examined in the study. In 2019, Jeon and Jo [60] carried out a study to assess the noise in urban high-rise residential buildings. They reported that the directional and visual information generated by HRTFs and HMDs could affect sound perception and virtual environmental immersion. Two parameters, HRTFs and HMDs, were coupled into four cases: no HRTF-no HMD, no HRTF-HMD, HRTF-no HMD and HRTF-HMD. The results showed that the contribution of the HRTF to subjective responses was 77%, higher than the contribution of the HMD, at 23%. This study showed the applicability and necessity of the HRTF and HMD to assess noise in terms of the audio-visual interactions under immersive virtual environments. In 2012, Iachini et al. [54] assessed acoustic comfort aboard metros through subjective annoyance and cognitive performance measures.
In their findings, visual contexts could be considered as a modulating method affecting noise annoyance for people. Noise barrier designs are generally associated with noise assessment, and different noise barrier designs were assessed under an immersive virtual environment [56,57]; different project solutions concerning noise mitigation in order to obtain more reliable results on local residents were also examined.
The potential environmental risks and negative effects of wind parks, as emerging landscape projects, were also evaluated under virtual environments. Maffei, Iachini, et al. in 2013 [55] stated that the noise perception of wind turbines under immersive virtual reality requires extended experiments to ensure its ecological validity, especially the results from in situ sessions. In 2017, Yu et al. [58] conducted a subjective test revealing that wind parks can increase both the aural and visual annoyance associated with personal attitudes toward wind parks. The research of virtual reality technologies in the sound environments of wind parks ecologically validated these potential negative influences.
Soundscape evaluations show a trend of using multi-dimensional attributes to test participants' perception in a virtual environment. In 2019, Hong et al. [59] carried out a study exploring the ecological validity of reproduced acoustic environments based on three spatial audio reproduction methods. The main indicators in their study included sound preferences, visual preferences, soundscape attributes, visual attributes and environmental satisfaction, as shown in Table 1. In 2019, Sun et al. [61] proposed a hierarchical soundscape classification method using virtual reality playback with a participatory experiment inside a soundproof booth. The method, based on different classification components, could be potentially validated by verification on an independent dataset.
In immersive virtual reality laboratory experiments, the numbers of participants differ in different subjective studies. The minimum number of participants in the subjective test was 16, in the work by Maffei et al. in 2016 [5], and the maximum number reached 71, in the work by Sanchez et al. in 2017 [57]. The number of participants for most subjective tests in the laboratory ranges from 20 to 60.

Reproduction Systems
The reproduction systems in these studies mainly include two aspects of auralization and visualization, as shown in Table 2. To simulate an immersive auditory environment, Ambisonics is a prevailing method to record and auralize sounds, which allows various decoding patterns with the flexibility to lay out loudspeaker positions or headphones. In the headphone-based reproduction method, the recorded stimuli captured in Ambisonics formats are most usually presented as either head-tracked or static binaural renderings [52,59]. In the loudspeaker array-based reproduction method, there is no need for software to compensate for head movement in real time [59]. The spatial recognition is influenced by the use of the HMD, and the impact of perceived realism will significantly increase compared with the condition without HMD [59]. Simulated visual environments can also be built using software including, but not limited to, 3ds Max [56,57], Google SketchUp [5,15,54,60], Unity [57,58], Kubity [60] and WorldViz [53][54][55]. In 2019, Hong et al. [59] reported no significant differences in perceived dominant sound sources and affective soundscape quality between reproduction and in situ results. These findings are in agreement with previous studies showing that IVR HMDs with Ambisonics could be a reliable tool for soundscape assessment as an alternative to in situ surveys.
Some devices have been introduced to record information and render stimuli. A panoramic camera is usually used to record omnidirectional videos as visual stimuli in the laboratory test [52,59]. A hybrid and simultaneous audio and video recording setup was used in the study by Sun et al. in 2019 [61]. This setup consists of binaural audio (an artificial head with windshield and binaural recording device), an FOA microphone and a 360° video camera. A mobile device (a Google Cardboard headset) was also used in the evaluation of the audio-visual perception of wind parks. This portable HMD also showed the potential to provide an immersive experience in response to participants' head movements.

Visual construction methods: 3ds Max [56,57]; Google SketchUp [5,15,54,60]; WorldViz [53][54][55]; Unity [57,58]; Kubity [60]; panoramic views [52,58,60,61].
Visual rendering: HMD [5,15,42,52-61].

Notably, because the entire IVR industry is driven by both hardware and software upgrades, older ecological validity studies on virtual environments face limitations in terms of their utility or efficacy. It would be expected that advances in the computation of IVR simulations would ultimately increase the ecological validity of participatory studies conducted in laboratories. A comparison of the technical parameters of all IVR systems in these studies would show the limitations of initial research and how these limitations have gradually been addressed by subsequent studies. However, due to the lack of control measures across the analyzed studies, it was not possible to conduct such a comparison, and we cannot systematically assess the differences between the studies.

Subjective Response, Cognitive Performance and Physiological Response
Many studies have suggested that urban noise can negatively affect people's cognitive functions and influence their daily life [62][63][64]. Subjective responses may not show annoyance regarding urban noise, but the cognitive performance may be affected. Thus, during the laboratory test, some studies also used cognitive tasks to evaluate the cognitive performance caused by the virtual environment [15,53,54,58]. Related to stress recovery, researchers have used measures based on the physiological responses of participants. Annerstedt et al. in 2013 [65] conducted a study to investigate the sounds of nature inducing physiological stress recovery, and the Trier Social Stress Test (TSST), as a highly standardized protocol for inducing stress, was applied in their study. Cortisol, heart rate, T-wave amplitude (TWA), and heart rate variability (HRV) were tested to analyze the physiological stress recovery induced by the sounds of nature. Hedblom et al. in 2019 [66] adopted mild electrical shocks and skin conductance measurements to evaluate the stress recovery under virtual environments with a birdsong-traffic noise interaction. Compared with subjective responses, physiological responses do not directly reflect the relationship between subjective sound preferences and characteristics of acoustic environments. Thus, these three methods can jointly assess the ecological validity of complex sound environment perception.

Other Visual Rendering Methods
For visual rendering, many studies used non-HMD options. Some of them adopted non-immersive methods, such as a monitor screen [8,19,45,[67][68][69], visual screen [70] and 2D projection [71,72]. Some of the studies utilized the immersive Cave Automatic Virtual Environment (CAVE) system [65,73]. The CAVE system was first introduced in 1992 [74], and the aim of its invention was to provide a one-to-many visualization experience that utilizes large projection screens [75]. Compared with a CAVE system, HMD has some problems, especially when one user is trying to interact with other users, and it does not offer interaction with real objects aside from VR control devices [76]. The large footprint, the cost of high-resolution projectors and the human-computer interaction are also reported to be limitations for a CAVE system [76].
Studies without visual stimuli were also conducted [4,77,78]. Including a visual component is nevertheless reasonable when examining the ecological validity of auditory perception. The coupled audio-visual interaction is associated with the spatial attributes of sound perception (e.g., distance, width and directionality [60]), and it also provides an animated visual anchor, improving the sense of presence and immersiveness during the subjective evaluation [77,79].

Verisimilitude and Veridicality
Verisimilitude and veridicality in IVR-based sound environment research have different emphases according to their definitions. Establishing verisimilitude and veridicality in a subjective evaluation experiment allows a virtual sound environment to be perceived with reliable ecological validity. The IVR research involved with verisimilitude in soundscape or noise assessments assumes that the stimuli in the test and the cognitive processing are sufficiently similar to the psychological construct of corresponding scenarios in the real world. The verisimilitude approach is likely to focus on specific tasks in the laboratory test similar to the task demands in the real world. The evaluation indicators and questionnaire design can be formatted in a quite similar way to a participatory experiment. Sanchez et al. in 2017 [57] pointed out that their study did not strictly prove that audio-visual designs in a virtual environment would lead to the predicted pleasantness of real environments. Establishing verisimilitude in soundscape evaluation is more intuitive compared with establishing a new cognitive task or a clinical neuropsychological assessment. However, when researchers discuss the relationship between subjective responses, cognitive performance and physiological responses, they need to examine the verisimilitude approach carefully, because some aspects of the testing conditions limit the real-world applicability of a method that lacks empirical data.
A few studies validated veridicality in IVR-based soundscape or noise assessments. The pioneering studies examined several fundamental playback systems. In 2005, Guastavino et al. [4] explored the linguistic analysis of verbal data in soundscape reproduction through a field survey and two listening tests. Both listening tests compared exposure to the stimuli reproduced via stereophonic and Ambisonics approaches. They pointed out that both neutral visual elements and a good sense of spatial immersion should be provided to ensure ecological validity when testing the effects of urban background noise. Both reproduction methods have been demonstrated to be ecologically valid tools in terms of source identification. However, IVR was not applied in their study. Many perceptual attributes and indicators have been selected to describe the similarity between the real world and the laboratory conditions. In 2016, Maffei et al. [5] compared the congruence between audio and visual elements, and there was no significant difference in the perceived global quality of the environments in both the simulated and real world in their results. The global quality of the environments was shown to have high veridicality under the framework of subjective evaluation. The findings are consistent with the results of audio-visual interaction evaluation studies conducted in urban sound environments. In 2019, Hong et al. [59] validated three Ambisonics reproduction methods and tested their veridicality under a virtual sound environment related to the performance of reproduction methods. Immersive virtual reality has been shown to be a valid tool to simulate multisensory environments not only by acousticians but also in clinical neuroscience, cognitive psychology and other research fields. When researchers adopt the verisimilitude approach, they believe that the reproduction system and the subjective test have veridicality. 
In addition, validating veridicality is difficult because of the complex contexts and unpredictability of outdoor sound environments. For outdoor sound environments, it is sometimes impossible to take real-world measurements; e.g., for a projected area that has not yet been built. Some contextual conditions also cannot be changed independently in the real world.
It is notable that two studies addressed realism in their subjective experiments. The study by Jeon and Jo in 2019 [60] validated that the usage of HMD significantly increased the impact on the recognition of realism. In 2019, Hong et al. [59] conducted both in situ and laboratory experiments to assess the performance of different Ambisonics reproduction systems in perception. They both successfully assessed realism in their studies. The former de-emphasized the verisimilitude to the real world, and they underlined the realism difference brought by HMD compared with the non-HMD condition. The latter conducted a veridicality study with in situ responses, and they described the degree to which different reproduction approaches were similar to reality. When both verisimilitude and veridicality are examined, the most ecologically valid studies [5,59] revealed the congruence between immersive virtual experience and real experience along with multisensory stimuli.

Limitations
An IVR system in soundscape or noise assessment should be adapted to the relationship between human cognition and subjective perception during the laboratory experiment. The diversity of IVR rendering techniques also leads to a non-standardized experience across participants. An online survey has been introduced as a non-IVR tool to evaluate soundscape and noise perception [68]. Web-based virtual reality was constructed in computationally cheap ways, and it could be improved with higher auralization and visualization quality. The one-to-one nature of tests also showed that the laboratory test cannot reach the sample size of traditional surveys. More economical and vivid reproduction systems following the development in hardware and software show higher veridicality.
HRTFs significantly contribute to localization performance [80,81]; e.g., sound recognition of the direction and width of the source [60]. Compared with the non-HRTF environment, the results of the subjective responses of immersion, realism and externalization are higher in the HRTF case [60]. Individualized and non-individualized HRTFs were utilized to assess various perceptual attributes by Simon et al. in 2016 [31]. It is necessary to select a suitable HRTF that is well matched to the listener's own HRTF [31] to ensure ecological validity in terms of sound source localization, and it can be an individualized HRTF or from a HRTF database. For different sound environments, such as a lively urban square with multiple water features, a quiet park or a park adjacent to a motorway, whether sound source localization is considered a key feature or not [82,83], the choice of an HRTF could differ in terms of ecological validity, and further studies are still needed.
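To make the role of the HRTF in headphone playback concrete, the following minimal sketch filters a mono signal with a left and a right head-related impulse response (HRIR, the time-domain counterpart of the HRTF) by direct convolution. The impulse responses are toy placeholders chosen only to mimic an interaural delay and level difference; they are not measured HRIRs, and the function names are hypothetical.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample sequences."""
    if not signal or not impulse_response:
        return []
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, sample in enumerate(signal):
        for j, tap in enumerate(impulse_response):
            out[i + j] += sample * tap
    return out


def render_binaural(mono, hrir_left, hrir_right):
    """Return (left, right) ear signals for one source direction."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy HRIR pair for a source on the listener's left (placeholder values):
# the right ear receives the sound two samples later and attenuated.
hrir_left = [1.0, 0.0, 0.0]
hrir_right = [0.0, 0.0, 0.6]

left, right = render_binaural([1.0, 0.5], hrir_left, hrir_right)
```

In practice the HRIR pair would come from an individual measurement or an HRTF database, one pair per source direction, and the convolution would typically be done with FFT-based block processing for efficiency.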
At present, a head-tracking display system synchronized with FOA-tracked binaural playback shows reliable validity in immersive virtual experiences for the perception of complex sound environments. Compared with FOA, HOA significantly improves the quality of this experience [84]. Different HOA systems have already been implemented in hearing aid research for subjects with hearing loss [85,86]. HOA is becoming popular in industrial applications such as Youtube360 and Facebook360 [87], and it shows great potential for the ecological validity of IVR in further urban sound environment studies.
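As an illustration of what head-tracked FOA playback involves, the following is a minimal sketch of the sound field rotation that compensates for a listener's head yaw before binaural decoding. It is not taken from any of the reviewed studies. The function names are hypothetical, and the sketch assumes traditional B-format channel ordering (W, X, Y, Z) with X pointing to the front, Y to the left, Z upwards and W scaled by 1/sqrt(2); other conventions (e.g., AmbiX) order and normalize the channels differently. The last helper gives the channel count of a full three-dimensional Ambisonics signal of order N, (N + 1)^2, which indicates why HOA offers higher spatial resolution at a higher data cost.

```python
import math


def encode_foa_horizontal(signal, azimuth_rad):
    """Encode a mono signal as a horizontal plane wave in B-format.

    Assumed convention: channels ordered W, X, Y, Z; azimuth measured
    counterclockwise from the front; W scaled by 1/sqrt(2).
    """
    w = [v / math.sqrt(2.0) for v in signal]
    x = [v * math.cos(azimuth_rad) for v in signal]
    y = [v * math.sin(azimuth_rad) for v in signal]
    z = [0.0 for _ in signal]  # no elevation component
    return [w, x, y, z]


def rotate_foa_yaw(bformat, head_yaw_rad):
    """Counter-rotate the sound field for a head turned by head_yaw_rad.

    A positive yaw means the head turns to the left (counterclockwise),
    so sources appear shifted clockwise relative to the head.
    W and Z are unaffected by a rotation about the vertical axis.
    """
    w, x, y, z = bformat
    c, s = math.cos(head_yaw_rad), math.sin(head_yaw_rad)
    x_rot = [c * xi + s * yi for xi, yi in zip(x, y)]
    y_rot = [-s * xi + c * yi for xi, yi in zip(x, y)]
    return [list(w), x_rot, y_rot, list(z)]


def ambisonic_channel_count(order):
    """Number of channels of a full 3D Ambisonics signal of a given order."""
    return (order + 1) ** 2


if __name__ == "__main__":
    # A source directly to the left; the listener turns 90 degrees left,
    # so the source should now lie straight ahead (energy in X, none in Y).
    source = [1.0, 0.5, -0.25]
    field = encode_foa_horizontal(source, math.pi / 2)
    rotated = rotate_foa_yaw(field, math.pi / 2)
    print([round(v, 6) for v in rotated[1]])
    print([round(v, 6) for v in rotated[2]])
    print(ambisonic_channel_count(1), ambisonic_channel_count(3))
```

In a real system, the yaw angle would be read from the HMD tracker for every audio block, pitch and roll would be handled as well, and the rotated signal would then be decoded to binaural output with an HRTF set; those stages are omitted here.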

Conclusions
This paper reviewed the approaches used to assess the ecological validity of IVR for the perception of urban sound environments and the technologies needed during audio-visual reproduction to ensure ecological validity. The review qualitatively shows that immersive virtual reality techniques have the potential to contribute greatly as an ecologically valid tool in soundscape or noise assessments. The ecological validity of virtual reality to assess urban sound environments is multimodal, dynamic and contextual. The main conclusions of this work are as follows.

1. Through the approaches of laboratory tests including subjective response surveys, cognitive performance tests and physiological responses, the ecological validity of complex sound environment perception can be assessed for IVR. With participatory experiments in situ and in a laboratory, the veridicality of IVR can be verified through subjective responses including environmental preferences/quality, audio-visual indicators (e.g., pleasantness and annoyance), coupled interactions and reproduction quality (e.g., realism and immersiveness).

2. A head-tracking unit with a display and synchronized spatial audio (e.g., HMD with FOA-tracked binaural playback) is advantageous to assess ecological validity in immersive virtual environments. When the urban sound environment research involves interaction among multiple users, a CAVE system should be considered. With higher spatial resolutions, HOA also shows increasing potential for the ecological validity of IVR in urban sound environment research.
Beyond their individual outcomes, these studies on ecological validity and the evaluation methods they use also contribute towards a normalized framework for soundscape and noise assessment protocols. For standardized soundscape evaluation, the ISO/TS 12913 series should give more detailed guidelines and specifications on the establishment of an IVR system. In particular, to deliver a dynamic virtual experience, more research is needed on how soundscape perception is influenced by the Ambisonics order at the recording and reproduction stages and by issues such as the encoding and decoding of Ambisonics formats. The pursuit of a standardized soundscape evaluation protocol and IVR-based soundscape research can serve to enhance the field as a whole.
Author Contributions: All authors of this research contributed to its conceptualization and writing-review and editing. Methodology, C.X., H.T.; formal analysis, C.X., T.O., F.A.; writing-original draft preparation, C.X. All authors have read and agreed to the published version of the manuscript.