Measuring the Behavioral Response to Spatial Audio within a Multi-Modal Virtual Reality Environment in Children with Autism Spectrum Disorder

Abstract: Virtual Reality (VR) has been an active area of research in the development of interactive interventions for individuals with autism spectrum disorder (ASD) for over two decades. These immersive environments create a safe platform in which therapy can address the core symptoms associated with this condition. Recent advancements in spatial audio rendering techniques for VR now allow for the creation of realistic audio environments that accurately match their visual counterparts. However, reported auditory processing impairments associated with autism may affect how an individual interacts with their virtual therapy application. This study aims to investigate if these difficulties in processing audio information would directly impact how individuals with autism interact with a presented virtual spatial audio environment. Two experiments were conducted with participants diagnosed with ASD (n = 29) that compared: (1) behavioral reaction between spatialized and non-spatialized audio; and (2) the effect of background noise on participant interaction. Participants listening to binaural-based spatial audio showed higher spatial attention towards target auditory events. In addition, the amount of competing background audio was reported to influence spatial attention and interaction. These findings suggest that despite associated sensory processing difficulties, those with ASD can correctly decode the auditory cues simulated in current spatial audio rendering techniques.


Introduction
Autism spectrum disorder (ASD) is a lifelong neurodevelopmental disorder that affects approximately 1% of the worldwide population [1]. It is characterized by core symptoms that include impaired development in social interaction and communication, repetitive behaviors, restrictive interests, and sensory processing difficulties [2]. Autism is heterogeneous, and although these symptoms are commonly identified with the condition, the extent to which they affect everyday life is often unique to the individual [3]. With no cure for ASD, there is an increasing need for intervention programs that target appropriate behavioral and educational complications with the aim of improving the quality of life for the individual and their family [4].

The Importance of Spatial Audio
Larson et al. [18] suggest that the spatial properties of virtual auditory environments have been of significant importance since the introduction of stereo sound in the 1930s, stating that spatial sound is used to simulate an auditory reality that gives the user the impression they are surrounded by a 3D virtual environment. In terms of precision, the spatial acuity of hearing is somewhat inferior to that of vision and the somatosensory system when perceiving the environment [19]. However, it is not unimportant. Spatial hearing is a critical way in which the user perceives space, providing information about events and objects that are far beyond the field of view [20].
Two studies conducted by Hendrix and Barfield [21] investigated how three-dimensional sound influences the user experience within a stereoscopic virtual environment (VE). The first compared a silent VE to environments with spatial sound. The second was a comparison between two environments with spatialized and non-spatialized sound sources. The results indicated that spatialized audio significantly increased the feeling of presence, and that interactive sounds perceived as originating from their visual cues were judged to be more realistic.
To perceive the real world, the human brain must decode a constant stream of multi-modal information from the various sensory channels [22]. An example is face-to-face communication, in which the brain must interpret visual information such as moving lips and facial expressions while also listening to localized speech sound from the mouth [23]. This cross-modal synchronization can be echoed in VR by using 3D audio to simulate the shared spatial qualities of audio and visual stimuli that occupy the same time and space.

Spatial Audio Rendering
For healthy listeners, the human auditory system is adept at decoding how a sound wave diffracts and interacts with the head, torso, and pinnae, resulting in temporal and amplitude differences as well as spectral cues [24].
In terms of localization, interaural time and level differences (ITD and ILD) are the two major cues attributed to sound localization along the horizontal plane (see Figure 1). For low frequency sound sources (<1.5 kHz), the detection of ITD is the primary mechanism [25]. This frequency threshold is a result of the distance a sound wave must travel to reach the farther ear, causing the phase angle to differ upon arrival, with the corresponding angle being a function of both the wavelength and the source position. At frequencies below 1.5 kHz the wavelength is greater than the maximum path difference between the two ears, so this phase difference provides accurate location cues [26]. For higher frequency targets, ILD detection takes over from temporal differences as the primary horizontal localization mechanism [25]. Shorter, high frequency sound waves are more susceptible to the acoustic shadowing caused by the human head. It is this increased shadowing effect that produces a greater difference in sound levels between the two ears [27]. Localization within the vertical plane is calculated from the spectral shaping of acoustic energy by the torso and the unique and complex geometry of the pinna. Frequencies are distributed and consequently reflected, causing minute but significant delays. These delays act as a comb filter on sound entering the inner part of the ear, representing a three-dimensional function of the elevation of the sound source [29,30].
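As an illustration of the ITD mechanism described above, the following sketch uses Woodworth's spherical-head approximation, ITD = (r/c)(θ + sin θ). This is a standard textbook model and not a method used in this study. The head radius (0.0875 m) and the speed of sound (343 m/s) are assumed typical values, and the function names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at roughly 20 degrees C
HEAD_RADIUS = 0.0875    # m, assumed average head radius


def woodworth_itd(azimuth_deg, radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Approximate ITD in seconds for a distant source.

    azimuth_deg is measured from straight ahead and should lie in [-90, 90].
    """
    theta = math.radians(azimuth_deg)
    return (radius / c) * (theta + math.sin(theta))


def period_matching_frequency(radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Frequency (Hz) whose period equals the largest ITD (source at 90 degrees)."""
    return 1.0 / woodworth_itd(90.0, radius, c)


if __name__ == "__main__":
    for az in (0, 30, 60, 90):
        print(f"azimuth {az:3d} deg -> ITD {woodworth_itd(az) * 1e6:6.1f} microseconds")
    print(f"period equals max ITD at about {period_matching_frequency():.0f} Hz")
```

With these assumed values the largest ITD is roughly 0.66 ms, and the frequency whose period equals that delay falls close to the 1.5 kHz threshold quoted above.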
The realistic reproduction of three-dimensional auditory environments through headphones, known as binaural-based spatial audio, is dependent upon the correct simulation of the interaural differences and spectral shaping cues caused by the torso, head and pinnae. These cues are described by Head Related Transfer Functions (HRTFs) [31]. HRTFs are functions of both frequency and position in azimuth and elevation, containing direction-dependent information on how a free field sound source reaches the ear [32,33]. Binaural rendering can then be achieved by convolving a monaural anechoic sound source with the HRTF for the desired direction and the corresponding ear [32,34].
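The convolution step described above can be sketched as follows. This is a minimal illustration, not the rendering pipeline used in the study: the head-related impulse responses (the time-domain form of the HRTFs) are toy values that model only a delay and an attenuation, and all function names are hypothetical. A practical renderer would use measured impulse responses and FFT-based convolution.

```python
def convolve(signal, kernel):
    """Direct-form linear convolution of two sequences of floats."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def render_binaural(mono, hrir_left, hrir_right):
    """Return (left, right) channels for one source direction.

    Each ear's signal is the mono source convolved with that ear's
    impulse response for the chosen direction.
    """
    if len(hrir_left) != len(hrir_right):
        raise ValueError("left and right impulse responses must have equal length")
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


if __name__ == "__main__":
    # Toy source to the listener's left: the right ear receives the sound
    # two samples later (an ITD) and at half the amplitude (an ILD).
    mono = [1.0, 0.0, 0.0, 0.0]
    toy_hrir_left = [1.0, 0.0, 0.0]
    toy_hrir_right = [0.0, 0.0, 0.5]
    left, right = render_binaural(mono, toy_hrir_left, toy_hrir_right)
    print(left)
    print(right)
```

Because the two channels are computed independently, headphone playback preserves the interaural differences encoded in the impulse responses.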
Today most VR systems rely on head-tracked headphone-based systems (see Figure 2). Head tracking is essential to creating a plausible and realistic virtual audio scene. Current virtual reality headsets can provide stable multi-modal interaction through accelerometers, which can detect force and 360° orientation and measure the device's position within 3D space. Once the rendering system has received the X, Y, and Z directional information, it interpolates between the closest HRTFs, resulting in the perceived localization of a dynamic virtual sound source [35,36]. Furthermore, by reproducing the audio over headphones it is possible to isolate the left and right audio channels and successfully reproduce the binaural localization cues. However, there are several auditory processing difficulties associated with autism spectrum disorders that may affect how a user responds to auditory cues within a VR intervention environment.

Atypical Auditory Processing in Autism
Children with autism commonly experience complications in processing sensory information, with up to 65% of individuals with ASD displaying complications in the auditory domain [37-39]. There has been extensive research pertaining to auditory processing in ASD. Observations are diverse, ranging from enhanced abilities in musical pitch discrimination and categorization [40], to hypersensitivity to particular sound stimuli which evoke extreme negative behavioral reactions [41]. Of particular note, there have been several psychoacoustic studies which have investigated auditory scene analysis.
It has been reported that in the presence of competing background noise, those with autism often have difficulties in separating relevant auditory information (i.e., speech and target noises) into discrete auditory objects connected to the different sound sources within the environment [42]. These complications in processing sound information in noise have links to a deficit in auditory spatial attention, which is often exhibited as a failure to orient to speech and other social stimuli [43].
Soskey et al. [44] found that compared to typically developed (TD) controls, children with ASD showed a significant impairment in attending to sounds, in particular speech sounds.
During typical development, a 6-week-old infant can display a significant sensitivity to social stimuli, with particular focus on the features of the human face and speech [45,46]. Autistic children, on the other hand, have been reported to display atypical cortical responses to synthetic speech-like stimuli, culminating in reduced involuntary attention to social auditory events [47]. These early limited reactions to social stimuli represent the core social impairments in their earliest and most basic form and have negative impacts on future social and communication development [48].
In addition, Lin et al. [42] observed a reduced response to speech in noise in individuals with autism, alongside reduced discrimination for ITD and ILD. These binaural cues are processed by an area of the auditory pathway known as the medial superior olive [49]. Interestingly, postmortem examinations of the brainstem olivary complex in those with ASD have revealed neurological malformations resulting in a reduced size of the medial superior olive [50,51]. Despite this, people with autism spectrum disorders have been shown to perform as well as healthy controls in noise localization experiments across the horizontal plane [52]. Impaired sensitivity to ITD and ILD cues in autism has therefore been linked to deficits in selective hearing, which can affect orientation to target sounds within noise [42].

Design and Hypothesis
This study aims to examine the auditory spatial attention and sound localization ability of children and adolescents with ASD within a multi-modal virtual reality game environment. The therapeutic outcomes of VR interventions often rely on the immersion of the participant through place illusion [53]. This is particularly important when designing virtual interventions for individuals with autism [11], a population who have difficulties in generalizing newly learned skills to real-world applications. By conducting investigations within a VR environment similar to that which participants may experience within an intervention setting, test parameters may be precisely controlled whilst user interactions and behavioral models can be robustly recorded. Furthermore, the inclusion of dynamic visuals and game mechanics maintains participant immersion, allowing the examination of a pattern of behavior closer to that within a conventional virtual reality experience removed from experimental conditions [54]. Previous work by Wallace et al. [55] measured spatial presence within the 'blue room', an advanced collaborative virtual environment (CVE). The room makes use of visual images projected onto four screens located on three walls and the ceiling, with audio being delivered over a surround sound loudspeaker system. The paper reported that children with ASD showed as much attention to the virtual content as healthy controls and engaged with the virtual environment requirements. However, loudspeaker-based systems are often less effective at simulating the ITD and ILD cues needed for three-dimensional spatial audio reproduction. Speaker crosstalk causes significant signal interference between each speaker and the opposite ear of the listener, resulting in impaired localization performance [56]. In contrast, head-tracked headphone-based systems are more successful at recreating dynamic soundfields which respond to the listener's position within the virtual world [36].
This investigation followed a between-subjects design. Participants were randomly allocated to either a 3D spatial audio group or a control group. The 3D audio group would be exposed to spatial audio reproduced binaurally over headphones. The control group would listen to a monaural representation of the virtual auditory scene presented to both ears over headphones. Each participant group completed a two-phase experiment which first tested spatial audio attention, followed by a sound localization task, both within a multi-modal virtual reality environment (see Figure 3).
It is hypothesized that subjects exposed to binaural-based spatial audio would show more spatial attention to auditory stimuli compared with the mono sound control group, indexed through more precise head orientation towards the presented audio targets. Furthermore, spatial audio stimuli would influence participant positional movement within the virtual environment. Additionally, based on previous studies of auditory scene analysis deficits in ASD, it is predicted that localization times will be higher for audio cues with the greatest amount of competing environmental background noise. Finally, by examining spatial attention towards complex speech and non-speech sounds, it is possible to investigate if differences in spatial auditory attention are the consequence of a reduced orientation towards social stimuli.

Participants
The study group consisted of 29 children and adolescents (27 male and two female, mean age = 14, SD = 2.42, range of 9-19 years). Participants were recruited from two special education schools and one charity. All participants had a formal diagnosis of autism spectrum disorder obtained from their local national health trust, displaying social, cognitive, and motor functioning associated with moderate to high functioning autism. Exclusion criteria were self-reported hearing problems; physical disabilities that would limit movement around the experiment space; and an inability to finish the task. Two participants were excluded due to not being able to complete the full experiment. This study and its methods were approved by the University of York board of ethical approval, and an information package was provided to the participants' parent(s) or legal guardian(s). Participants were admitted into the study after informed consent and assent were obtained from their parent(s) or legal guardian(s).

Equipment
All audio and visual stimuli were rendered using the Oculus Rift CV1 head mounted display (HMD) with built-in on-ear headphones. Head tracking was also achieved using the HMD, with motion tracking of participant position calculated by the Oculus Rift sensors [57].

Stimuli
Visual Environment: The virtual environment used for this experiment represents an enchanted forest setting at night (see Figure 4). It has been noted that when designing visual aspects of virtual reality for ASD applications, the presence of visual clutter significantly affects user performance at completing tasks [58]. Therefore, the graphical environment was designed to be engaging without presenting visual elements that may have driven user attention away from the auditory stimuli. Furthermore, individuals with autism who also exhibit abnormalities in processing visual information can often attempt to avoid visual input such as bright lights [59]. The use of a darker environment reduces the possibility of anxiety-related behaviors that result from sensory over-stimulation [60].
The auditory stimuli are listed in Tables 1 and 2. All stimuli were recorded or edited using the Reaper digital audio workstation and the Wwise [61] game audio engine. Experimental groups were based upon audio rendering techniques. The 3D audio group would listen to binaural-based spatial audio rendered via the Google Resonance spatial audio SDK [62]. Audio was presented at a fixed playback level of 68 dBA. This was matched to a −20 dBFS RMS value of a pink noise test signal playing at a distance of 1 m from the listener.
Phase One: Eight spatial auditory events (see Table 1) were placed in the VE. All audio events were stationary with the exception of event 7, which moved around the entire environment over the course of 30 s. Previous research conducted by the authors has noted that target stimuli among competing audio could prompt negative effects on auditory spatial attention within VR environments for individuals with ASD [63]. This is further supported by research reporting that, in competing background noise, those with autism often display difficulties in separating relevant auditory information (i.e., speech and target noises) into discrete auditory objects connected to the different sound sources within the environment [42]. To minimize the effects of any complications in auditory scene analysis associated with ASD, environmental audio was reduced to −43 LUFS while each Phase One event was rendered.
Phase Two: Eight spatial audio events representing virtual characters were placed throughout the VE. In terms of type, audio stimuli were divided into simple and social types. Simple stimuli consisted of a bell sound effect, with the social stimuli representing a speech-like sound. All events were stationary except for movement towards the player once the character was found.
A spatialized composite background audio track was provided which was derived from several discrete auditory objects placed throughout the VE, matching the visual scene. This included sounds representing wind, water, frogs, birds, crickets and wind chimes.
The level of the background sounds was either ≥−43 LUFS or ≥−53 LUFS. These levels were taken from a past study that evaluated the effects of background noise on spatial attention in VEs with participants with ASD [63]. Background audio levels were reduced using an audio object-based hierarchical model, a simplified version of a technique designed by Ward et al. [64] to increase the accessibility of complex audio content for those with hearing impairments. Under this model, the lower level was obtained by removing the bird, cricket, and wind chime ambient sound objects; the aim of this process is to maintain participant immersion.
To further investigate auditory detection in noise, target stimuli were played at differing signal-to-noise ratios (SNR). For auditory events with +10 SNR, the stimulus level would be 10 LUFS higher than the background noise level. For auditory events with 0 SNR, the stimulus would be set at an equal level to that of the background noise.
Table 1 (fragment; only the final rows survive). Columns: event, stimulus, visual accompaniment.
6: Crow sound effects; None
7: Explosion & Speech-like stimuli; Present
8: Explosion & Speech-like stimuli; None
Table 2. Phase Two Auditory Stimuli. SNR represents either the stimulus played at +10 LUFS above the background audio level, or set to a level equal to that of the background audio.
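The level arithmetic behind the background and SNR conditions can be sketched as follows. The two background levels and the two SNR offsets are taken from the text; the assumption that they were fully crossed with the two stimulus types is ours (it would account for the eight Phase Two characters, but the contents of Table 2 are not available here). Function and variable names are hypothetical.

```python
from itertools import product

BACKGROUND_LEVELS_LUFS = (-43.0, -53.0)  # from the text
SNR_OFFSETS_LU = (10.0, 0.0)             # from the text
STIMULUS_TYPES = ("simple", "social")    # from the text


def stimulus_level(background_lufs, snr_lu):
    """Target stimulus level: the background level plus the SNR offset."""
    return background_lufs + snr_lu


def level_difference_to_gain(diff_lu):
    """Linear amplitude ratio for a level difference in loudness units (dB)."""
    return 10.0 ** (diff_lu / 20.0)


# Assumed full crossing of the three factors (2 x 2 x 2 = 8 conditions).
conditions = [
    (kind, bg, snr, stimulus_level(bg, snr))
    for kind, bg, snr in product(STIMULUS_TYPES, BACKGROUND_LEVELS_LUFS, SNR_OFFSETS_LU)
]

if __name__ == "__main__":
    for kind, bg, snr, level in conditions:
        print(f"{kind:6s} background {bg:6.1f} LUFS, SNR +{snr:4.1f} -> stimulus {level:6.1f} LUFS")
    print(f"+10 LU corresponds to an amplitude ratio of {level_difference_to_gain(10.0):.2f}")
```

Note that a +10 stimulus over the quieter background and a 0 SNR stimulus over the louder background share the same absolute level (−43 LUFS), so under this assumed design the two factors can be separated in analysis.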

Procedure
Participants were permitted to move freely around a pre-defined tracked experimental space of 1.6 m × 1.6 m while wearing the HMD. The combination of dynamic head and positional tracking allowed movement around the virtual environment with 6 degrees-of-freedom.
Throughout the experiment a support worker would be present to communicate with the participant and provide assistance if they became distressed. However, they were not permitted to deliver instructions. Prior to the start of the session, the experimenter would explain the use of the virtual reality system, the structure of the experiment and place the HMD on the participant in order for them to become familiarized with the device.

Phase One-Free Exploration and Spatial Audio Attention Testing
Prior to the main experimental task in Phase One, subjects were introduced to the virtual reality equipment and environment for a period of two minutes with minimal environmental audio and visuals. The experimenter explained to the participant that they were free to move around and explore the VE. This brief initial period allowed any emotional excitement caused by using virtual reality to settle. The participant also received support from the experimenter to become familiar with the virtual environment. Following this, participants were given a further five minutes of free exploration during which they would be exposed to eight auditory events played once at pre-determined times, four of which had visual accompaniments and four of which had none (see Table 1). This period would be used to record participant head rotation and horizontal tracked positional data. Once again the experimenter told the participant that they were free to move around and explore the VE. After this, verbal communication with the participant was minimized to avoid any influence that might guide exploration during the experiment.
An accuracy metric (a) was used to evaluate participant performance during the spatial audio attention task. This was quantified by calculating the relative difference between the participant head rotation on the azimuth axis (y) and the target position of the reproduced stimulus (n) at 100 ms intervals.
Further data recorded the target azimuth location in degrees of each spatial auditory event with the respective start and end times, and virtual environment positions. Higher values of a during auditory stimulus playback time would represent greater spatial attention towards presented auditory targets.
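The exact formula for the accuracy metric is not reproduced in this text. The sketch below shows one plausible form consistent with the description: the wrapped angular difference between head azimuth (y) and target azimuth (n) is normalized by 180° and subtracted from 1, so that higher values indicate closer orientation, and the result is averaged over the 100 ms samples recorded during playback. This normalization is an assumption, as are the function names.

```python
def angular_error(head_deg, target_deg):
    """Smallest absolute angle between two azimuths, in the range [0, 180]."""
    diff = (head_deg - target_deg + 180.0) % 360.0 - 180.0
    return abs(diff)


def accuracy(head_azimuths, target_azimuth):
    """Assumed accuracy score a in [0, 1], averaged over playback samples.

    1.0 means the head pointed exactly at the target in every sample;
    0.0 means it pointed directly away in every sample.
    """
    if not head_azimuths:
        raise ValueError("no head-rotation samples recorded during playback")
    scores = [1.0 - angular_error(y, target_azimuth) / 180.0 for y in head_azimuths]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical samples at 100 ms intervals: the head turns towards a target at 90 degrees.
    samples = [0.0, 30.0, 60.0, 80.0, 90.0, 90.0]
    print(round(accuracy(samples, 90.0), 3))
```

Wrapping the difference before taking its magnitude avoids penalizing a head azimuth of 350° against a target at 10°, which are only 20° apart.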
To better understand whether spatial audio influences subject behavior, tracked horizontal-plane positional data within the VE, represented by x and z, was also collected at 100 ms intervals. A distance score was then calculated by measuring the distance between the participant and each of the presented auditory stimuli during playback time. Smaller distance values during auditory stimulus playback time would indicate movement towards the presented stimuli, and therefore an influence on participant movement.
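A minimal sketch of the distance score follows, assuming a Euclidean distance in the horizontal (x, z) plane averaged over the samples that fall within an event's playback window. The averaging step, the half-open window, and the function names are assumptions made for illustration.

```python
import math

SAMPLE_INTERVAL_S = 0.1  # positions were logged at 100 ms intervals


def playback_window(samples, start_s, end_s, step_s=SAMPLE_INTERVAL_S):
    """Return the samples recorded from start_s up to (not including) end_s."""
    first = round(start_s / step_s)
    last = round(end_s / step_s)
    return samples[first:last]


def distance_score(track_xz, source_xz):
    """Mean horizontal distance (m) between tracked positions and a sound source."""
    if not track_xz:
        raise ValueError("no positional samples in the playback window")
    sx, sz = source_xz
    distances = [math.hypot(x - sx, z - sz) for x, z in track_xz]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    # Hypothetical track: the participant walks along x towards a source at (2.0, 0.0).
    track = [(0.05 * i, 0.0) for i in range(40)]
    during_event = playback_window(track, start_s=1.0, end_s=3.0)
    print(round(distance_score(during_event, (2.0, 0.0)), 3))
```

Comparing this score between the 3D audio and mono groups for the same event indicates whether the spatialized rendering drew participants closer to the virtual source.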

Phase Two-Spatial Audio Localization and Background Noise Testing
Participants took part in a localization task which required them to locate a total of eight hidden characters within the virtual world based on localizing sounds that they emitted and limited visual representations. Each would play either social or simple stimuli (see Table 2).
The experimenter first explained the task, clarifying that the participant must use their ears to find each character. To help the participant become familiar with the target stimuli, they were presented with an example character in their field of view which played both social and simple auditory stimuli. Following this, the experimenter would repeat the explanation of the task and activate each hidden character, one at a time, when the participant was comfortable to proceed.
Once again, support worker communication was restricted so as to avoid an effect on the outcome of the investigation. The length of this phase would be dependent on the participant's ability to progress. When each virtual character was found, the participant would be given positive reinforcement in the form of verbal praise.
Localization times of auditory stimuli within the virtual environment during Phase Two were recorded for each participant for all eight virtual characters. Further to group comparisons, exploratory analyses of participant reaction time also examined the relationship between the amount of background ambient noises and spatial attention. To do so, localization times for each character were compared based on stimulus type, background audio level and SNR.

Phase One: Spatial Audio Attention Testing
A two-way mixed ANOVA was conducted to investigate the impact of 3D audio and auditory stimuli on the accuracy metric during Phase One. A comparison between test groups yielded a significant difference (F(1, 26) = 29.43, p ≤ 0.001, ηp² = 1.642) across all auditory events, with the 3D audio test group (M = 0.661) achieving higher accuracy scores than the mono audio controls (M = 0.487). Figure 6 shows a comparison of group mean accuracy scores with 95% confidence intervals for each event. In addition, there was a significant effect of stimulus type on the spatial attention of both groups (F(7, 175) = 7.657, p ≤ 0.001, ηp² = 0.227), with a significant interaction between group and stimulus type (F(7, 175) = 2.426, p = 0.021, ηp² = 0.072). Finally, a paired-samples t-test showed that visual accompaniment had no significant effect on the accuracy of spatial attention in the spatial audio group (visual accompaniment: M = 0.627, SD = 0.187; no visual accompaniment: M = 0.695, SD = 0.184; t(55) = −1.903, p = 0.062). However, there was a significant difference between visuals (M = 0.543, SD = 0.235) and no visuals (M = 0.431, SD = 0.176) within the mono control group; t(51) = −2.827, p = 0.007.
To better understand the behavioral response to spatial audio within the virtual environment, tracked positional data across the two groups was also analyzed. The tracked positional coordinates of all participants on the horizontal axis were collected throughout the entirety of Phase One and are visualized as 2-dimensional density plots in Figure 7. The figure also displays the directions and distances of all auditory events inside and outside of the tracked area, and shows differences in the extent of exploration depending on the auditory condition. Although both groups show a concentration of exploration around the center of the tracked space, the 3D audio group shows a distribution of exploration in the direction of the majority of auditory targets. Further analysis of the distance between each participant and the auditory target using a two-way mixed ANOVA also yielded a significant difference between the auditory conditions (F(7, 175) = 23.111, p ≤ 0.001, ηp² = 0.308), with participants within the 3D audio group moving closer towards the virtual sound sources during their respective playback times.

Phase Two: Spatial Audio Localization and Background Noise Testing
Figure 8 shows the mean localization times for both experiment groups across all auditory events in Phase Two, with shorter localization times in the 3D audio condition (M = 8.4 s) than in the mono control condition (M = 17.8 s). A two-way mixed ANOVA showed that these differences in localization times were significant (F(7, 175) = 7.565, p ≤ 0.001, ηp² = 0.232).
Figure 9. Estimated means data from hierarchical linear model, comparing the effect of background audio, signal-to-noise ratio, and stimulus type for binaural spatial audio and mono control groups. The whiskers denote 95% confidence intervals.

Phase One: Spatial Audio Attention Testing
Throughout Phase One, subjects in both groups displayed overall attention towards the presented auditory targets; however, those within the 3D audio group showed higher levels of head orientation in the direction of audio events. These results demonstrate that despite reported impairments in interpreting ITD and ILD binaural cues associated with autism [65], spatial audio was capable of accurately attracting participants to the areas of the virtual environment corresponding to the auditory stimulus's location.
Individuals with autism have been recorded as having similar evoked brain responses to speech-like stimuli as their typically developed peers, but at the same time exhibiting less involuntary attention towards them [66,67]. With this in mind, it would be expected that under a free exploration paradigm that measures involuntary attention, accuracy scores for speech stimuli would be the lowest. However, the findings from Phase One of this experiment are contrary to this for both condition groups, with the auditory event with the lowest scores being the crow (event 6). This auditory stimulus type would appear to be similar to the ambient environmental audio track, and despite the lack of competing audio the event was not unique enough to warrant participant attention. In the case of attention towards speech-like sounds in Phase One, these events are a combination of explosion and non-human speech-like stimuli, designed to create a unique audio cue that attracted attention towards them. Furthermore, studies have observed hyperactivity in the brain in response to novel auditory stimuli in children on the autistic spectrum [68].
Research by Lokki and Grohn [69] revealed that despite audio cues being less accurate for sound source localization than accompanying visual cues, within a virtual environment audio-only cues are almost as effective as visual-only cues. In addition, when objects were placed outside of the participant's field of view or occluded by other visual objects, audio-only cues were more successful in the preliminary phase of object localization. This could explain the significant effect visual accompaniments had on the overall accuracy scores for participants in the mono audio group: the visual representations would compensate for the spatial inaccuracies of the audio rendering. This would also account for the similar scores present in both groups for event three (the bubble sound), which incorporated visuals that spanned a large area of the VE and would attract user attention once they entered the participant's field of view.
Exploration of the virtual environment is also illustrated in the positional data density plot (see Figure 7). For both conditions, tracked horizontal movement was concentrated close to the starting position of (x, y) = (0, 0). Nonetheless, participants listening to spatialized audio displayed clear movement in the direction of both the audio-only and audio-visual cues. Due to the grouping of both audio-visual and audio-only events, it cannot be confirmed which events encouraged participant exploration in those areas. However, the reduced amount of tracked data in the direction of event three does suggest that the larger visual stimuli negated the need for participant exploration of the VE.

Phase Two: Spatial Audio Localization and Background Noise Testing
Alongside comparing localization times between auditory test conditions, Phase Two of this study was carried out to determine if reported impairments in decoding speech when background noise is present would have quantifiable effects on how participants would respond to similar stimuli rendered via virtual spatial audio within a VR environment. Considering the heterogeneity of disorders in the autism spectrum present within the sample group, it is encouraging to see that all participants were capable of performing this task without any particular difficulties. However, the differences in performance between participants within the experimental groups may also explain the low effect size yielded by the two-way mixed ANOVA.
Firstly, results in Phase Two showed those in the 3D audio group performing significantly better in the localization task. These results are consistent with similar research conducted with neurologically healthy participants, reporting that three-dimensional auditory displays can reduce localization times and improve audio spatial attention in VR [70]. Furthermore, these results provide more evidence that those with ASD are capable of successfully interpreting binaural information to take advantage of the spatial acuity provided by spatial audio rendering techniques.
With regard to the effect of background levels on selective hearing, results showed that localization times were significantly longer when the level of background noise increased and/or the SNR between the auditory target and ambient noise decreased. Furthermore, the type of stimuli also had a noticeable impact on localization times, with participants taking longer to correctly locate speech-like stimuli. These data are comparable to evidence of speech processing deficits in children with ASD in non-spatialized audio test conditions [44,52,67]. Although the results from the mono control group would be sufficient for evaluating the effects of background noise on the response to target audio, mono audio is rarely used within virtual reality environments. Therefore, for this study it was important that competing background audio had a significant effect on both experiment conditions. Individuals with ASD tend to demonstrate elevated speech perception thresholds, poor temporal resolution and poor frequency selectivity when compared to their typically developed peers [71]. The poor frequency selectivity alone could account for poor localization along the vertical plane [52], as this is primarily attributed to the analysis of the spectral composition of the sound arriving at the ear [49]. However, a combination of all three alongside a diminished ability to correctly translate ILD and ITD sound cues would account for poor selective hearing leading to difficulties with behavioral recognition of auditory targets in background noise [42,71].

Limitations and Future Work
One limitation of this study is the lack of a comparable typically developed control group. Comparison data may have highlighted any potential differences in performance between participants with autism and neurologically healthy counterparts. Nonetheless, participants in this study still showed higher spatial attention towards binaurally rendered spatial audio, demonstrating quantifiable evidence of audio spatial interactions.
Another important consideration is the use of a non-individualized HRTF database within this study. This database incorporates anechoic recordings using the Neumann KU-100 dummy head simulator, which is based upon average dimensions of the human head and ears [28]. HRTFs differ greatly between individuals, due to the unique shape of the pinnae, head, and torso. Therefore, the use of a generalized HRTF can sometimes result in localization confusion and reduced externalization of target auditory events within a virtual environment [72,73]. In addition, recent research that compared individualized to non-individualized HRTFs did observe higher localization errors when using a generic database [74]. This may provide some explanation for the varying accuracy and localization times between participants within the 3D audio experiment group, as well as the lower effect scores yielded in the statistical analysis. However, there are significant challenges involved in obtaining individualized recordings, which warrant the use of generic dummy head recordings in the development of virtual reality soundscapes [75].
Future investigations could also build upon the work carried out by Wallace et al. [55] and measure self-reported presence within virtual environments that use spatial audio rendering techniques. Changes in behavior, spatial focus, and interactions when altering the audio reproduction could be compared. Furthermore, studies could be conducted to measure whether removing aspects of the background audio has any negative effects on the feelings of presence reported by the participants.

Conclusions
Alongside realistic graphical rendering and natural approaches to computer interaction, spatialized audio can be used to significantly enhance presence and improve user interaction within a virtual reality environment [76]. In terms of clinical applications within autism research, presenting a more realistic experience would benefit interventions such as the treatment of phobias and vocational training. By matching the visual and auditory sensory experiences of the real world, users will have a greater chance of generalizing newly acquired skills into their everyday lives, therefore increasing the possible positive outcomes of VR-based therapy [77].
To date, this study is the first to assess how those with autism respond to virtual spatial audio within a virtual reality game environment. The experiment involved 29 individuals diagnosed with autism being randomly allocated to 3D audio and mono control groups. All participants completed two experimental phases. The first compared spatial attention towards auditory stimuli by recording head rotation accuracy and horizontal movement towards presented audio. In Phase Two, participants were required to localize four social and four simple stimuli among varying levels of background audio. Results collated from head rotation and tracked positional data during the experiment do suggest that despite reported auditory processing difficulties, children with autism can make use of auditory cues rendered by spatial audio. In addition, the amount of background audio can have significant effects on the localization of auditory stimuli within a VE. These findings could provide valuable insight to those designing virtual reality interventions for the autistic population. Developers should make use of binaural-based spatial audio rendering approaches similar to those used in this study to increase the ecological validity of the virtual environment and to deliver important information via auditory cues.
This study has shown that previously reported difficulties in auditory scene analysis associated with autism do extend into the realms of virtual reality and binaural-based spatial audio. The amount of competing background audio can have an effect on spatial attention towards virtual sound targets and should therefore be taken into consideration when designing three-dimensional virtual acoustic environments for ASD interventions.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: