A Virtual Reality System for Practicing Conversation Skills for Children with Autism

Abstract: We describe a virtual reality environment, Bob's Fish Shop, in which users diagnosed with Autism Spectrum Disorder (ASD) can practice social interactions in a safe and controlled setting. A case study is presented which suggests that such an environment can give users the opportunity to build the skills necessary to carry out a conversation without the fear of negative social consequences present in the physical world. Through the repetition and analysis of these virtual interactions, users can improve social and conversational understanding.


Introduction
One in 59 American children is diagnosed with Autism Spectrum Disorder (ASD) [1]. While therapeutic supports exist (e.g., applied behavior analysis and cognitive-based therapy), they are costly and not always accessible due to geographic gaps in coverage or lack of available insurance funding. Assistive technology offers an alternative or supplementary approach to skill acquisition.
Children with ASD may experience difficulty with communication and behavior and often have social impairments [2,3]. Many researchers have focused their studies on communication training to address such challenges [4,5]. Some studies suggest that improved communication skills may lead to "improvements in daily living and social skills, and a reduction in behavior problems relating to social interactions for children with ASD" [2].
In the United States alone, researchers estimate that the total cost per year for a child with ASD is approximately $17,000 more than for a child without ASD [6]. These costs include medical care for the children and special education programs and therapies, as well as accounting for loss in parent work productivity. Medical expenses for children with ASD are estimated to be 4.1-6.2 times greater than the expenses for those who do not have a diagnosis [7]. The current intervention paths, although effective, are often unattainable for families.
Assistive technology can offer a means to practice skills that is inexpensive, less time-consuming, and more scalable. Not only can assistive technology help children by allowing them to practice lessons outside of therapy, but it may also help professionals by providing data regarding behavioral and communication skills. Therefore, we designed and implemented Bob's Fish Shop, a virtual reality (VR) environment to help children develop social and conversational skills while also providing the script output for professionals to study. This paper presents the architecture of our system, which integrates gaze tracking and voice processing, in order to demonstrate the feasibility of building and using a VR-based assistive technology to help users with neurodiverse backgrounds practice conversation skills.
Bob's Fish Shop is an immersive virtual reality experience designed to help children with ASD practice typical social interactions and conversational skills. Implemented in Unity 3D and designed for the Oculus Rift VR headset (Figure 1), Bob's Fish Shop aims to develop social and conversational etiquette while children engage in a safe and supportive environment. In addition to verbal conversation, the game provides opportunities to practice nonverbal communication, such as responding to waving (Figure 2), and joint attention skills, such as referencing a person or an object of shared interest with the eyes. Video demonstrations of the system are available at https://github.com/mlat/vrpaper.
Though the gameplay of Bob's Fish Shop is simple, based largely on short interactions supported by text scripts, the underlying architecture of the game requires integration of several technologies. In addition to the VR itself, the game uses voice recognition to engage the user with the virtual shopkeeper, estimates joint attention based on the player's center of focus in the virtual reality environment (VRE), and incorporates rule-based artificial intelligence to guide transitions throughout the game.

System Development
We conceptualized the system based on the experiences of the research team, which includes a Board-Certified Behavior Analyst (BCBA) with 20 years of clinical experience. Additionally, behavior interventionists from a local ASD treatment clinic were consulted in order to design a virtual reality scenario that would be appealing to our target audience. This resulted in a simple scenario based on the interaction required for a person to successfully interact with the proprietor of a pet shop. We then devised system requirements from previous empirical work and from our concept in order to build the system. As currently built, the game leverages Maya (3D modeling and animation software), Unity, C# scripting, and external voice recognition software.

Scenario
The user begins in their virtual home, then leaves it to enter the virtual fish shop. Once inside, the user examines the contents of the shelves and forms an idea of the items they would like to purchase. They then engage the shopkeeper, Bob, with their gaze, signaling that they are ready for a social interaction. Bob waves, introduces himself, and offers his assistance to the user, who is the customer in the shop. Bob and the user then have a conversation about which items the user would like to purchase.
We chose this use case because this type of conversation happens every day. For example, whether one is at a store purchasing items, at a restaurant ordering dinner, or at home telling a parent or caretaker what they would like to do the upcoming weekend, the applications of this particular social interaction are endless. Having the ability to express one's needs and desires is a skill used every day.

Implementation
The VRE comprises five primary software modules: a staging script, a vision processing script, a voice processing script, a data archive script, and character animations. A diagram of the software is shown in Figure 3. The staging script component handles each of the possible stages of the current social interaction. It also calls the vision processing script component and the voice processing script component to receive information from the user's gaze and voice inputs.
The vision processing script component tracks where the user is looking. This is then documented in a text file, so professionals can see where the user was looking and whether that was the expected place for a given exchange. An example of the output is presented in Figure 4.
Within the voice processing script component, the system calls an external application that processes the user's voice and sends the translated text back to the voice processing script component. The external voice-to-text application runs locally as a web service on the same physical machine as the other system components and is easily invoked using standard C# capabilities. Because the responsiveness of the system is critical for the user experience, no additional pre-processing is carried out on the audio received from the user. Instead, we make use of an industry-grade microphone to capture audio, which we have found to produce good enough results in practice that further audio cleanup is not necessary. Once the process is complete, the voice processing component parses the text and tells the staging script component which stage to transition to.
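The request and response cycle between the voice processing component and the local voice-to-text service can be illustrated with a minimal sketch. It is written in Python for brevity and is not the authors' C# code. The endpoint URL, the raw-bytes request body, and the JSON reply with a "text" field are all assumptions, since the paper does not describe the service's interface.

```python
import json
import urllib.request
from typing import Callable

# Hypothetical endpoint: the paper does not give the service's address or API.
SERVICE_URL = "http://localhost:8080/transcribe"


def http_post(url: str, body: bytes) -> bytes:
    """Send raw audio bytes to the service and return the raw response body."""
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.read()


def transcribe(audio: bytes, post: Callable[[str, bytes], bytes] = http_post) -> str:
    """Return the recognized text for one utterance.

    Assumes the service answers with JSON such as {"text": "..."}; the real
    response format is not described in the paper.
    """
    reply = json.loads(post(SERVICE_URL, audio).decode("utf-8"))
    return reply.get("text", "").strip()
```

Passing the transport in as a parameter keeps the sketch testable without a running service, and the audio is forwarded unmodified, in line with the decision to skip pre-processing.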
The data archive script component documents data that are valuable from the interaction between the user and Bob. The information is transferred to this component from the vision processing script and the voice processing script.
The character animations component, which controls the shopkeeper's movements and actions throughout the scene, uses the trigger from the voice processing script and performs the appropriate animations based on the user's response.

Functionality
The user can perform three basic actions: walk into the room, look around the room, and converse with Bob by voice. With the use of the Oculus Rift SDK, we integrated Unity with Oculus functionalities. By replacing the main camera in the scene with a camera provided by the Oculus SDK, the user can move around the scene as if they were inside the virtual reality world.
Being able to track eye contact was a priority for this project. Tracking the location of the user's eyes in the scene allows the software to record where the user's visual attention is during the span of the conversation.
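One way to estimate which object holds the user's visual attention from the headset's center of focus is to compare the view direction with the direction to each object of interest. The sketch below (Python, for illustration only) picks the object with the smallest angular offset. The function names, the 10-degree tolerance, and the example coordinates are assumptions and are not taken from the system itself.

```python
import math
from typing import Dict, Optional, Tuple

Vec3 = Tuple[float, float, float]


def _normalize(v: Vec3) -> Vec3:
    """Scale a vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    if length == 0.0:
        raise ValueError("zero-length vector")
    return (v[0] / length, v[1] / length, v[2] / length)


def angle_to(head: Vec3, forward: Vec3, target: Vec3) -> float:
    """Angle in degrees between the view direction and the direction to target."""
    to_target = _normalize(
        (target[0] - head[0], target[1] - head[1], target[2] - head[2])
    )
    fwd = _normalize(forward)
    dot = sum(a * b for a, b in zip(fwd, to_target))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))


def gaze_target(
    head: Vec3,
    forward: Vec3,
    targets: Dict[str, Vec3],
    max_angle: float = 10.0,  # assumed tolerance, in degrees
) -> Optional[str]:
    """Name of the target closest to the center of view, or None if none is near."""
    best_name, best_angle = None, max_angle
    for name, position in targets.items():
        angle = angle_to(head, forward, position)
        if angle <= best_angle:
            best_name, best_angle = name, angle
    return best_name
```

For example, with the head at (0, 1.5, 0) looking along (0, 0, 1) and Bob standing at (0, 1.5, 3), the function returns "Bob". In a game engine the same question is often answered with a ray cast from the camera instead of an angle test.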
Since a conversation with Bob is the main functionality of this software, we recorded many variations of this interaction. In consultation with a BCBA, we created a baseline script, which mapped out an example conversation that the user and Bob could have at his store. Then, a few other variations of those phrases were made to prevent unnatural repetition during the conversation. When all of Bob's possible lines were created for the baseline example, a professional voice artist recorded all of the lines in a studio and a voice engineer cleaned up the recordings so we could use each individual line in our Unity project.
Once the voice recording files were clean and ready to be used in the scene, the digital artist animated Bob's mouth and body expressions to make it seem like he was speaking the words on the voice recordings. The eye contact scripts were also applied to Bob. The figures and animations were then added into the Unity project so the items in the shop were actual objects in the scene.
After the animations and 3D digital figures were added to the scene and the appropriate scripts were attached to trigger those animations, the different conversation stages were added. The staging script component in Unity handled the transitions from one part of the conversation to another. Based on the user's responses to Bob's interactions, the script either transitions out of the stage it is in, or it repeats the current stage.
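The transition rule described above, in which the conversation either advances or repeats the current stage, can be sketched as a small lookup table. The stage names and the particular transitions are hypothetical, since the paper does not list the actual stages; only the advance-or-repeat behavior is taken from the text.

```python
from enum import Enum, auto
from typing import Dict, Optional


class Stage(Enum):
    # Hypothetical stage names; the paper does not enumerate the real ones.
    GREETING = auto()
    CHOOSE_ITEM = auto()
    CONFIRM = auto()
    GOODBYE = auto()


# For each stage, the recognized responses that move the conversation on.
TRANSITIONS: Dict[Stage, Dict[str, Stage]] = {
    Stage.GREETING: {"yes": Stage.CHOOSE_ITEM, "no": Stage.GOODBYE},
    Stage.CHOOSE_ITEM: {
        "red fish": Stage.CONFIRM,
        "blue fish": Stage.CONFIRM,
        "food": Stage.CONFIRM,
        "castle": Stage.CONFIRM,
    },
    Stage.CONFIRM: {"yes": Stage.GOODBYE, "no": Stage.CHOOSE_ITEM},
    Stage.GOODBYE: {},
}


def next_stage(current: Stage, response: Optional[str]) -> Stage:
    """Advance when the response is recognized; otherwise repeat the stage."""
    return TRANSITIONS[current].get(response, current)
```

Keeping the transitions in a table rather than in branching code would make it easier for a clinician to review or extend the conversation flow.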
During each of these stages, the voice processing script component collects speech from the user and processes it to determine what sequence of events should happen throughout the game. Within each stage class, an external application is called to pick up the voice of the user as an input. Then, using a voice recognition library, the user's voice is translated into text and sent back to the Unity software to be processed. The full statements said by the user are then recorded to a text file to be analyzed by parents and professionals using the data archive script component.
After that text is printed to the file, it is parsed for specific hot words that lead to the transitioning of the conversation. Some of these hot words include "food", "castle", "red fish", "blue fish", "yes", and "no". Over time, as the user continues to interact with the virtual world, their use of specific words is used to probabilistically determine the response received from Bob. This provides a simple mechanism for adding variety to the interactions in the system as well as encouraging the user to try different approaches in their conversation with Bob.
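A minimal sketch of hot-word detection and usage-weighted response selection follows. The hot words are those listed above. The whole-word matching, the two response variants, and the weighting rule are assumptions made for illustration, because the paper does not specify how the probabilities are computed.

```python
import random
import re
from collections import Counter
from typing import Dict, Optional

HOT_WORDS = ("red fish", "blue fish", "food", "castle", "yes", "no")

# Match whole words only, so that "know" or "not" does not count as "no".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in HOT_WORDS) + r")\b",
    re.IGNORECASE,
)


def find_hot_word(utterance: str) -> Optional[str]:
    """Return the first hot word in the utterance, lowercased, or None."""
    match = _PATTERN.search(utterance)
    return match.group(1).lower() if match else None


def variant_weights(times_used: int) -> Dict[str, float]:
    """Weights for Bob's reply variants (hypothetical names).

    Assumed rule: the more often the user has already said a hot word, the
    more likely Bob is to suggest trying something different.
    """
    return {"standard": 1.0, "suggest_other": 0.5 * times_used}


def choose_variant(hot_word: str, usage: Counter, rng: random.Random) -> str:
    """Pick a reply variant and record one more use of the hot word."""
    weights = variant_weights(usage[hot_word])
    usage[hot_word] += 1
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
```

Under these assumed weights, the first use of a hot word always yields the standard reply, and repeated use makes the alternative reply increasingly likely.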
While this process is occurring, a timer keeps track of how long the user takes to respond to Bob. The time is also recorded to the text file. An example of the final output format is shown in Figure 4.
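The timing and archiving step can be sketched as follows. The record fields mirror the three recorded behaviors (gaze, response time, and utterance), but the tab-separated layout and all names are assumptions; the actual output format appears only in the paper's figure.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


class ResponseTimer:
    """Measures the delay between Bob's prompt and the user's reply."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._started: Optional[float] = None

    def start(self) -> None:
        """Call when Bob finishes speaking."""
        self._started = self._clock()

    def stop(self) -> float:
        """Call when the user's reply is received; returns elapsed seconds."""
        if self._started is None:
            raise RuntimeError("timer was not started")
        elapsed = self._clock() - self._started
        self._started = None
        return elapsed


@dataclass
class Exchange:
    stage: str             # hypothetical stage label
    gaze_target: str       # where the user was looking
    response_seconds: float
    utterance: str         # full statement said by the user

    def to_line(self) -> str:
        """One tab-separated line for the archive text file (assumed layout)."""
        return (
            f"{self.stage}\t{self.gaze_target}\t"
            f"{self.response_seconds:.2f}\t{self.utterance}"
        )
```

A monotonic clock is used so that the measured delay is not affected by adjustments to the system clock during a session.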

System Validation
In order to provide basic validation of our system, we carried out a small technology probe with potential users. Our technology probe was inspired by the method introduced by Hutchinson and colleagues [22]; we used it to test our design and gather feedback from our users. The user study consisted of one exploratory session, conducted in a university research lab. Two children participated in the user study: a six-year-old female diagnosed with ASD and a seven-year-old male diagnosed with Attention Deficit Hyperactivity Disorder (ADHD). The children were accompanied by their mothers, for supervision as well as to assist in the data collection process.

Data Collection
The children were immersed in the VRE for approximately 15 nonconsecutive minutes. During the study, each user wore the headset to navigate through Bob's Fish Shop, and they were able to communicate with one another as well as with their mothers during their experience. Throughout the study, data were collected through observation and interview questions. Interview questions included: "What do you see?" and "What are you looking at?"

Analysis
Once the user study was completed, researchers collected and examined all field notes and interview questions using a qualitative approach. Open coding [23] and discussions among the researchers were used to discover the emergent themes specific to the system. The main focus was to determine the feasibility and acceptability of our system in the ASD community.

Results
The study was a positive experience for the users, and minimal training was required to use the VR headset. The users quickly discovered the immersive nature and malleability of the system, while also interacting socially in the physical world.

Immersive
We specifically designed the VRE to be similar to cartoon animations because we wanted the characters and gestures to be familiar to the users.
"Wow, it's like being in a cartoon!" (s1, child with ASD)
The primary use of a VR headset is to immerse the user completely in a new virtual world separate from our physical world, and it was evident that the user easily remained engaged because of this design. The users also expressed their surprise at how interesting and fun Bob's Fish Shop was, and they wanted to continue the session beyond the 15 minutes. Furthermore, our results were congruent with other recent studies in that cybersickness was not a concern with the Oculus Rift VR headset, and the users reported a pleasant experience [24,25].

Malleable
An important aspect of VREs is the ability to create whatever type of environment is desired. Interestingly, the children recognized this feature early in the study. For example, while the female child enjoyed the idea of picking out fish, the male child wanted to change the scenario to something else.
"Could it be a pet store with cats?" (s2, child with ADHD)
"Can I draw a dragon in Bob's Fish Shop?" (s2, child with ADHD)
This understanding demonstrates how the details of the VRE are trivial. The social interaction is the key to this technology supporting social skill development. The details are simply a way to engage the user for the entire session.

Social Reciprocation
An emergent theme that was not anticipated prior to this study was the social exchange between the users outside of the VRE. One key objective of our technology is social skill acquisition through the user's engagement in Bob's Fish Shop. However, we were excited to see that social skills were immediately exercised through turn-taking in the physical world as well.
"Tell me what you see." (s2, child with ADHD)
"When is it my turn again?" (s1, child with ASD)
Because the VR headset can be easily put on and taken off, the users were able to switch off after every few minutes during the session. This also allowed them to communicate their experience to each other as well as to their mothers.

Implications for Future Practice
As shown above in Figure 4, three behaviors are recorded: where the user is looking throughout the conversation, how long the user takes to respond to Bob, and the verbal exchanges between the two.
More specifically, we designed the system to detect where the child is looking throughout the conversation in order to measure attentiveness as well as how easily distracted the child is while the interaction is taking place. The system records the length of time (in seconds) it takes for the child to respond to Bob during the conversation in order to measure how long the child is paying attention. We captured the transcript to determine if the child understood what took place during the conversation and how decisive the child was during this spontaneous interaction.
This information can be reviewed and analyzed by the user, family, and professionals. The transcript guides therapists and parents toward the areas in which the child needs more focus. For example, if it takes a child six seconds to respond to a question from Bob, then the child needs to work on delivering a quicker response. Similarly, if a child is looking at a red fish but wants to purchase fish food, then the child needs help with focused attention.
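The two review examples above can be expressed as simple checks over one recorded exchange. This is an illustrative sketch only: the field names are hypothetical, and the six-second default simply reuses the example in the text; it is not a clinical cutoff.

```python
from typing import List, NamedTuple


class Exchange(NamedTuple):
    expected_gaze: str      # where attention was expected for this exchange
    observed_gaze: str      # where the user was actually looking
    response_seconds: float


def review_flags(exchange: Exchange, slow_after: float = 6.0) -> List[str]:
    """Return the review notes that apply to one exchange.

    slow_after reuses the six-second example from the text as an
    illustrative default, not as a validated threshold.
    """
    flags = []
    if exchange.response_seconds >= slow_after:
        flags.append("slow response")
    if exchange.observed_gaze != exchange.expected_gaze:
        flags.append("gaze off target")
    return flags
```

For instance, an exchange in which the child looked at the red fish while asking for fish food and took six seconds to answer would be flagged on both counts, whereas a prompt reply made while looking at Bob would produce no flags.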
Children may also be able to see their own successes and mistakes through the text conversation. The user's progress can be tracked over time to see which areas have improved and which areas still need focus. Parents and therapists can review the text with the child and point out more appropriate behaviors and/or responses, promoting a positive and informative learning environment.
While our technology probe proved to be incredibly useful as a basic validation of our system's architecture, it is important to emphasize that this is not sufficient to draw any conclusions regarding the efficacy of the system in improving conversational skills. This requires a more formal user study with a much larger sample size, including users with both neurotypical and neurodiverse backgrounds. As such, the primary contribution of this paper is the technical development of the system, with the goal of providing a rigorous analysis of outcomes in our future work.

Conclusions
This paper presents a virtual reality environment, Bob's Fish Shop, which provides opportunities for users with neurodiverse backgrounds to develop necessary conversation skills in a safe and controlled environment. By carrying out a technology probe with a small sample of two users, we demonstrated that our VRE is an acceptable, feasible system that engaged our users and promoted social conversation. Future work will explore, through a large user study, whether the VR technology presented in this paper supports social skill development. It would also be interesting to expand this VRE to a collaborative multiuser virtual environment, similar to those found in recent studies [26,27]. Finally, we would like to test our hypothesis that users with ASD who study their script outputs from this virtual reality gaming experience will notice mistakes and improve their conversational understanding.
It is clear, now more than ever, that the human species is diverse and our needs are different, including our sensory needs. Traditionally, our professional, educational, and therapeutic systems have followed a one-size-fits-all model. However, VR allows us to customize a unique experience, considering each individual's needs, abilities, and preferences.
Author Contributions: N.S.R. drafted the manuscript, prepared the literature review, and responded to peer reviewer comments. K.L. and J.R. were responsible for implementation of the virtual reality environment, including programming, artwork design, animation, and incorporating voice recordings. K.D. and L.B. provided clinical expertise in special education and human computer interaction, summarized user study results, and assisted with manuscript preparation. E.L. conceptualized the project, designed the system architecture, and assisted in manuscript preparation and review.
Acknowledgments: The authors would like to thank the many members of the MLAT Lab at Chapman University who encouraged and supported this work. In particular, we are grateful to Justine Stewart, Amanda Benavidez, Brent Haub, and Rebecca Rost for their creative insight and talent, which substantially improved the project and this manuscript.