Advances in imaging technology and natural user interfaces in the last decade have allowed the enhancement of public spaces with interactive installations that inform, educate, or entertain visitors using rich media content [1]. These digital applications are usually displayed on large surfaces placed indoors or outdoors, e.g., on walls, screens, or tables using projection or touch screen displays, and users can explore and interact with them in a natural, intuitive, and playful way. In many cases, these installations include 3D content to be explored, such as complicated geometric models or even large virtual environments. Most installations displaying 3D content are currently found in museums and cultural institutions [2], where the use of detailed geometry is necessary for the presentation of cultural heritage, e.g., by exhibiting virtual restorations of ancient settlements, buildings, or monuments. Other application areas of installations based on 3D content include interacting with works of art [3], exploring geographical [4] or historical [5] information, performing presentations [6], interacting with scientific visualizations [7], and navigating in virtual city models for urban planning [8].
The design of 3D interaction techniques for installations requires special attention in order to be intuitive and easy to perform by the general public. 3D environments introduce a different metaphor and extra degrees of freedom, and new users can easily get frustrated through repeated ineffective interactions [9]. The most fundamental, yet complicated interaction technique for any type of environment is user navigation. 3D spaces naturally require frequent movements and viewpoint changes in order to be able to browse the content from different angles, to uncover occluded parts of the scene, to travel to distant parts, and to be able to interact with objects from a certain proximity. Navigation is a mentally demanding process for inexperienced users, because it involves continuous steering of the virtual body, as well as wayfinding abilities. As far as steering is concerned, a difficult challenge for designers is the meaningful translation of the input device into respective movements in the 3D world [10]. In most public installations with 3D content, user navigation takes place from a first-person point of view and involves the virtual walkthrough of interior and exterior spaces.
Typical desktop or multitouch approaches are not the most appropriate means of navigating and interacting with 3D content in public installations. A common setup in such systems is to present the content on a usually large vertical surface and to let visitors interact in a standing position at some distance, so that they can look at the whole screen. The use of a keyboard or mouse is not very helpful for a standing user, whilst touch or multitouch gestures cannot be performed if visitors are interacting from a distance. To overcome these issues, public installations have been using solutions ‘beyond the desktop,’ usually based on natural user interfaces. Initially, the interaction techniques involved handheld controllers, such as WiiMote, or other custom devices, e.g., [11]. However, the use of handheld devices in public settings raises concerns about security and maintenance. More recently, users have been able to interact with 3D content in public settings using body gestures, without the need for any additional handheld or wearable device. Developers have taken advantage of low-cost vision and depth sensing technology and have created interactive applications, in which users can navigate or manipulate objects of a 3D scene using body movements and arm gestures in mid-air.
A variety of sensors have been used for mid-air interactions in public installations, the most popular one being Microsoft Kinect. Kinect can detect the body motion of up to four users in real time and translate it into respective actions. As such, it is appropriate for standing users navigating and interacting with a 3D scene from a distance, and has already been deployed in public museum environments, e.g., [12]. A secondary, less common option is the Leap Motion controller, which is considered faster and more accurate but is limited to hands-only interaction. The controller has to be close to the users’ hands and it is therefore more appropriate for seated users, which is somewhat limiting for public installations. Numerous techniques for first-person navigation have been implemented using the Kinect sensor, such as leaning the body forwards or backwards to move [14], rotating the shoulders to change direction [14], walking in place [15], using hands to indicate navigation speed and direction [17], using both hands to steer an invisible bike [18], etc. Two comparative studies have also been set up to assess the effectiveness and usability of Kinect navigation techniques in field or laboratory settings [15], identifying preferences and drawbacks of the aforementioned techniques.
Most evaluations of interactive 3D installations using Kinect have concluded that it is a motivating and playful approach but not without problems. Some people feel embarrassed to make awkward body postures or gestures in public [19]. Also, there are users who find the interactions tiring after a while because of the fatigue caused by some gestures, e.g., having to hold arms up for a long time. Finally, the presence of other people near the installation may cause interference to the sensor and therefore most of these installations require that an area near the user is clear of visitors. An alternative to mid-air interactions for navigating in public installations that has been recently proposed is the use of a mobile device as a controller [20]. Most people carry a modern mobile device (smartphone or tablet) that has satisfactory processing and graphics capabilities and is equipped with various sensors. Following the recent trend of “bring your own device” (BYOD) in museums and public institutions [21], where visitors use their own devices to access public services offered by the place, one could easily use her device for interacting with a public installation. For example, using the public WiFi, one could download and run a dedicated app or visit a page that turns her device into a navigation controller. The use of mobile devices as controllers has already been tested in other settings, e.g., games and virtual environments, with quite promising results [22]. This alternative may have some advantages compared to mid-air interactions. It can be more customizable, it can lead to more personalized experiences by tracking and remembering individual users, and it could also deliver custom content on their devices, e.g., a kind of ‘reward’ for completing a challenge.
The aims of this work are to examine whether a mobile device used as a controller can be a reliable solution for first-person 3D navigation in public installations, and to determine the main design features of such a controller. We carried out two successive studies for this purpose.
In the first study, we sought to explore whether a smartphone controller can perform at least as well as Kinect-based navigation, which is the most common approach today. We set up a comparative study between mid-air bodily interactions using Kinect and tilt-based interactions using a smartphone in two environments and respective scenarios: A small museum interior, in which the user has to closely observe the exhibits, and a large scene with buildings, rooms, and corridors, in which the user has to effectively navigate to selected targets. The interaction techniques used in this study have been selected and adapted based on the results of previous research, i.e., we used a mid-air interaction involving the leaning and rotation of the upper body, which produced the best results in [15] and was also one of the prevalent methods in [16], and a technique based on the tilting and rotation of the handheld device, which was also found to be usable in [7]. A testbed environment developed for the study automatically measured the time spent to complete each scenario, the path travelled, the number of collisions, and the total collision time. Furthermore, subjective ratings and comments for each interaction technique were collected from the users through questionnaires and follow-up discussions. The results of the first study indicated that the smartphone performed at least as well as Kinect in terms of usability and performance, and it was the preferred interaction method for most of the participants.
Following the encouraging results of the first study, we aimed to look in more depth at the interaction techniques to be used for the design of a mobile controller. For this purpose, we set up a gesture elicitation study to collect preferred gestures from users and improve the guessability of the designed interactions [27]. We had our participants propose their own gestures for a series of navigation actions: Walking forward and backwards, rotating to the left or right, looking up or down, and walking sideways. They were free to select between (multi-)touch actions, rotating or moving the whole device, or a combination of them, and they could propose any visual interface on the device. Whenever they proposed a gesture, we tested it in the museum environment of the first study using a Wizard of Oz technique and had our users reflect on it. The results of the study led to interesting observations regarding the preferred gestures of users and the different ways in which users mapped mobile actions to 3D movements in the projected environment.
We present the results of our studies and discuss their implications for the design of novel interaction techniques for virtual reality applications presented on public displays.
2. Materials and Methods
Initially, we set up a comparative study between Kinect-based and smartphone-based interaction techniques for first-person navigation in 3D environments. Although the aim was to compare these two modalities, we decided to include keyboard input as a third modality in the study as well. The reasons for this decision are, first, that the keyboard is a common input method for users, so it can be used as a basis for comparison, and second, that this input helped users to familiarize themselves with the scenes in the scenarios before trying the other two modalities. Furthermore, we decided to use four degrees of freedom (4DOF) for first-person navigation instead of the two or three used in most of the other studies, because the extra degrees (looking up or down and walking sideways) are useful for architectural walkthroughs and virtual museums. The study focused on the perceived usability and performance of the input modalities in two different scenarios.
After the results of the first study that were generally in favor of the mobile device, we conducted a follow-up gesture elicitation study focusing solely on the smartphone control. Our aim was to address the following question: How do users contemplate interacting with public displays, using smartphone-based gestures for navigating in 3D environments?
2.1. Interaction Techniques
The navigation techniques we designed for the input modalities of the comparative study follow the same concept. The user can use a special action to switch between three navigation modes, each of which has two degrees of freedom. The available navigation modes are the following:
Walk/turn: Move forwards or backwards and turn to the left or right;
Walk sideways (strafe): Move sideways; and
Look around: Look up or down and turn to the left or right.
The interaction techniques for the three modalities (keyboard, Kinect, and smartphone) in each navigation mode (shown in Figure 1) are the following:
For the ‘walk/turn’ navigation mode:
The cursor keys of the keyboard are used;
The user’s body is leaning forwards and backwards, while her shoulders have to rotate left and right in order to turn, for Kinect input;
The smartphone device must be tilted forwards or backwards to move to that direction and rotated like a steering wheel to turn left or right, while held by both hands in a horizontal direction (landscape).
For the ‘walk sideways’ (strafe):
The ALT key of the keyboard combined with the cursor keys is used;
The user’s one arm (either left or right) is raised slightly by bending her elbow and the user’s body is leaning left and right to move to the respective side, for Kinect input;
On the smartphone device, one button should be pressed (either left or right, as both edges of the screen work as buttons) and the device must be rotated like a steering wheel to walk sideways left or right.
For the ‘look around’ mode:
The CTRL key of the keyboard combined with the cursor keys is used: Up and down is used to move the viewpoint upwards or downwards, respectively, and left and right to rotate it;
Both of the user’s arms are raised slightly, while the user’s body leans forwards or backwards to look down or up, respectively, and rotates her shoulders to turn to the left or right;
The user presses both buttons of the smartphone device and tilts or rotates the device to turn the view to that direction.
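The smartphone mapping described above can be illustrated with a short sketch. It is not the controller used in the study: the dead zone, the maximum tilt angle, the sign conventions, and all names (`PhoneState`, `normalize`, `navigation_command`) are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass

DEAD_ZONE_DEG = 5.0   # assumed: small unintentional tilts are ignored
MAX_ANGLE_DEG = 30.0  # assumed: tilt angle at which full speed is reached


@dataclass
class PhoneState:
    pitch_deg: float     # forward/backward tilt (positive = tilted forwards)
    roll_deg: float      # steering-wheel rotation (positive = to the right)
    left_pressed: bool   # left screen edge held
    right_pressed: bool  # right screen edge held


def normalize(angle_deg: float) -> float:
    """Map an angle to [-1, 1] with a dead zone around the rest pose."""
    magnitude = abs(angle_deg)
    if magnitude < DEAD_ZONE_DEG:
        return 0.0
    span = MAX_ANGLE_DEG - DEAD_ZONE_DEG
    scaled = (min(magnitude, MAX_ANGLE_DEG) - DEAD_ZONE_DEG) / span
    return scaled if angle_deg > 0 else -scaled


def navigation_command(state: PhoneState) -> dict:
    """Select the navigation mode from the buttons and fill its two axes."""
    pitch = normalize(state.pitch_deg)
    roll = normalize(state.roll_deg)
    cmd = {"mode": "walk", "forward": 0.0, "turn": 0.0,
           "strafe": 0.0, "look_down": 0.0}
    if state.left_pressed and state.right_pressed:
        # Look around: tilt looks down/up (assumed sign), rotation turns.
        cmd.update(mode="look", look_down=pitch, turn=roll)
    elif state.left_pressed or state.right_pressed:
        # Walk sideways: one button held, rotation strafes left/right.
        cmd.update(mode="strafe", strafe=roll)
    else:
        # Walk/turn: tilt moves forwards/backwards, rotation turns.
        cmd.update(mode="walk", forward=pitch, turn=roll)
    return cmd
```

In this sketch the mode is re-evaluated on every sensor reading, so releasing the buttons returns the user to the walk/turn mode immediately.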
A small pilot study was set up to calibrate the testbed environment prior to the main study. Four users participated in order to adjust the movement and rotation velocity of the first-person controller based on their feedback. They were asked to suggest any changes to the navigation and rotation velocity after some familiarization with the navigation techniques. The speeds were recalibrated on the fly, and the process was repeated until the users felt at ease with the interface. All users arrived at comparable values, which were considerably slower than the testbed’s original values. Based on these outcomes, the testbed environment’s movement and rotation speeds were reduced to nearly 60% of their initial values.
Concerning the follow-up elicitation study, we intended to elicit alternative gestures and interaction techniques for each navigation mode regarding the smartphone-based modality using the same 4DOF. Therefore, we defined the following distinct tasks, one for each DOF:
Walk fwd-back: Walk forwards or backwards;
Rotate left-right: Rotate the viewpoint to the left or right;
Look up-down: Rotate the viewpoint upwards or downwards; and
Walk sideways: Walk to the left or right (strafe) without turning the viewing direction.
2.2. Equipment and Setting
A testbed environment was set up in the Unity game engine to support navigation in 3D scenes and to record the user’s behavior, such as the travel path, duration, and collisions. The environment directly supported keyboard and Kinect input. Additionally, a first-person controller component was implemented as a smartphone app in order to translate the user’s input into respective actions of the virtual body. It transmitted the rotation values and button presses to the testbed application through the WiFi network, and the virtual body moved accordingly.
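As an illustration of how rotation values and button presses might be sent over the network, the sketch below encodes the controller state as a small JSON message and sends it as a UDP datagram. The transport, the message fields, and the function names are assumptions made for this example; the text does not specify the protocol used by the actual app.

```python
import json
import socket

REQUIRED_FIELDS = {"pitch", "roll", "left", "right"}  # hypothetical message schema


def encode_state(pitch_deg, roll_deg, left_pressed, right_pressed) -> bytes:
    """Serialize one controller reading as UTF-8 encoded JSON."""
    message = {
        "pitch": float(pitch_deg),
        "roll": float(roll_deg),
        "left": bool(left_pressed),
        "right": bool(right_pressed),
    }
    return json.dumps(message).encode("utf-8")


def decode_state(payload: bytes) -> dict:
    """Parse a received reading and reject incomplete messages."""
    message = json.loads(payload.decode("utf-8"))
    missing = REQUIRED_FIELDS - message.keys()
    if missing:
        raise ValueError(f"incomplete controller message: {sorted(missing)}")
    return message


def send_state(sock: socket.socket, address, pitch_deg, roll_deg,
               left_pressed, right_pressed) -> None:
    """Send one reading as a single UDP datagram (assumed transport)."""
    payload = encode_state(pitch_deg, roll_deg, left_pressed, right_pressed)
    sock.sendto(payload, address)
```

A connectionless transport is chosen in this sketch because a lost reading is simply superseded by the next one; a real deployment could equally use TCP or WebSockets.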
Both studies took place in the laboratory and shared the same settings and equipment. For the testbed environment, a PC with an Intel Xeon CPU at 3.70 GHz, 16 GB RAM (Intel, Santa Clara, CA, USA), and an NVIDIA Quadro K4200 graphics card (NVIDIA, Santa Clara, CA, USA) was utilized, and the scene was displayed on a projection screen using an Epson EB-X24 projector (Epson, Nagano, Japan). For all three modalities in the comparative study and the one used for elicitation, each user was positioned at a distance of about 2.5 meters from the projection screen. In the keyboard-based input, users were seated using a wireless keyboard; in the Kinect-based input, users were standing in front of a Microsoft Kinect 2 (Xbox One) sensor (Microsoft, Redmond, WA, USA); and in the smartphone-based input, they were standing at the same spot holding a Xiaomi Mi 4i phone that was running the controller app. Figure 2 shows the setup of the user study in terms of screen dimensions, sensor placement, and user distance from screen.
In the testbed setting, three distinct scenes were prepared and used. These were:
Familiarization: A simple scene that allows users to familiarize with each interaction technique. It displays a digitized version of the Stonehenge site.
Buildings: An indoor and outdoor scene displaying abandoned buildings with rooms and corridors. Users’ task was to walk around a building, and to carefully maneuver their virtual body through narrow doors.
Museum: A small interior scene featuring a digitized version of the Hallwyl Museum Picture Gallery. Users’ task was to walk through the hallways slowly and focus on particular displays.
Figure 3 presents a user interacting with the 3D environment using Kinect and a smartphone as input, and the buildings and museum scenes used in the study.
The museum scene was also utilized in the elicitation study. However, in this case, users could not actively interact with it; instead, they navigated through a Wizard of Oz technique. Wizard of Oz is a common technique for evaluating early prototypes in human–computer interaction studies, where users believe that they interact with a system, but in reality, a human is partially controlling the system’s response by observing user input. In our case, users performed the desired gestures and the evaluator controlled the viewpoint using keyboard input, to produce the effect that their actions had actual impact on the movement.
In the comparative study, 22 users participated, 11 males and 11 females. The age span was wide (between 20 and 50), and most users were under 30 (M: 25.4, SD: 7.8). The respondents’ experience with computer games and 3D environments was quite balanced. In total, 11 users reported that they had large or very large experience, 6 had none or little experience, and 5 had medium experience.
In the elicitation study, 28 users participated, 15 males and 13 females. There was also a broad age range (between 19 and 47), but most participants were under 25 (M: 24.6, SD: 6.27). Concerning the participants’ experience with 3D computer games, the result was mixed, as 9 users reported that they had large or very large experience, 7 had none or little experience, and 12 medium.
In both studies, the majority of the participants were students and faculty from our department. However, none of them took part in both studies. Therefore, in the elicitation study, none of the users had prior experience with navigation in a 3D environment using a smartphone.
2.4.1. Comparative Study
The study adopted a within-subjects design, so each subject used all three interaction modalities in both scenes, i.e., 6 trials per user. The procedure was as follows.
First, users were introduced to the research purpose and procedure and were requested to fill out their gender, age, and experience with games and 3D environments in an initial questionnaire.
Next, they had to use the three interaction modalities and complete the scenarios. All users began with keyboard input, followed by the other two modalities, but with an alternating order, to counterbalance potential order effects.
The users had to navigate through all three scenes for each interaction modality. They were initially put in the scene of familiarization and allowed to navigate around using the interface until they felt comfortable with it. Then, they went on to the buildings scene, where they had to carry out a particular task: They had to enter a building and move in three particular rooms to assigned positions.
Third, they had to move around the museum scene and focus on four particular exhibits by attempting to bring them to the middle of the screen. In both scenarios, the target positions were shown to the users by the evaluators during their navigation, so the actual challenge was to steer their virtual body to the designated targets. At the end of each modality, users filled out a questionnaire for the interface and interaction technique with their subjective ratings and remarks.
Finally, users were asked to choose between navigation based on Kinect and a smartphone as their preferred technique and make any conclusive remarks. Each user session took about 35 min to complete.
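The ordering rule described in the procedure (keyboard first, then the other two modalities in alternating order) can be sketched as follows. The function name and the use of the participant index to alternate are assumptions; the text does not state how the alternation was assigned.

```python
def modality_order(participant_index: int) -> list:
    """Return the modality sequence for one participant.

    Keyboard always comes first; Kinect and smartphone swap places for
    every other participant (assumed assignment rule).
    """
    remaining = ["kinect", "smartphone"]
    if participant_index % 2 == 1:
        remaining.reverse()
    return ["keyboard"] + remaining


# Example: orders for the 22 participants of the comparative study.
orders = [modality_order(i) for i in range(22)]
kinect_first = sum(order[1] == "kinect" for order in orders)
smartphone_first = sum(order[1] == "smartphone" for order in orders)
```

With an even number of participants, this rule places each of the two compared modalities in second position equally often.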
2.4.2. Gesture Elicitation Study
In the follow-up gesture elicitation study, we used the following procedure.
First, users were introduced to the purpose and procedure of this study and filled in their gender, age, and experience with 3D computer games.
Then, they were given a switched off mobile device to hold and were asked to use it to propose gestures for the navigation tasks, in a way that seems more suitable/appropriate to them. We explained to them that they were free to use any possible action on the mobile phone, whether on the surface (touch/multi-touch actions) or based on motion (move or tilt actions), and they could propose any possible visual or interactive element to appear on the device’s screen. Also, they were free to hold the phone and place their fingers in any desired way.
Participants had to perform their proposed gestures for each of the navigation tasks in this sequence: Walk fwd-back, rotate left-right, look up-down, and walk sideways. For each task, we used the following protocol: First, we performed the action on the environment (using the keyboard input) and we asked the user to think of a suitable phone gesture to cause that effect. When the user was ready, she performed and explained the gesture (and the desired effects), and we tested it together on the environment using the Wizard of Oz technique. This helped users reflect on their choice and possibly even propose minor alterations. After that, the user was asked to provide her subjective ratings regarding the suggested gesture based on the questions, ‘how easy was it for you to produce this gesture’ and ‘how appropriate is your gesture for the task.’ These questions were adapted from the study presented in [28].
During the study, participants were also asked to combine some of their proposed gestures for reflection and reconsideration. Again, we used the Wizard of Oz technique to test the combinations. The combined actions were the following: (a) Steering the virtual body by combining walk fwd-back with rotate left-right and (b) looking around by combining look up-down with rotate left-right. We tested the first combination after the rotate left-right gesture was proposed, and the second after the look up-down gesture. As expected, sometimes there were conflicts between proposed gestures or unusable combinations, and we discussed with users about their possible resolution.
Finally, users were asked to give any conclusive comments. Each session lasted approximately 20 min.
2.5. Collected Data
In both studies, we collected qualitative and quantitative data, both regarding each interaction technique (comparative study) and users’ intuitive suggestions concerning smartphone-based gestures for each of the navigation tasks (elicitation study).
In the first study, users entered their subjective scores on a 5-point Likert scale regarding: (a) Ease of use, (b) learnability, (c) satisfaction, (d) comfort, and (e) accuracy (see Appendix A). The questions were chosen and adjusted from popular usability questionnaires. In addition, through the testbed environment, the following data were gathered for each task: (a) Task completion time, (b) number of collisions, (c) total collision time, and (d) path. The user’s collision with a wall or obstacle during navigation was deemed an error, so we decided to include both the number of collisions and the total time the user was colliding (e.g., while sliding on a wall) as an indication of ineffective steering. The recorded route can provide a qualitative summary of each user’s navigation quality and also allows the total distance traveled to be calculated. Finally, users were asked to pick the input modality they preferred, between Kinect and a smartphone.
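The four logged measures can be derived from a time-stamped position log, as in the sketch below. The log format (time, two ground-plane coordinates, and a collision flag per sample) and the function name are assumptions for illustration; the actual recording format of the testbed is not described.

```python
import math


def summarize_trial(samples):
    """Compute task metrics from chronologically ordered samples.

    Each sample is a (time_s, x, z, colliding) tuple (assumed format).
    """
    if not samples:
        return {"completion_time": 0.0, "path_length": 0.0,
                "collisions": 0, "collision_time": 0.0}
    path_length = 0.0
    collision_time = 0.0
    # A trial that starts in contact counts as one collision.
    collisions = 1 if samples[0][3] else 0
    for (t0, x0, z0, c0), (t1, x1, z1, c1) in zip(samples, samples[1:]):
        path_length += math.hypot(x1 - x0, z1 - z0)  # distance on the ground plane
        if c0:
            collision_time += t1 - t0  # time spent in contact, e.g., sliding on a wall
        if c1 and not c0:
            collisions += 1            # a new contact begins
    return {
        "completion_time": samples[-1][0] - samples[0][0],
        "path_length": path_length,
        "collisions": collisions,
        "collision_time": collision_time,
    }
```

Counting only the start of each contact keeps a long slide along a wall from being reported as many separate collisions, while the total collision time still reflects its duration.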
In the second study, users gave their subjective scores regarding easiness and appropriateness of each proposed gesture using a 5-point Likert scale (see Appendix A). Moreover, the responses of users were recorded both in video and in writing. The recorded videos provided us with more insights concerning the way users preferred to hold and interact with the device, their reflection and reconsideration of suggested gestures, and more qualitative information of each user’s thoughts and comments regarding the process.
According to the results of the first study, it appears that navigating in 3D environments using a mobile phone performs at least as well as using body posture and mid-air gestures. It was considered more comfortable, and it was also the preferred choice for most participants. This is an indication that mobile devices could be used as reliable alternatives to Kinect and similar depth sensors for 3D navigation in public environments. This finding is in accordance with the results of the study presented in [7], where the smartphone is considered an attractive and stimulating solution for interactive applications in public settings. However, a further issue that needs to be investigated is whether visitors of museums and other public institutions are willing to download specific apps and use their personal phone as a controller.
The results of the elicitation study indicate a noticeable preference for surface over motion gestures when using the mobile phone as a navigation controller. Other usability studies have produced similar results, e.g., [29], but there are also studies where motion gestures have been rated better compared to multitouch, e.g., [33]. One possible explanation for our results is the fact that people are well accustomed to touch or multitouch actions in their daily experience with their mobile devices, e.g., for browsing content, interacting with interface elements, zooming and rotating media, etc., such that it is their first and most intuitive choice. Furthermore, as mentioned before, some more experienced users are affected by previous experiences with tilt-based control and they prefer to avoid it as it is less accurate.
Another interesting result of the study was a certain clustering between experienced gamers and non-gamers that was detected. The majority of people from the first group tried more or less to replicate the functionality of console game controllers or of 3D mobile game interfaces, holding the phone in a landscape orientation and proposing two virtual joysticks, one for each thumb. On the contrary, less experienced users were keener on using motion gestures, they used their imagination more, and in some cases, they proposed really intuitive solutions, although not all of them were feasible. Some of them proposed to have more automated actions and assistance, even if it means sacrificing some control or degrees of freedom. This finding is consistent with the results of previous studies [34] that indicate the fact that first-person walking can be confusing for inexperienced users, raising a need for effective navigation aids. Interactive systems of this kind should preferably be designed to monitor user performance and, in case they detect non-expected behavior, to offer assistance in a non-intrusive way, e.g., through messages, indications, mini-maps, etc.
In some cases, users tried to bring familiar 2D gestures to some of the navigation tasks, resulting in the world manipulation metaphor (also termed “scene in hand”) mentioned before. For example, for walking sideways, they imagined dragging on the phone to the direction where the image should be moving, as if they are performing a horizontal pan action. Similarly, they proposed dragging with their finger down on the phone to move forward, as if they are grabbing the ground floor and moving it towards them. For some of these cases, we tried again in the opposite direction through the Wizard of Oz technique, but the users reported that it was confusing for them. A possible reason that some users chose a 2D gesture is that they had to perform single-axis tasks one by one during the study. So, inexperienced users who did not foresee the rest of the actions needed for navigation in 3D environments tended to assign them to familiar 2D interactions.
Finally, we have to mention that the results of this study are possibly applicable to wearable VR environments as well, as it has been already showcased that a second mobile device can be effectively used as a controller for a cardboard-based virtual museum [35]. In that case, there are, however, a couple of important differences regarding the user interactions and context of use. First, two degrees of freedom (look up-down and rotate left-right) can be directly controlled through the user’s head rotation, so there is no need to include them in the mobile controller. The controller could therefore include extra functions for selecting or manipulating content. Second, given that the users will not be able to directly look at the screen of their device due to the headset, the interactions must be carefully designed, probably also using appropriate feedback (sound or haptics), so that they can be performed ‘blindly.’
This paper examined the suitability of mobile devices as controllers for first-person navigation in 3D by presenting a comparative study between Kinect-based and smartphone-based interactions, followed by a gesture elicitation study for suitable navigation gestures using a smartphone. The results of the first study indicated that the smartphone-based input is at least as reliable as the Kinect-based input. It was preferred by most participants, considered more comfortable, rated higher in all subjective ratings, and produced significantly shorter task completion times in the first of the two scenarios. Furthermore, the second study highlighted the preferred input techniques for 3D navigation using a mobile device, which were considerably different from the techniques we designed for our first study. It also produced some interesting observations regarding the different expectations of users and their preferred ways of mapping their actions to 3D movements, based on their previous experience and background.
An important advantage of a modern smartphone compared to typical game controllers is that its interface can be fully customized, e.g., by adding custom virtual buttons on the screen or by supporting various multitouch gestures. As such, it can be adapted more easily to afford multiple interaction modalities with a 3D application. Designers could consider creating two or more alternative setups to support various user groups, e.g., experienced vs. non-experienced, with various levels of control and support, and even different gesture types.
An issue that needs further research regarding the design of 3D navigation techniques is the support of secondary actions. Often navigation has to be combined with other actions, such as selecting an object, browsing information, etc. Therefore, the interaction techniques designed for 3D navigation should leave room for other tasks that can be performed in parallel. This is an aspect that was not considered in our studies.
In the future, we are planning to further explore the prospects of combining smartphone-based interactions with virtual environments in public settings. We aim to design and develop a number of alternative interfaces based on the results of our elicitation study and compare them in terms of usability and efficiency. Furthermore, we are planning to extend the functionality of the proposed controller by adding non-intrusive user assistance through voice messages and icons, and by including additional forms of feedback, e.g., audio and vibration.