Understanding and Creating Spatial Interactions with Distant Displays Enabled by Unmodified Off-The-Shelf Smartphones

Abstract: Over the decades, many researchers have developed complex in-lab systems with the overall goal of tracking multiple body parts of the user for richer and more powerful 2D/3D interaction with a distant display. In this work, we introduce a novel smartphone-based tracking approach that eliminates the need for complex tracking systems. Relying on the simultaneous usage of the front and rear smartphone cameras, our solution enables rich spatial interactions with distant displays by combining touch input with hand-gesture input, body and head motion, as well as eye-gaze input. In this paper, we first present a taxonomy for classifying distant display interactions, providing an overview of enabling technologies, input modalities, and interaction techniques, spanning from 2D to 3D interactions. Further, we provide more details about our implementation using off-the-shelf smartphones. Finally, we validate our system in a user study with a variety of 2D and 3D multimodal interaction techniques, including input refinement.


Introduction
For decades, researchers have investigated how we interact with distant displays using mobile devices [1][2][3]. Initially, users controlled distant displays via buttons on a remote controller or, later, a touchscreen device. Here, users mainly relied on uni-modal finger input, pressing buttons or swiping on a touchscreen. Besides uni-modal input with a single device [4,5], many researchers proposed multi-modal input modalities using multiple tracking devices, e.g., combining smartphone touch with glasses for eye-gaze input [6]. This allowed researchers to design even more powerful multi-modal interaction possibilities. The nature of the system as a whole changed as well: from a controller-device-oriented system to an environment-tracking-oriented system, where users were observed in a space through which they could move freely while their movements were recognized and considered in the interaction. Hereby, users could start performing spatial interactions with the distant display, which was aware of their physical position relative to it and of the detailed motions of various body parts. The downside of creating such spatial interaction systems, however, was that the overall systems also became more complex. Enabling inputs that relied on touch and the motion of multiple body parts (i.e., hand [7], body [8], head [6], or eyes [6]) required augmenting the user with handheld devices (e.g., smartphones, controller wands) or wearables (e.g., glasses, bands, markers), or augmenting the room with motion-tracking camera systems.
Our main motivation in this work is to find a new user-tracking approach that enables multi-modal interaction combining touch, hand, body, head, and gaze input while completely removing the need for complex tracking devices.
Figure 1. TrackPhone simultaneously uses the front and rear camera of the smartphone to track the absolute world-space pose of the user's device (hand), body, head, and eyes (gaze). Combined with touch inputs, this enables any smartphone user, after a simple app download, to perform powerful multi-modal spatial interactions with distant displays.
In this paper, we first unify the fragmented research under the umbrella of distant display and smartphone-based interaction. To present a unified overview and inform future research, we analyzed over 400 papers from this domain, synthesizing the state of the research field. We provide an overview of the enabling tracking hardware, input modalities, and interaction techniques of research and commercial systems. We continue by discussing key challenges of the domain and outlining which interactions should be included in an essential "must-have" bundle of interaction techniques for future distant display scenarios. Based on these prerequisites, we present our approach for user-motion tracking using the SLAM technology already available in billions of off-the-shelf smartphones. By using both the front and rear camera of the smartphone simultaneously, we can enable multi-modal interactions that combine touch inputs with real-time world-space tracking of the user's hand, body, head motion, and gaze. Without any additional external tracking devices, users can easily interact with the 3D space and use a rich set of different interaction modalities. We further present the results of an initial user study evaluating our tracking system with a variety of multi-modal 2D/3D interaction techniques (i.e., touch, ray-casting, and virtual-hand inputs), including a dual-precision approach for input refinement. The results reveal that participants achieve very satisfactory performance in 2D as well as 3D interactions and that the refinement techniques can improve pointing accuracy by up to three times; they also offer a deeper look into the performance of a wide spectrum of multi-modal interaction techniques. We conclude by presenting demo applications, where we look beyond the input possibilities of our system and showcase how it can easily enhance the 3D user interface output on a 2D display through user-perspective rendering.
In summary, the main contributions of this work are as follows. We present:
• A comprehensive taxonomy and key research objectives addressing tracking devices, input modalities, and interaction techniques used for smartphone-based distant display interactions.
• A novel, solely smartphone-based approach for user-motion tracking, capable of enabling multi-modal interaction through touch as well as hand, body, head, and gaze tracking.
• Two user studies that initially validate our novel tracking approach and compare the performance of various multi-modal interaction techniques, including a refinement mode, in 2D and 3D distant display interaction.
• Demo applications that showcase the versatile input capabilities of our approach as well as enhanced 3D output.
• Future work directions that are specific to smartphone-based tracking but still remain unaddressed.

Taxonomy
The goal of this taxonomy is to contribute a comprehensive analysis of the design space, where smartphones, with and/or without additional tracking devices, are used to drive the interaction with a distant display. For both new and experienced researchers, we provide a short overview of possible tracking techniques, different input modalities, and interaction techniques, spanning from 2D to 3D distant display interactions. Further, we discuss the trade-offs and impacts of the tracking hardware on the creative process of interaction design and the resulting interaction techniques. Based on our taxonomy, we outline key objectives that we address in this work.

Designing the Taxonomy
In order to create the main corpus of relevant publications, we conducted a systematic search in the field of HCI and analyzed about 405 papers from the last 50 years. Our search included the terms: phone, smartphone, mobile, handheld, controller, spatial, vertical, spatially-aware, cross-device, distant, public, situated, remote, large, pervasive, wall, display, interaction, interfaces, as well as their acronyms. The papers included in our corpus had to be concerned with interaction tasks or techniques and tracking technology for people and/or devices. By looking at references within our corpus as well as using our own expertise, we identified additional articles (a common strategy for survey and taxonomy papers [9]). We tagged all papers from the initial corpus for the usage of a distant display or a smartphone in their interactive setups. The result was three subsets of papers:
• Seventy-three papers using a smartphone together with a distant display, optionally with additional tracking devices.
• One-hundred-and-fifty-seven papers using a distant display with input devices other than smartphones (e.g., wands, smartwatches, lighthouse trackers).
• One-hundred-and-seventy-five papers using the smartphone alone (e.g., on-phone user interfaces) or with other devices (e.g., eye-trackers, pens, head-mounted displays, other mobile devices).
Based on our focus, we examined the 73 papers using a smartphone and a distant display concurrently in our taxonomy. We also considered older mobile devices (without touchscreens) and tablets that were used with a distant display. In the next step, we tagged all these papers for the tracking hardware, input modalities, and interaction techniques (see Tables S1 and S2). We would like to acknowledge that our references are not an exhaustive listing of all the papers in the context of distant display and smartphone research; instead, they represent a curated subset of the most relevant ones.

Categorization
Generally, tracking systems for enabling distant display interactions are divided into two categories: inside-out and outside-in. In inside-out tracking, the tracking cameras or sensors are located on the device being tracked (e.g., a head-mounted display worn by the user, a smartphone held by the user), while in outside-in tracking, the sensors are placed in a stationary location in the user's environment, observing the user or the tracked device from a distance (e.g., a Microsoft Kinect or Nintendo Wii sensor placed on the TV, looking in the direction of the user). By analyzing all the related papers, we agree with Brudy et al. [9] that outside-in tracking provides high-fidelity information as opposed to the more light-weight, low-fidelity inside-out tracking setups. The main reason is that for inside-out systems, reliable 3D tracking of the smartphone and the inclusion of the user as part of the sensing are still a major challenge [9]. However, none of the inside-out papers has addressed this issue so far.
Our analysis also revealed that papers using smartphones to enable distant display interactions can be grouped into two categories:
• Inside-out: interactions enabled solely by using the technology embedded in off-the-shelf smartphones.
• Hybrid: interactions enabled by using the inside-out tracking capabilities of smartphones, further enhanced by additional tracking devices, i.e., tracking devices attached to the smartphone, worn by the user, or placed in the room.

Motivation behind Tracking Systems
The compelling idea of "bring-your-own-device" [18], enabling out-of-the-box interactions, ensures that inside-out smartphone tracking remains a constant focus of many researchers in the domain of distant displays [19]. They challenge the smartphone-embedded technology to enable innovative 2D [5] and 3D [20,21] interaction techniques, thoughtfully steering between the harsh constraints of the given tracking hardware to avoid the need for instrumenting the environment with cameras or buying and wearing dedicated devices.
Looking closer at hybrid tracking solutions, we noticed that researchers mostly used a closed hardware infrastructure, which is not always easily accessible to the mainstream. However, these works contribute very important findings regarding the performance and experience of various uni-modal as well as novel multi-modal interaction techniques for distant displays [6,22,23].

Tracked Input Modalities
Using only the smartphone to enable interaction with a distant display allows users to use their fingers for touch input and the movement of their device-holding hand for hand-motion input. These device-tracking inputs are enabled by the device's touchscreen [5] and pose tracking (i.e., device tilting [1,24] as orientation, motion as translation [3,25], or pose as translation + orientation [2]). While touch input and device tilting represent the majority of past as well as recent papers, we can see a trend towards utilizing SLAM [26], which is natively implemented on modern smartphones, e.g., ARCore [27] and ARKit [28]. These libraries show how SLAM can be used to precisely track the absolute world-space pose of the smartphone (i.e., the motion of the user's device-holding hand) in distant display scenarios [2]. However, approaches that would extend the currently known device-tracking inputs with tracking of the user's body, head pose, and gaze still remain unexplored (see Supplementary Materials).
Hybrid setups are also often used for tracking the smartphone pose [22,29] via room-based cameras [30] or smartphone-attached lasers [31][32][33]. Furthermore, they enable many body-tracking input modalities, such as head pose [6,23], body pose [34-36], and eye-gaze [37,38] tracking, using glasses- [38] or room-based trackers [39]. Although smartphone SLAM can already sufficiently track the world-space pose of the smartphone, making external tracking devices obsolete in that regard, outside-in systems still lead in enabling the tracking of multiple body parts of the user [9].
Objective 1 of 2: Maximize the number of enabled input modalities: Enable world-scale smartphone-based motion tracking of the user's touch, hand, head, body and eye-gaze.

Interaction Techniques
Choosing the right interaction techniques for inside-out systems is still a challenge, even for experts in this field. The main body of inside-out approaches investigated and compared different touch [23,40,41], ray-casting [19,24], plane-casting [21,42], and translation-based virtual-hand [2] or peephole [4,43] interactions, ranging from 2D pointing [19] and collaboration [20] to 3D object manipulation [41]. Due to the limited input modalities, researchers need to make significant cuts to the effectiveness of their interaction techniques, even while acknowledging that another technique would be more suitable but would require external trackers. For example, in a recent paper, Siddhpuria et al. found no practical way to detect the absolute orientation of the smartphone relative to the distant display using the IMU [19]. Therefore, they needed to use "fix origin" ray-casting [24] instead of the preferable real-world absolute one. Such compromises create a captivating pull towards increasing hardware complexity and force researchers into a difficult decision between rich interaction techniques and setup simplicity.
Hybrid setups showed numerous distant display use cases where knowing the body position or looking direction of the user can enable highly efficient interaction techniques (e.g., ego-centric [44], head-pointing [23], or gaze+touch [6,38]). Hybrid tracking also allowed researchers to investigate more complex user interfaces, such as 3D data visualizations, which require multiple input modalities that simultaneously provide many degrees of freedom to control 3D object translation, rotation, and scale, as well as the 3D camera viewpoint [45,46]. Furthermore, a historic overview of hybrid systems points out the importance of the smartphone's touchscreen. Due to the Midas problem [6] present in input modalities that have only one input state (e.g., hand, head, gaze, or body motions), the touchscreen played an essential role in reliably segmenting these into meaningful interaction techniques. This made the smartphone an indispensable device even in many systems with external tracking. In our taxonomy, we separated the input modalities between the two tracking categories to show the difference in the role of touch in inside-out and hybrid setups. This is mainly because inside-out systems are much more touch-heavy compared to hybrid systems, which focus more on the motion of the device in 3D space, body interactions, and head or gaze inputs, and use touch only for clicking and clutching.

Hardware Complexity
In inside-out systems, researchers challenged the boundaries of each newly adopted smartphone technology, enabling inputs for distant display interaction using hardware keys [47], joysticks [25], inertial measurement units (IMUs) [1], touchscreens [5], and cameras used for optical flow [3] or SLAM (i.e., simultaneous localization and mapping) algorithms [2]. Researchers using hybrid setups put enormous effort into in-lab hardware setups to investigate spatial interactions with distant displays. For example, in a recent work, Hartmann et al. used six Kinect cameras (each connected to a PC) and a ten-camera Vicon motion tracking system for real-time tracking of the pose of the smartphone and the user's head (users additionally wore a hat with IR markers) [48]. These are significantly complex hardware setups for tracking the smartphone and head pose, yet they are very common among researchers in the domain.
From the large body of papers using outside-in trackers (e.g., vision-based lighthouse trackers), we can note that these are often in-lab systems; consequently, they are often complex to set up and/or to replicate results with. They also often require a lot of dedicated space, as they are not fully mobile, need to be calibrated carefully, require additional computational units, have a limited tracking range and field of view, and need to be used in a controlled environment with a defined number of users, who need to move accordingly (preventing user and camera occlusion). On a technical level, spatial interaction research needs a more practical solution for testing and refinement in situations outside the lab, to support wider-scale use and in-the-wild deployments. Researchers in academia and industry have begun to point out and tackle this infrastructure problem, but no efforts have been strongly focused on minimizing the hardware setup to enable an out-of-the-box experience [9].

Objective 2 of 2: Minimize the need for outside-in tracking for distant display scenarios:
Propose an inside-out smartphone-based alternative with sufficient and comparable input capabilities.

Summary
We can see how the division of tracking technology into inside-out and hybrid setups influences smartphone-based distant-display interaction on many levels. None of the existing works was able to bridge the gap between the two streams of research by providing a sufficient inside-out solution in terms of both the number of tracked input modalities and precision. Using a smartphone to enable distant display interactions is therefore still mainly limited to device-tracking inputs and lacks the inclusion of body-tracking inputs. Based on our analysis, we believe that the essential bundle of interaction techniques for future distant displays should consist of touch inputs and device (hand), head, gaze, and body pose tracking. Being able to track all these input modalities without any external devices, using only an off-the-shelf smartphone, would make many powerful spatial interactions discovered over decades of in-lab research available to everyone (e.g., head or gaze pointing, virtual-hand or peephole interactions, body-centered inputs). First steps have already been made towards using smartphone-based world-tracking in the domains of distant displays [2], handheld AR [49][50][51], and head-mounted displays [52], as well as using face-tracking in the domains of cross-device [53] and on-phone interactions [54]. Using simultaneous world- and face-tracking on off-the-shelf smartphones, however, still remains unaddressed, since the first examples of the technology were only recently featured for handheld AR use cases [55].

TrackPhone
In this paper, we also present the smartphone application TrackPhone as an iOS ARKit solution, based on our preliminary work [56], where we used two smartphones combined in a single phone case to mimic simultaneous world- and face-tracking before it was officially released (see Figure 2). With this, we could start implementing our tracking framework that allowed world-space phone (hand), head, and body pose tracking. In that work, one phone performed world-to-device tracking (i.e., smartphone pose), while the other phone was used for device-to-user tracking (i.e., head and body pose). For TrackPhone, we adapted the two-phone approach to a single-phone implementation using the latest ARKit releases that natively support simultaneous world- and face-tracking. Furthermore, we extended our tracking framework with eye-gaze tracking. Other researchers can thus access the single-phone or the two-phone implementation; the latter is still relevant for Android phones, since ARCore still does not support simultaneous tracking and therefore requires a workaround. TrackPhone provides tracking data that can be described in arbitrary real-world space and scale, but also as motions relative to a real-world object, for example, a distant display. To provide motion relative to a real-world object, this object needs to be registered via the image-tracking feature of ARKit. This procedure needs to be conducted only once, since the tracking information gets saved for further usage in the device's internal world map. Otherwise, no calibration or other preparation steps are required. Users only need to download the app to their own smartphone, and they can use it out-of-the-box. In our overall system, the smartphone app sends all the tracked and processed user inputs to the distant display computer, which renders the user interface.
In total, TrackPhone enables simultaneous real-time tracking of the 6DOF world-space position and orientation of the smartphone (i.e., hand) and of the user's head, body, and eye (i.e., gaze) motions. The touchscreen of the smartphone can be used for clutching, tapping, dragging, or other touch inputs as usual. The body pose is calculated based on the head and smartphone poses and represents a single 3D point in space (see our preliminary work [56]). This is a common approach to detect user motion or travel in a distant display scenario. For example, in outside-in systems (e.g., Vicon, OptiTrack), users often wore head-mounted IR markers which represented the body's position and orientation [35,57].
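The underlying principle can be sketched as a composition of homogeneous transforms: world tracking (rear camera) yields the world-to-device pose, face tracking (front camera) yields the device-to-face pose, and multiplying the two gives the absolute world-space head pose. The following Python sketch illustrates this with plain 4 × 4 matrices; the function names, the example values, and the simple body-point heuristic are our illustrative assumptions, not the exact TrackPhone implementation.

```python
# Sketch: composing world-to-device and device-to-face poses into a
# world-space head pose. Names, numbers, and the body-point heuristic
# are illustrative assumptions, not the actual TrackPhone code.

def mat_mul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(x, y, z):
    """Homogeneous 4x4 translation matrix."""
    return [[1, 0, 0, x],
            [0, 1, 0, y],
            [0, 0, 1, z],
            [0, 0, 0, 1]]

def position(m):
    """Extract the translation component of a 4x4 pose."""
    return (m[0][3], m[1][3], m[2][3])

# World-to-device pose from world tracking (rear camera + SLAM), in meters:
world_from_device = translation(0.3, 1.2, 0.5)   # phone held ~1.2 m high
# Device-to-face pose from face tracking (front camera):
device_from_face = translation(0.0, 0.1, 0.4)    # head ~40 cm behind phone

# Composition yields the absolute world-space head pose:
world_from_face = mat_mul(world_from_device, device_from_face)

# A single 3D body point can then be estimated from the head position,
# e.g. by dropping it towards the torso (illustrative heuristic):
hx, hy, hz = position(world_from_face)
body_point = (hx, hy - 0.4, hz)
```

In a real implementation, the two poses would of course contain rotation as well; the multiplication order (world_from_device × device_from_face) is what makes the user's head pose available in the same world space as the registered distant display.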
When performing simultaneous face- and world-tracking from a single handheld device, we need to be aware of an accumulated tracking noise error, since any tracking noise present in the world-to-device tracking adds to the device-to-user tracking noise. For example, if the user has a jittery hand, this affects the smartphone pose as well as the head, body, and eye poses. Therefore, we use the 1 Euro filter to minimize the tracking noise [58,59].
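The 1 Euro filter [58,59] is an adaptive low-pass filter whose cutoff frequency grows with signal speed, suppressing jitter when the pose is nearly still while keeping lag low during fast motion. A minimal Python version of the published algorithm is shown below; the parameter defaults are illustrative and do not reflect TrackPhone's actual tuning.

```python
import math

class OneEuroFilter:
    """Minimal 1 Euro filter (Casiez et al.) for one scalar signal."""

    def __init__(self, freq, min_cutoff=1.0, beta=0.0, d_cutoff=1.0):
        self.freq = freq              # sampling frequency in Hz
        self.min_cutoff = min_cutoff  # cutoff at zero speed
        self.beta = beta              # speed coefficient (jitter vs. lag)
        self.d_cutoff = d_cutoff      # cutoff for the derivative filter
        self.x_prev = None            # previous filtered value
        self.dx_prev = 0.0            # previous filtered derivative

    def _alpha(self, cutoff):
        # Smoothing factor of an exponential filter for a given cutoff.
        tau = 1.0 / (2.0 * math.pi * cutoff)
        te = 1.0 / self.freq
        return 1.0 / (1.0 + tau / te)

    def __call__(self, x):
        if self.x_prev is None:       # first sample passes through
            self.x_prev = x
            return x
        # Filter the derivative, then adapt the cutoff to the speed.
        dx = (x - self.x_prev) * self.freq
        a_d = self._alpha(self.d_cutoff)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat
```

In a pose-tracking pipeline such as ours, one filter instance per tracked coordinate (e.g., each axis of the head position) would be fed at the camera frame rate.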

Evaluation
To validate our novel smartphone-based tracking approach, we performed a user study to provide quantitative data about the performance of our system. In the related work on smartphone-enabled distant display interaction, we found studies comparing touch and head-pointing [19], touch and ray-casting [19,23], touch and a virtual hand [29,60], and touch and gaze [38]. Therefore, we performed a unified study comparing touch, ray-casting, virtual-hand, and head-pointing against each other to gain further insights into the performance of multi-modal interaction techniques. Since eye-tracking has only recently been implemented for simultaneous world- and face-tracking use, we were not able to include gaze in this study. Similarly to other works, we included a dual-precision approach [6,23,61], where one primary modality or input mode is used for coarse inputs ("suggesting") and a second for fine-grained inputs ("refinement"). This is especially practical for selecting very small targets. Since most of the related studies investigate the previously mentioned techniques with 2D user interfaces, we also included a 3D user interface task for the distant display. We conducted two studies: in the first, we primarily focused on a 2D pointing and selection task, while in the second, we focused on 3D selection and manipulation (a 6DOF docking task). In both studies, we particularly focus on finding the overall best primary and refinement techniques, as well as the best combination of the two.

Apparatus
In both studies, we used TrackPhone in a projector-based setup using an Epson EH-TW5650 with a projected image size of 170 × 95.5 cm (1920 × 1080 px) as a distant display. As in similar studies, participants were instructed to stand on a marked spot, centered and 2 m away from the projected image [19,23,62].

Investigated Interaction Techniques
The overall goal was to find interaction techniques that require little physical overhead and which provide high accuracy while interacting with very small target objects. In the following, we discuss four different interaction techniques (see Figure 3):

Touch
Touch maps the user's relative finger movements on the smartphone's touchscreen to the distant display's cursor [19,20]. The space of the distant display (170 cm × 95.5 cm) is mapped directly to the touch area on the phone, which spans the entire width (6.9 cm/1242 px) and the corresponding height (3.9 cm/698 px). The primary mode uses a 1:1 mapping, while the refinement mode uses a 2:1 mapping, as proposed by Kytö et al. [61], requiring twice as much finger movement to move the cursor by the same amount as in the primary mode.
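This mapping can be expressed as a simple control-display transfer function. The sketch below, our own illustration using the dimensions given above, maps a finger delta in phone centimetres to a cursor delta in display centimetres:

```python
# Control-display mapping for the Touch technique (illustrative sketch).
# Display: 170 x 95.5 cm; touch area: 6.9 x 3.9 cm (full phone width).

DISPLAY_W, DISPLAY_H = 170.0, 95.5   # distant display size in cm
TOUCH_W, TOUCH_H = 6.9, 3.9          # usable touch area in cm

def touch_to_cursor_delta(dx_touch, dy_touch, refinement=False):
    """Map a finger movement (cm on the touchscreen) to a cursor
    movement (cm on the distant display). In refinement mode, the 2:1
    mapping halves the effective gain, so twice as much finger travel
    is needed for the same cursor movement."""
    gain = 0.5 if refinement else 1.0
    return (dx_touch * (DISPLAY_W / TOUCH_W) * gain,
            dy_touch * (DISPLAY_H / TOUCH_H) * gain)
```

Swiping across the full touch width thus traverses the full display width in primary mode, and half of it in refinement mode.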
Pointer: Pointer is a ray-casting technique [19,63], where the ray originates from the smartphone and points forward. This absolute pointing principle is used in the primary mode, while the refinement mode is based on relative pointing using a ratio of 2:1. This means that starting with the current cursor position when switching to the refinement mode, only a half-degree per physically performed full-degree in smartphone rotation is applied to the ray's orientation update.
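Conceptually, the primary Pointer mode intersects the smartphone's forward ray with the display plane. A minimal sketch of this, assuming the registered display lies in the z = 0 plane of world space (the coordinate convention and names are our own illustration):

```python
# Ray-casting sketch: intersect the phone's forward ray with a distant
# display assumed to lie in the z = 0 plane of the registered world space.

def raycast_to_display(origin, direction):
    """Return the (x, y) hit point on the z = 0 display plane, or None
    if the ray is parallel to or points away from the plane."""
    ox, oy, oz = origin
    dx, dy, dz = direction
    if dz >= 0.0:            # not heading towards the display plane
        return None
    t = -oz / dz             # ray parameter where z reaches 0
    return (ox + t * dx, oy + t * dy)

# Phone held 2 m in front of the display, pointing slightly up and left:
hit = raycast_to_display((0.0, 1.2, 2.0), (-0.1, 0.05, -1.0))
```

The 2:1 refinement mode would then simply halve every applied angular change before recomputing the ray's direction, starting from the cursor position at the moment of mode switch.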
Hand: For mapping hand motions, we use a virtual-hand metaphor [2,64,65], where a virtual hand (i.e., cursor) in 2D/3D is controlled directly by the user's hand motion. The hand's position is tracked within a virtual control space of 50 × 20 × 20 cm, which, in primary mode, directly corresponds to the display space of 170 × 95.5 × 95.5 cm (1920 × 1080 × 1080 px). As with the previous techniques, we switched to a 2:1 relative mapping for the refinement mode.
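The virtual-hand mapping is a per-axis linear scaling from the physical control space to the display space. A hedged sketch with the dimensions given above (the study's actual refinement is a relative 2:1 mapping from the current cursor position; here it is simplified to a halved gain for illustration):

```python
# Virtual-hand mapping (illustrative): a 50 x 20 x 20 cm physical control
# space is scaled per axis to the 170 x 95.5 x 95.5 cm display space.

CONTROL = (50.0, 20.0, 20.0)    # cm, physical range of hand motion
DISPLAY = (170.0, 95.5, 95.5)   # cm, virtual display space

def hand_to_cursor(hand_pos, refinement=False):
    """Map a hand position (cm, relative to the control-space origin)
    to a cursor position in display space. Refinement is simplified
    here to a halved gain; the study used a relative 2:1 mapping."""
    gain = 0.5 if refinement else 1.0
    return tuple(h * (d / c) * gain
                 for h, c, d in zip(hand_pos, CONTROL, DISPLAY))
```

Moving the hand across the full 50 cm control width thus moves the cursor across the full 170 cm display width in primary mode.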

Head
The head-pointing technique works similarly to the pointer technique, with the difference that the head position is the origin of the ray and the ray direction is defined by the head's forward direction [6,23]. The refinement mode works similarly to the pointer technique.

3D Interaction
While all four interaction techniques can be used for 2D user interfaces, only Hand directly includes 3D capabilities. The other three techniques can be extended by positioning the cursor on the x- and y-axes as defined by the interaction technique in 2D and controlling the cursor's z-axis following the principle of the Hand technique (i.e., the fishing-reel metaphor [66]). When using the Pointer technique, for example, users tilt the device to move the cursor on the x/y plane and move their hand forward and backward to adjust the cursor on the z-axis.

Touchscreen Interaction
To select a 2D target or grab a 3D object, we used the touchscreen of the device. Selection was implemented as a short finger tap (touch down-up < 250 ms, as in [19]) and was set as the fundamental selection method due to its fast, reliable, and deliberate interaction. In addition to the tap, we also implemented a grab mechanic (touch down > 250 ms) for 3D object manipulation. As we also used a tap to switch between the primary and refinement technique, the touchscreen was horizontally split evenly into two areas, as in [67]. The bottom area was used for object interaction, while the upper area was used to activate the refinement mode.
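This touch handling can be sketched as a small classifier over touch duration and start position. The 250 ms threshold and the even split come from the setup above; the function name, screen height, and coordinate convention (y = 0 at the top) are our illustrative assumptions:

```python
# Touch-event classification (sketch): a short tap selects, a long press
# grabs, and touches starting in the upper half toggle refinement mode.

TAP_MAX_MS = 250        # tap threshold from the study setup
SCREEN_H_PX = 2688      # illustrative screen height in pixels

def classify_touch(down_y_px, duration_ms):
    """Classify a completed touch by where it started (y = 0 at the
    top of the screen) and how long the finger stayed down."""
    if down_y_px < SCREEN_H_PX / 2:    # upper area: refinement toggle
        return "toggle_refinement"
    if duration_ms < TAP_MAX_MS:       # bottom area, short tap: select
        return "select"
    return "grab"                      # bottom area, long press: grab
```

Splitting the modal switch spatially rather than temporally keeps the selection tap itself fast and unambiguous.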

Participants
In total, 18 paid volunteers (7 female, M = 28.9 (SD = 6.0) years) participated, recruited from different departments of the local organization. Six participants had intermediate experience with device ray-casting techniques (Nintendo Wii).

Study 1: 2D Pointing
The first experiment investigated the performance of TrackPhone and all four interaction techniques (Touch, Pointer, Head, Hand) while using them in a 2D pointing task in combination with a distant display. We cross-combined all techniques with each other so that each possible combination of primary technique and optional refinement technique was considered. In this study, we used explicit target selection, triggered by a touchscreen tap.

Task
Our study design was based on the experiment of Vogel et al. [62]: participants were required to select a circular target with a diameter of 5 cm using a cursor. The cursor was represented as a circle (diameter of 1 cm) and changed to a 2 cm crosshair in refinement mode. Targets were positioned randomly across the whole distant display, while the distance from the previous to the next target was always fixed at 50 cm. Before the study, we asked participants to align the cursor with the target as precisely as possible and to do so as quickly as possible. To discourage excessive effort on accuracy at the expense of time, a limit of 5 s (based on pilot study results) was placed on each trial.

Design
A repeated-measures within-subject design was used with two factors: primary technique (Touch|Pointer|Hand|Head) + refinement technique (None|Touch|Pointer|Hand|Head). Due to the incompatibility of the two techniques, the pair Pointer+Head was excluded. For each technique, participants completed a block of 20 target selections (in addition to 5 blocks for training). Overall, each participant completed 19 techniques × 20 selections = 380 trials. We randomized the order of the techniques for each participant.
For each trial, we measured the selection time and the selection error, which is defined as the difference between the cursor and target position upon selection. After each technique, participants were asked to provide subjective feedback by commenting on the currently experienced primary and refinement technique regarding ease of use, physical demand, and mental demand. The subjective-feedback phase was also used as a break from the interaction to mitigate physical fatigue.

Results
We performed a repeated-measures ANOVA (α = 0.05) for both time and error. When the assumption of sphericity was violated (tested with Mauchly's test), we used the Greenhouse-Geisser corrected values in the analysis. The post hoc tests were conducted using pairwise t-tests with Bonferroni corrections. Time and error analyses included only successful target selections (6782 of 6840 total trials, or 99.2%). For the analysis, we considered all instances where a particular technique occurred as the primary mode or refinement mode, respectively. Note that all pairwise comparisons presented below are on a significance level of p < 0.001, unless noted differently.

Time
We found a significant main effect on time for the primary modes (F3,957 = 362.39, p < 0.001), the refinement modes (F4,1276 = 174.30, p < 0.001), and the primary × refinement interaction (F12,3828 = 93.19, p < 0.001), see Figure 4. For the primary mode, which supported participants in moving the cursor quickly over a large distance, we found that Pointer was the fastest, followed by Hand and Head (equally fast), while Touch was the slowest, see Figure 5. For the refinement mode, which is used to refine the cursor's final position, Touch was the fastest, followed by None, Hand, Pointer, and Head, all significantly different from each other. We can see that some interaction techniques are great for fast and coarse pointing but too slow for fine adjustments. Touch and Pointer are clear examples of that: while Pointer is the fastest as a primary technique, it is the second slowest for refining, and while Touch is the slowest for coarse pointing, it is the fastest for refining. Looking at all 19 combinations of primary and refinement techniques, we see that certain pairs are faster or slower than expected based on the general average of each individual mode. We assumed that Pointer-Touch would be the overall fastest technique, since it combines the fastest primary and the fastest refinement mode. However, this was not the case: the technique Head-Touch was even faster. This shows how much performance can be gained by choosing the right combination. If we look at other techniques where Head is the primary mode (e.g., Head-None, Head-Pointer, Head-Hand), we can see that they are only mediocre in terms of speed; the particular combination of Head-Touch, however, stands out as the overall fastest. The slowest technique, Touch-Head, confirms that the results from the individual modes are still worth considering, since Touch is indeed the slowest primary mode and Head the slowest refinement mode.
All significant effects can be seen in Figure 6.

Error
We also found a significant main effect on error for the primary techniques (F(3, 942) = 65.249, p < 0.001), the refinement techniques (F(4, 1256) = 160.278, p < 0.001), and the primary × refinement interaction (F(12, 3768) = 20.540, p < 0.001), see Figure 7. In terms of accuracy, we were positively surprised by the overall precision of TrackPhone: independently of which technique was used, participants achieved a precision of better than 1.14 cm.
The study showed that both Touch and Pointer were the most accurate primary modes (p = 0.57), followed by Hand and Head. Unsurprisingly, however, the refinement mode was more important for accuracy, since it is used to correct the pointing error. The results show that Touch was the most accurate refinement mode, followed by Hand, Pointer, and Head (not significantly different from each other), with None as the most inaccurate. Across all 19 interaction techniques, those using Touch for refinement are highly accurate, with Touch-Touch being the most accurate of all.
Overall, our results show that a refinement technique can improve accuracy by up to a factor of three, as the comparison of Head-None and Head-Touch shows, see Figure 8. Our results support the findings of Kytö et al. [61], who showed that refinement in head-mounted AR can improve accuracy by nearly a factor of five, as well as those of Šparkov et al. [68], Chatterjee et al. [69], and Jalaliniya et al. [70], who found a threefold accuracy improvement when combining eye-gaze with hand gestures or head-pointing for distant displays.
We can see that even in interactive systems where only one input modality is available, offering a refinement mode based on that same modality can significantly improve accuracy, as shown by the Head-Head, Touch-Touch, Pointer-Pointer, and Hand-Hand results compared to their *-None counterparts. In a multi-modal system, we can achieve even higher accuracy by mixing modalities (e.g., Head-Head vs. Head-Touch). Similar results were also found by Kytö et al. [61], who found that Head-Head was more accurate than any other technique.

Qualitative Results
All 18 participants expressed that the combinations Head-Touch, Pointer-Touch, and Pointer-Pointer were the best. This positive feedback was supported by comments such as P9: "Head-Touch, this is really good, since I only need to look at the target and the cursor is already there, then I just fine-tune. For bigger targets, you would not even need a cursor." and P12: "Pointer-Pointer, I like this one, since you do not need to do any touch actions, besides the tap". Participants further provided valuable comments explaining that touch can feel slower compared to non-touch techniques: P16: "Touch-Touch it's very precise, however, I need to perform multiple swipes-which makes me slow". They also pointed out that switching between modalities for the refinement can reduce performance and be mentally demanding: P9: "Pointer-Hand, the input movement here feels the same as Pointer-Pointer, although its harder-I would rather stay in the same modality as Pointer-Pointer", P13: "Head-Hand, I am slow, since I need to stop rotating my head and start moving my hand. This makes me slow, and I need to actively think about doing it also". Finally, many participants were positively surprised by the performance of the primary techniques alone, used without refinement: P6: "Hand-Only, this is surprisingly precise".

Study 2: 3D Selection and Manipulation
In the second experiment, we investigated the performance of TrackPhone and our interaction techniques (Pointer, Head, Hand) in a 3D docking task in combination with a distant display. We excluded the Touch techniques from this study, since touch interaction conflicts with the 3D manipulation, where touch is required for "grabbing" 3D objects. As in the previous study, we cross-combined all primary techniques with all refinement techniques, resulting in a total of 11 interaction techniques.

Task
Our study is based on the experiments of [29,71,72], where participants were required to perform a 6DOF docking task: selecting a 3D object in 3D space and aligning it with a target object by matching position, rotation, and scale.
For rotating and scaling the 3D object, we used 3D widgets [73], an intuitive and conventional method to manipulate 3D objects [64]. While participants moved the 3D cursor, the closest 3D object or manipulation widget was automatically pre-selected, as proposed by Baloup et al. [63]. The pre-selected object or widget then had to be selected. Once selected, the 3D object could be translated, with the translation directly mapped to the cursor. In addition, users had to select the 3D widgets to rotate or scale the object. We used axes separation [74,75] on the 3D widgets: the cursor's up/down movements scaled the 3D object uniformly, while left/right movements rotated the object around the y-axis. Similar to the first study, the cursor changed from a sphere (1 cm diameter) in the primary mode to a 3D crosshair (3 × 3 × 3 cm) in the refinement mode. For each trial, the position of the 3D target was randomly assigned from a pre-defined list of all possible positions; the list was generated before the study and contained 3D coordinates that differed on all axes. The rotation and scale of the 3D targets were randomized for each trial such that each target was rotated at least 90–270° and scaled at least 10–30 cm differently than the previous one. The docking task was successfully completed once the position on each axis matched within <2 cm, the rotation differed by <4°, and the scale by <2 cm. We defined these parameters based on a previous pilot study and would like to highlight that they require very high precision: when these constraints were fulfilled, the two objects were visually perfectly aligned. Again, we asked participants to align the two 3D objects as precisely as possible and to do so as quickly as possible. To discourage excessive effort on accuracy at the expense of time, a limit of 40 s was placed on each trial (see Figure 9).
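The completion check and the axes-separation mapping described above can be sketched as follows; the function names, data layout, and gain factors are illustrative assumptions, while the thresholds are the ones from our pilot study:

```python
# Completion thresholds from the pilot study: <2 cm per position axis,
# <4 degrees of rotation, <2 cm of scale difference.
POS_EPS_CM, ROT_EPS_DEG, SCALE_EPS_CM = 2.0, 4.0, 2.0

def docked(obj, target):
    """The docking trial is complete once position (per axis),
    rotation, and scale all lie within the thresholds."""
    pos_ok = all(abs(o - t) < POS_EPS_CM
                 for o, t in zip(obj["pos"], target["pos"]))
    # Compare y-rotation on the circle, so 359 deg vs 1 deg counts as 2 deg.
    d = abs(obj["rot_y"] - target["rot_y"]) % 360.0
    rot_ok = min(d, 360.0 - d) < ROT_EPS_DEG
    scale_ok = abs(obj["scale"] - target["scale"]) < SCALE_EPS_CM
    return pos_ok and rot_ok and scale_ok

def apply_widget_motion(obj, dx_cm, dy_cm):
    """Axes separation on the rotate/scale widget: vertical cursor
    motion scales the object uniformly, horizontal motion rotates it
    around the y-axis. The gain factors here are hypothetical."""
    obj["scale"] += dy_cm * 1.0                           # up/down -> scale
    obj["rot_y"] = (obj["rot_y"] + dx_cm * 5.0) % 360.0   # left/right -> rotate
    return obj
```

Handling the rotation difference on the circle matters because a target at 359° and an object at 1° are visually only 2° apart and should count as aligned.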

Figure 9. Real-world study apparatus for the 2D/3D study; the required interactions in the 3D study: translate, scale, and rotate.

Design
A repeated-measures within-subject design was used with two factors: primary technique (Pointer|Hand|Head) and refinement technique (None|Pointer|Hand|Head). Due to the incompatibility of the two techniques, the combination Pointer+Head was excluded. For each technique, participants completed a block of 3 docking tasks (in addition to 3 training blocks). In summary, each participant completed 11 techniques × 3 blocks = 33 trials. We randomized the instruction order of the techniques for each participant.
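The design can be enumerated directly; a minimal sketch (the list names are ours) that cross-combines the factors and drops the excluded pair:

```python
from itertools import product

PRIMARY = ["Pointer", "Hand", "Head"]
REFINEMENT = ["None", "Pointer", "Hand", "Head"]

# Cross-combine primary and refinement techniques, excluding the
# incompatible Pointer+Head combination.
techniques = [(p, r) for p, r in product(PRIMARY, REFINEMENT)
              if (p, r) != ("Pointer", "Head")]

BLOCKS = 3
trials_per_participant = len(techniques) * BLOCKS
print(len(techniques), trials_per_participant)  # 11 techniques, 33 trials
```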
We measured the time of each trial. After each technique, participants provided subjective feedback by commenting on the experienced primary and refinement technique regarding ease of use as well as physical and mental demand. The subjective-feedback phase also served as a break from the interaction to mitigate physical fatigue.

Results
We conducted a repeated-measures ANOVA (α = 0.05) for time. As in the first study, we tested our data for the assumption of sphericity and performed post hoc tests. The time analyses included only successful target selections (588 of 594 total trials, or 99.0%). As in the first study, we considered all instances where a particular technique occurred as the primary mode or refinement mode, respectively. Note that all pairwise comparisons presented below are at a significance level of p < 0.001, unless noted otherwise.
Regarding the primary techniques, we found that the Pointer and Hand conditions were equally fast and both faster than Head, see Figures 5 and 10. For the refinement techniques, Pointer and None were equally fast and faster than Hand and Head, which were also equal. Looking again at the primary × refinement interaction, see Figure 6 (• dots), we see that Hand is a particularly fast technique: not only did it provide the fastest primary mode, which participants used to quickly select and translate the 3D object, but it also provided enough accuracy to be the overall fastest technique even without any refinement (Hand-None). In general, we were surprised by the performance of TrackPhone and our techniques even when used without any refinement (*-None). Although the task required a very precise 3D alignment of an object, participants managed it easily. From these results, we conclude that the primary techniques alone gave users enough precision to align the 3D object quickly. However, we also need to take into account that the 3D widgets, object pre-selection, and the separation of axes for manipulation contributed considerably to these positive results.

Subjective Feedback
Participants expressed positive feedback for Hand-None and Hand-Hand with comments such as P6: "Hand-Hand, It's very accurate, even without refinement, I would like that the refinement would be optional-triggered only once needed". Pointer-Hand was also preferred by a few participants, with arguments such as P10: "Pointer-Hand, I like this, the other way around is however very bad".

Discussion
Overall, we can summarize that our smartphone-based tracking approach enabled interaction techniques that achieve very satisfactory performance for 2D as well as 3D interactions, even without refinement and without requiring any complex tracking hardware. This was also frequently pointed out by many participants, who explained that the basic primary techniques are in many cases already accurate enough.
Our studies showed that by using multi-modal refinement techniques, we can improve pointing accuracy by up to a factor of three, beyond the accuracy of the primary techniques, without compromising time. We found that some techniques are better suited for primary input (e.g., Head, Pointer) and others for refinement (e.g., Touch). An in-depth comparison showed that by carefully combining different multi-modal techniques, we can create high-performing techniques that are faster and at the same time more accurate than all other combinations of the same modalities, e.g., Head-Touch for 2D interaction.
We can report that Head-Touch was overall the best 2D interaction technique, as it was both the fastest and the most accurate; Head and Pointer were the best primary techniques and Touch was the best refinement technique. In 3D interaction, the Hand-None technique was the fastest, Hand and Pointer were the fastest primary techniques, and None and Pointer the fastest refinement techniques.
We also learned that using the same modality for primary interaction and refinement is a good option (e.g., Pointer-Pointer or Hand-Hand). First, this also works for uni-modal systems; second, as pointed out by our participants, switching between different kinds of 3D motions performed by different parts of the body (e.g., Pointer-Hand, Head-Pointer, or Hand-Pointer) can cause mental effort and slow them down. Therefore, in some situations, a uni-modal combination is more appropriate.

Applications
In this section, we present application scenarios that show the benefits of TrackPhone for interaction with 3D content on a 2D display. Besides the object manipulation options discussed and evaluated in the study, we also explore other means of interaction that become possible through the available tracking and interaction capabilities.
The 3D Studio is an application designed to showcase the capabilities of the interaction techniques for pointing, selection, and manipulation of 3D content, as seen in the second study, in a more crowded setting. As shown in Figure 11a,b, users can freely rearrange and alter the interior of a room, using it as a playground to test the different interaction techniques we presented in this work.
Besides this, we were also interested in how our system could be used to actively enrich the user's perception of a displayed 3D scene [76]. Due to the lack of depth cues, 3D objects and scenes are poorly represented on flat 2D screens. User-perspective rendering addresses this issue but usually requires complex hardware setups, making such 3D experiences inaccessible to the mainstream [77]. Due to its head-tracking capabilities, TrackPhone enables user-perspective rendering without any additional hardware and can thus significantly improve the 3D interaction and experience, see Figure 11c. We were further interested in whether the body and head motion we capture can be used for navigation in a virtual world. In an outdoor game scene, we map the user's movement in physical space to the virtual avatar, see Figure 11e,f. This enables users to travel through the scene [78] and adapt their view, which in combination with the user-perspective rendering from before provides an even greater sense of immersion, even if the movement is restricted by the constraints of the physical room in the current implementation.
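A common way to implement user-perspective rendering from a tracked head position is an asymmetric (off-axis) view frustum; the sketch below is our own illustration, not the paper's implementation, and assumes the display is centred at the origin in the z = 0 plane with the viewer at positive z:

```python
def off_axis_frustum(head, half_w, half_h, near):
    """Asymmetric view-frustum bounds for user-perspective rendering.

    head   -- (ex, ey, ez): tracked head position in screen coordinates,
              with the display centred at the origin in the z = 0 plane
              and the viewer at ez > 0 (metres).
    half_w, half_h -- half the physical screen width/height (metres).
    near   -- near-plane distance (metres).
    Returns (left, right, bottom, top) at the near plane, in the form
    consumed by an OpenGL-style frustum call.
    """
    ex, ey, ez = head
    scale = near / ez  # project screen edges onto the near plane
    left = (-half_w - ex) * scale
    right = (half_w - ex) * scale
    bottom = (-half_h - ey) * scale
    top = (half_h - ey) * scale
    return left, right, bottom, top
```

When the head is centred, the frustum reduces to an ordinary symmetric perspective projection; as the head moves, the frustum skews so that the on-screen image behaves like a window into the 3D scene.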

Conclusions and Future Work
In this paper, we first presented a taxonomy of works where smartphones were used to enable distant display interaction and exposed the open challenges of the domain. Based on the key challenges, such as hardware complexity and the need for rich spatial interactions, we presented a new smartphone-based tracking approach which, in contrast to previous systems, enables the transfer of beneficial interactions known only from complex smartphone + X setups to a smartphone-only system. By using the front and back cameras of the smartphone simultaneously, we enabled world-scale motion tracking of the user's hand, head, and body, as well as eye-gaze. Next, we presented a user study to validate our tracking approach and, beyond that, to give further insights into a variety of multi-modal 2D and 3D interaction techniques. In summary, we found that:

• The interaction techniques enabled by our smartphone-based tracking approach achieve very satisfactory performance for 2D as well as 3D interactions, even without refinement and without requiring any complex tracking hardware.
• Using multi-modal refinement techniques, we can improve pointing accuracy by up to a factor of three, beyond the accuracy of the primary techniques, without compromising time.
• Some techniques are better suited for primary input (e.g., Head, Pointer) and others for refinement (e.g., Touch).
• By carefully combining different multi-modal techniques, we can create high-performing techniques that are faster and at the same time more accurate than uni-modal techniques (e.g., Head-Touch in 2D interaction).
• Certain combinations are less intuitive to use and can cause mental effort and slow the interaction down (e.g., Pointer-Hand or Head-Pointer).
• Using the same modality for primary interaction and refinement, in case only a uni-modal system can be provided, is also a good option for improving interaction performance (e.g., Pointer-Pointer or Hand-Hand).
Finally, we demonstrated several demo applications that show our approach in action. Overall, we presented a powerful new tool that can change the way researchers develop, distribute (e.g., via app stores), and test, and finally how we will interact with distant displays in the future using our own smartphones. Furthermore, we showed how state-of-the-art smartphone technology can be used to implement interactions far beyond the touch or device-tilting interactions still mainly considered today.
For future work, we would like to explore how our approach can be improved by supporting additional sensory modalities such as haptic [79] and auditory feedback, or input modalities such as speech input [55,80]. Furthermore, we would like to study the gaze capabilities of our system and body interactions such as walk-by [37] and ego-centric inputs [81]. Finally, we plan to investigate other smartphone AR features, such as novel ways to generate and share the world-map (tracking features) among users via multi-user or online-stored world-maps (e.g., cloud anchors), as well as further user-tracking features like front- and rear-camera skeleton tracking and 3D object scanning.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/mti6100094/s1, Table S1. Taxonomy showing which input modalities and interaction techniques can be enabled by the inside-out (smartphone only) tracking approach for distant display. Indicated with • are tracking abilities enabled by TrackPhone which are currently lacking from known related approaches. Indicated with •• are tracking abilities enabled by TrackPhone which are currently lacking from known related approaches and which we investigated in our user study. Table S2. Taxonomy showing which input modalities and interaction techniques can be enabled by the hybrid (smartphone + additional hardware) tracking approach for distant display. On the bottom, we show related scenarios that could also benefit from our taxonomy and findings. References  are cited in the supplementary materials.