Guidance in Cinematic Virtual Reality-Taxonomy, Research Status and Challenges

Abstract: In Cinematic Virtual Reality (CVR), the viewer of an omnidirectional movie can freely choose the viewing direction when watching a movie. Therefore, traditional techniques in filmmaking for guiding the viewers’ attention cannot be adapted directly to CVR. Practices such as panning or changing the frame are no longer defined by the filmmaker; rather it is the viewer who decides where to look. In some stories, it is necessary to show certain details to the viewer, which should not be missed. At the same time, the freedom of the viewer to look around in the scene should not be destroyed. Therefore, techniques are needed which guide the attention of the spectator to visual information in the scene. Attention guiding also has the potential to improve the general viewing experience, since viewers will be less afraid to miss something when watching an omnidirectional movie where attention-guiding techniques have been applied. In recent years, there has been a lot of research about attention guiding in images, movies, virtual reality, augmented reality and also in CVR. We classify these methods and offer a taxonomy for attention-guiding methods. Discussing the different characteristics, we elaborate the advantages and disadvantages, give recommendations for use cases and apply the taxonomy to several examples of guiding methods.


Introduction
Omnidirectional movies (360° movies) are attracting widespread interest and have many possible applications, e.g., telling stories about exciting experiences and locations in the world, or documenting places of historic interest. Even though the term 360° video is widespread, it does not accurately reflect this medium. On the horizontal level, there are indeed 360° to explore. However, in full-surround videos, there are also vertical angular extents of ±90° to be observed. Therefore, we use the term omnidirectional video, which is often used in scientific literature [1,2].
In Cinematic Virtual Reality (CVR), the viewer watches omnidirectional movies using head-mounted displays (HMD) or other Virtual Reality (VR) devices. Thus, the viewer can feel immersed within the scenes and can freely choose the viewing direction. It is possible that important details are outside the viewer's field of view. For some CVR experiences, this proves to be unproblematic and no additional guiding is necessary: the user discovers a storyline as constructed by the author [3]. In other story constructs, it can be important not to miss some elements. Guiding can prevent the user from becoming lost or confused. Viewers often express a fear of missing out (FOMO) because they do not know where to look [4,5]. In such cases, they wish to be guided in an unobtrusive way, so that they can relax and enjoy the movie without the fear of missing something.
The purposes of movies cover a wide range: entertainment, art, education, marketing or even instructions. How much guidance is needed depends to a large extent on the movie content. In some cases, guiding is advisable for the continuity of the story, for interaction cues, subtitles, educational information, and for social viewing applications. Cinematic VR is not a clear lean-back medium [6], and several key aspects motivate guiding the viewer. Choosing the frame by looking around is a very natural way of interaction which can be enhanced by other interaction possibilities, such as interactive scene changes. Drawing attention to interactive cues can be supported by guiding methods. Another motivation for guiding arises from subtitling. Since the viewer can freely choose the viewing direction, it is difficult to identify the speaker belonging to the subtitle. Here, the viewer can be guided towards the speaker [7]. Furthermore, VR can be used for education [8][9][10][11], for example in museums or classrooms. Since guiding techniques can increase the recall rate [12,13], such methods can support the learning process. Additionally, suitable guiding methods are needed if teachers, museum guides or students want to draw attention to something. Since viewers can feel isolated watching movies via HMD, techniques are needed to support social awareness and communication. For example, guiding techniques can visualise a region of interest or a viewer's own viewport to the co-watchers [14].
Even if viewers often do not notice it, filmmakers direct the gaze and attention of the viewer to relevant aspects in the movies. Cinematic tools such as sounds, lights and movements redirect the attention of the viewer. Studies have shown that the pattern of gaze fixations is often consistent across viewers of Hollywood-style movies [15,16]. In Hollywood-style movies, filmmakers use narrative and editing techniques to strongly guide the viewers to important aspects of a scene, often at the expense of more peripheral details [15]. Since the viewer has more freedom in CVR, most of these methods are less effective.
Images in traditional movies are framed, and in the frame the filmmaker arranges elements for the story. Investigations using eye trackers showed that viewers seldom explore the periphery of the movie image [17] if there are no subtitles. People are likely to look at the central area of a frame or screen. In reality, such a frame does not exist. In Cinematic VR, the position of the frame is determined by the viewer.
To understand how gaze and attention can be guided, we inspected several models from other research fields, such as psychology and biology. We explain the terms used in these models in Chapter 2. This knowledge is important for exploring guiding methods for the field of CVR.
In the last few years, several approaches for guiding in Cinematic VR have been published [18][19][20][21][22]. Since Cinematic VR is a relatively new field of research and since it is very close to virtual and augmented reality, we also looked into concepts of these fields, as well as methods for audio-visual content on flat screens (TV, monitor) and mobile devices. There are several techniques used in other areas which are adaptable to CVR. In Chapter 3, we give an overview of published work.
Each of the techniques is focused on one or two attributes of the guiding techniques. More research is expected in the next years, and this needs consistent terms for discussing these techniques. Clarifying the concepts is also helpful for finding new approaches. To discuss which attribute of a technique was relevant for the success or failure of a method, a single overarching terminology is required. With our taxonomy in Chapter 4, we contribute structure and clarification to work on guiding in CVR.
Applying this taxonomy to known guiding methods, we distinguish between 2D and 3D media. Methods used in traditional filmmaking or for images can be applied in CVR to guide the viewer in the current field of view, as described in Chapter 5. VR and Augmented Reality (AR), as well as CVR, have additional needs for guiding the viewer since objects can be outside of the screen. These guiding methods are described in Chapter 6.
At the end of this work, we discuss how the introduced taxonomy provides support for the design process of guidance in Cinematic VR. The taxonomy fosters understanding of the various attributes of guiding techniques, to find new methods and support filmmakers to select the right methods for their projects.

Terms and Insights from Various Research Fields
Researchers of different areas, such as psychology, biology and computer sciences, are working on topics about attention and gaze directing [23]. Basic knowledge of these areas is necessary to understand guiding methods.

Attention Theory
There are several factors responsible for where someone is looking. On the one side, there are bottom-up factors that characterize the scene. Bottom-up factors are stimuli which attract attention due to their properties such as color or shape and are normative. Methods are normative if they are working in the same way for all people, unless a person has a specific condition such as color blindness. On the other side, there are top-down factors such as task or goal. The performance of such factors can vary between individuals. Depending on the goal, the attention can be space-based (position of an object), feature-based (features of an object) or object-based [23,24].
Movie parts can be explored in a bottom-up (stimulus-driven) or top-down (task-driven) manner. Viewers can be guided by staging and compositional techniques by using lights, colors, and focal depth. The top-down process in particular is responsible for the fact that viewers often do not perceive the cuts in a movie: the task of following the story causes edit blindness [25]. This effect could also be useful for some of the guiding methods.
Cues are able to direct the attention to a target. They can have various properties and positions. Posner [26] showed that viewers detect a target faster if the cue is a feature of the target (e.g., a colored border) than a cue positioned not on the target (e.g., an arrow). He introduced the terms exogenous and endogenous. Exogenous cues are stimulus-driven and work automatically, for example, a flash that attracts attention. They cause an unintentional orientation. Such cues are positioned on the target and can also be auditive or haptic [27]. They are working in a bottom-up manner. Since the reaction to such cues is reflexive, they act fast. However, if there is no interesting target cue, the attention is transient. Endogenous cues are goal-driven and work voluntarily [27]. Often, they are based on a sign that tells where to look or to listen and require first an interpretation, e.g., an arrow. Even if goal-driven attention works slower, it enhances the processing of the event [26] and can be sustained at a location for longer periods. Yarbus [28] showed that eye movements depend on the task. In his experiment, participants watched the same scene after having been asked different questions. The eye movements differed significantly.
One of the most influential models of human visual attention, the feature integration theory, was developed in 1980 by Treisman and Gelade [29]. It explains the role of visual attention for object recognition. When a stimulus is perceived, features are registered early, automatically and in parallel. Objects are identified later in a separate process. In the first step, called the pre-attentive stage, parts of the brain automatically gather information about features such as color and shape. During the second step, the focused attention stage, the whole object is perceived by combining the individual features.
Exogenous and pre-attentive processes are mostly memory-free, subtle and caused by cues at the target (direct cues). In contrast, endogenous and attentive processes are memory-bound, overt and caused by indirect cues, which have to be interpreted (e.g., an arrow). Direct cues can be outstanding features of an object. Healey et al. [30] published a list of two-dimensional outstanding features, complemented by literature which describes tasks using these features. Some examples relevant for CVR are: color, size, curvature, line orientation, intensity, flicker, direction of motion, lighting direction, and intersection. Wolfe and Horwitz [31,32] also described target attributes which can efficiently guide attention: color, orientation, size, depth, motion, and luminance.
The above-mentioned terms are all relevant for characterizing the process of drawing attention. They describe related but different aspects of attention directing and are not orthogonal to each other. Table 1 gives an overview without claim of completeness. For our taxonomy, we chose the cue property, since it best complements our other extracted dimensions considering their use in informing practical design choices. Knowing about work in psychology on explaining attention directing is fundamental to understanding the various guiding methods in HCI and filmmaking. Depending on the movie genre, other guiding methods can be suitable.

Basics about Physiology of the Eyes
For studying gaze directing, knowledge about eye physiology is essential. Signals such as colors or flickers are perceived differently depending on the eye region. Rod cells are located in the periphery. They are responsible for seeing in the dark and are very sensitive to illumination and motion. This means guiding cues at the periphery could be flickering lights or moving elements. The cone cells in the fovea are needed for seeing colors during the day. They are less responsive to light. Colors could be used for drawing the attention to an object over which the viewer's gaze wanders.
The diverse characteristics of periphery and fovea are the reason for the differing perception of flickers depending on the viewing direction. The critical flicker fusion frequency (CFF) is the rate at which the flickering fuses and is perceived as continuous. The CFF is about 22 to 25 Hz for low lights (rod cells). For higher light intensity (cone cells), the CFF increases in proportion to the logarithm of the light intensity (Ferry-Porter law) [33] and increases up to 80 Hz depending on the area of the flickering stimulus (Granit-Harper law) [34]: the larger the flickering stimulus, the higher the CFF. This is the reason for different CFF values in different regions of the eye. Thus, a high-frequency flicker can be visible in the periphery but fused in the fovea. Additionally, the temporal resolution acuity increases for larger flickers [35]. This property can be used for subtle gaze direction. However, the CFF is a very sensitive attribute and depends not only on the region of the eye but also on further factors [34,35]. The CFF can be used for creating stimulus cues and developing guiding methods. Some research has used the CFF for gaze directing in images [36]. However, such methods cannot be easily adapted to movies, since alterations in the movie image can make the flickering stimuli ineffective and the threshold has to be increased, making the method no longer subtle [13]. Even if it is difficult to consider all the mentioned parameters at once and to find CFF thresholds valid for each person, the knowledge about this behaviour has to be taken into account when analysing and developing guiding methods. Flickers can not only be designed as an artificial part of the method, but they can also be included in the movie (e.g., flickering lights).
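The logarithmic relationship named by the Ferry-Porter law can be illustrated with a minimal sketch. The slope and intercept below are placeholder values chosen only for illustration, not measured constants, and the function names are hypothetical. In practice the constants vary with the retinal region, the stimulus size and the observer.

```python
import math

def ferry_porter_cff(luminance, slope_hz_per_decade=10.0, intercept_hz=35.0):
    """Estimate the CFF in Hz as a linear function of log10(luminance).

    slope_hz_per_decade and intercept_hz are hypothetical placeholders,
    not values taken from the literature.
    """
    if luminance <= 0:
        raise ValueError("luminance must be positive")
    return slope_hz_per_decade * math.log10(luminance) + intercept_hz

def is_flicker_visible(flicker_hz, luminance, **constants):
    """A flicker is perceived only while its rate is below the CFF."""
    return flicker_hz < ferry_porter_cff(luminance, **constants)
```

With separate constants for periphery and fovea, the same flicker rate could be reported as visible in one region and fused in the other, which is the effect that flicker-based guiding relies on.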

Guiding Methods in Literature
There is a lot of research about gaze guiding in images, traditional movies, VR, AR, and CVR. Table 2 gives an overview of the methods and the environments in which they were tested. We did not restrict our inspection to methods evaluated in CVR environments, since techniques from other fields could be adaptable to CVR, even if this needs closer inspection and adjustments. Some of them we have already evaluated in previous work, such as diegetic methods [22] and Subtle Gaze Directing (SGD) [13]. Whether guiding methods are needed, and how strong or subtle the guiding should be, depends on the type of movie. More obvious methods perform better but decrease the experience [19]. Since gaze guiding can increase the recall rate of target objects [12,13], the aim of the movie is also relevant for finding the most suitable technique. A guiding method can be important in an educational CVR application, but disturbing in a meditative movie.

Table 2. Overview of guiding methods from research projects and the environments for which they were tested. The last column indicates the name of the method in the paper. In some of the studies, eye-tracking (ET) was used additionally. "GCD" = "Gaze-Contingent Display"; "ET" = "eye-tracking".

In the following, we inspect some of the most important guiding methods from different research fields in more detail:

Diegetic Methods
Some research in recent years has focused on diegetic methods for guiding the viewer in CVR [18,21,22,86]. Diegetic cues are part of the movie, for example moving characters, lights or sounds. The concept of diegesis in film theory was developed by Souriau [87] and afterwards adapted to other fields, e.g., literary theory. Diegetic elements belong to the narrative world. The term diegetic is well-known in film theory and is mostly used for music and other sounds. Diegetic music in a movie is part of the story. It can be heard not only by the viewer (like film music) but also by the characters. Examples are music from a radio in the movie or music played by musicians who are movie characters. For guiding the viewers' attention, the authors can use diegetic cues which are included in the story world: moving protagonists, lights or sounds.
On the other hand, there are non-diegetic cues which are not part of the story, such as arrows, focus assistant tools or forced rotation methods [88,89].

Salience Modulation Technique (SMT)
Mendez et al. [44] described a Salience Modulation Technique (SMT) for directing the viewer's attention to a target object. The original images were analysed with saliency maps, and the material was modulated depending on the results of the analysis. A saliency map shows the saliency values for each region in the image [90,91]. Thus, it was possible to apply minimal changes. This method works on video in real time. However, it can only be used if the target is in the viewer's field of view. Additionally, the method is less effective in environments with moving and blinking objects. Veas et al. [45] investigated SMT regarding modulation awareness, attention, and memory. They showed that SMT can shift the attention to selected targets without the viewer noticing the modulation. Moreover, SMT can increase the recall rate.
There are promising approaches in other fields of research. Gaze-Contingent Displays (GCD) reduce the resolution of peripheral areas to decrease the amount of data [92,93]. The region where the user is looking is determined by eye tracking and shown in high-resolution quality. Since saliency modulation techniques can guide the gaze [1,45,66,77,80,94], the techniques could be transferable to Cinematic VR. However, saliency modulation can only be effective if the modulated region is in the field of view (FoV) of the viewer.

Blurring
Another method is using blurred regions in the images. Smith et al. [64] investigated blurred and non-blurred regions for guiding the viewer's attention. They showed that the viewer tends towards regions with little or no spatial blur if the rest of the image is more blurred. This approach is very similar to methods used in traditional filmmaking. Hata et al. [65] extended this method to visual guidance with unnoticed blur effects for images. They found a threshold at which the viewer notices the blur and showed that viewers could still be guided with blur below this threshold.

Stylistic Rendering
Guiding the viewer in a movie can also be done by stylistic elements: depth of field, colors, brightness, and sharpness. Cole et al. [42] investigated gaze direction in 3D models with stylized focus. They used local variations in shading effects (color saturation and contrast) and line qualities (texture and density) for drawing the viewer's gaze to the emphasized area. Additionally, they applied a dynamic technique: stylized focus pull. Focus pull is a creative camera technique in traditional filmmaking where the focus changes during the shot and so the attention switches from one area to another. In digital editing, focus-pulling can also be added in the post-production by an animated filter effect.

Subtle Gaze Direction (SGD) with Eye Tracking
Subtle gaze direction can guide the gaze without the viewer noticing it. The concept of subtle gaze direction (SGD) was first presented by Bailey et al. [59]. The core of this concept is to modulate the target region while it is in the peripheral area, in order to induce the viewer to look there, and to stop the modulation when the viewer looks in this direction. In this way, the viewer's gaze can be guided without the viewer perceiving the modulation. In the research of Bailey et al. [59], two options were investigated: luminance modulation and warm-cool modulation, both with a rate of 10 Hz in a circular region of approximately 1 cm diameter. An eye tracker was used to observe the viewer's gaze, and the modulation stopped when the viewer's gaze moved in the direction of the target region. It was shown that this technique can effectively guide the user's gaze in still images without the user noticing the modulation.
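The control logic of such a gaze-contingent modulation can be sketched as follows. This is an illustrative sketch only, not the implementation of Bailey et al.: the sinusoidal waveform, the amplitude, the 10-degree angular tolerance, the 2D screen coordinates and all function names are assumptions. Only the 10 Hz rate is taken from the description above.

```python
import math

MODULATION_HZ = 10.0  # modulation rate reported for the SGD experiments [59]

def modulation_value(t_seconds, amplitude=0.1):
    """Luminance offset at time t (sinusoid and amplitude are assumptions)."""
    return amplitude * math.sin(2.0 * math.pi * MODULATION_HZ * t_seconds)

def should_modulate(prev_gaze, gaze, target, max_angle_deg=10.0):
    """Return False once the gaze moves towards the target, True otherwise.

    All points are (x, y) screen coordinates; max_angle_deg is a
    hypothetical tolerance for "moving in the direction of the target".
    """
    move = (gaze[0] - prev_gaze[0], gaze[1] - prev_gaze[1])
    to_target = (target[0] - prev_gaze[0], target[1] - prev_gaze[1])
    move_len = math.hypot(*move)
    target_len = math.hypot(*to_target)
    if target_len == 0.0:
        return False  # gaze already rests on the target
    if move_len == 0.0:
        return True   # no eye movement yet, keep modulating
    cos_angle = (move[0] * to_target[0] + move[1] * to_target[1]) / (move_len * target_len)
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle)) > max_angle_deg
```

A render loop would call `should_modulate` with each new eye-tracker sample and apply `modulation_value` to the target region only while it returns `True`.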
In the experiments of McNamara et al. [60] modulation of the luminance was more effective than warm-cool modulation. They showed that SGD improves the performance for search tasks in images without the participants noticing the modulation. The same method was investigated for guiding in narrative art, with static images [62]. Grogorick et al. extended this method to virtual environments [61]. A luminance modulation was used and a circle shape was dynamically adapted to ellipses for the wide FoV in VR. Additionally, the stimulus was dynamically positioned, so the method could be used also for targets, which are not in the FoV at the beginning of the stimulation. The experiments showed that results of search tasks can be improved for hidden objects.

Subtle Gaze Direction (SGD) with High Frequent Flickers
There is some research on how SGD can be used without an eye tracker. Waldin et al. [36] took advantage of the fact that peripheral vision is more sensitive to high-frequency flickering than foveal vision. Therefore, the critical flicker fusion frequency (CFF) is different in these areas. A signal that flickers in the periphery no longer appears to flicker when the viewer looks at it and the signal is in the fovea: the flicker fuses to a stable image. In this way, no eye-tracking is necessary for stopping the modulation. The experiments used flickers of 60 Hz and 72 Hz, so displays with refresh rates of 120 Hz and 144 Hz were needed. In the first experiment, images with cycles were used, in the second experiment a highly complex image. The method worked effectively in both cases. It seems to be necessary to execute a personal calibration routine to find the size and luminance of the flicker modulation. At the moment, this method cannot be adapted to VR and CVR since the refresh rate of HMD displays (90 Hz) is not high enough. Additionally, flickers are less effective in environments with dynamic changes. As mentioned in Chapter 2, the CFF does not depend only on the region of the eye, and the thresholds are difficult to find.
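The display requirement follows from the fact that one flicker cycle needs at least two display frames, one for each of the two alternating states. A minimal sketch of this check, with hypothetical function names:

```python
def required_refresh_rate_hz(flicker_hz):
    """Minimum display refresh rate: two frames per flicker cycle."""
    return 2.0 * flicker_hz

def display_supports_flicker(display_hz, flicker_hz):
    """True if the display can alternate fast enough for the flicker."""
    return display_hz >= required_refresh_rate_hz(flicker_hz)
```

Under this rule, a 90 Hz HMD could show flickers of at most 45 Hz, which is below the flicker rates used in the experiments described above.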

Off-screen Indicators (Halo, Edge)
Since the viewer in CVR observes only an extract of the film image via HMD, the above methods are not always effective. Depending on the viewing direction, cues can be missed since they are not in the viewer's FoV. Therefore, methods are needed to indicate targets beyond the screen.
One way of visualizing off-screen objects on flat displays is the halo technique [95], where off-screen objects are surrounded by circles which are large enough to be visible at the edge of the display. From the curvature of the circle, the user can infer the position of the object. The halo method is not directly transferable to CVR because the CVR screen is a sphere. To ensure that a circle is still visible on the edge of the display, the center must not be more than 90° away. For points outside this region, for example on the opposite side of the gaze cursor, the circle cannot be made visible in the display. EdgeRadar [96] and Wedge [97] are modifications of this technique to avoid overloading and overlapping. These techniques were adapted to mobile AR [48] and for HMDs [49]. Gruenefeld et al. [98] compared several off-screen object visualization techniques (Arrow, Halo and Wedge) for out-of-view objects in Augmented Reality. In their experiments, the halo and wedge techniques performed best. However, the implemented methods were limited to a 90° area in front of the user and need further adaptation to 360°. EyeSee360 [50] is a visualization technique for out-of-view objects in Augmented Reality which could be adaptable for CVR.

Forced Rotation of the user (SwiVRChair)
Gugenheimer et al. [37] developed a chair which automatically rotates the viewer to look at predefined regions of interest. In their experiment, simulator sickness was very low. This may be caused by the fact that the viewer was physically turned, so that the rotation in the virtual world matched the rotation in the real world. Additionally, the participants needed fewer head movements and could enjoy the VR experience in a more "lean back" way.

Forced Rotation of the VR world
Another possibility of forced guiding is to rotate the scene in a way that the region of interest (RoI) is in the field of view of the viewer. Nielsen et al. [21] compared forced rotation with diegetic guiding. In their experiments, the diegetic method was more helpful and caused higher presence. Lin et al. [20] compared forced rotation (called autopilot) with an arrow which points to the direction of the RoI (called visual guidance). The results depended on the type of movie, but no generally higher sickness was observed for the forced rotation. One reason for simulator sickness is the discrepancy between movements in the real and virtual world [99,100], and so rotating the VR world in front of the user often provokes sickness. There is no consensus on whether rotating a scene causes simulator sickness [20].
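The rotation needed to bring the RoI into view can be sketched as below. This is a simplified illustration under stated assumptions: only the horizontal angle (yaw) is considered, the RoI is reduced to a single direction, and the function names and the 90-degree default FoV are hypothetical.

```python
def shortest_yaw_rotation_deg(view_yaw_deg, roi_yaw_deg):
    """Signed yaw offset in [-180, 180) from the viewing direction to the RoI."""
    return (roi_yaw_deg - view_yaw_deg + 180.0) % 360.0 - 180.0

def forced_rotation_deg(view_yaw_deg, roi_yaw_deg, fov_deg=90.0):
    """Yaw by which to rotate the scene so the RoI is centered in view.

    Returns 0.0 if the RoI already lies within the horizontal FoV.
    """
    offset = shortest_yaw_rotation_deg(view_yaw_deg, roi_yaw_deg)
    if abs(offset) <= fov_deg / 2.0:
        return 0.0
    # Rotating the scene by -offset moves the RoI to the view center.
    return -offset
```

Whether such a rotation is applied abruptly or animated over time is a separate design decision that this sketch leaves open.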

Forced Rotation via Cutting
In traditional filmmaking, cutting can be used to show important details to the viewer. After the cut, the RoI is displayed. The same can be done in CVR: independent of the viewing direction, the viewer will see the RoI after the cut [39,101]. However, it needs to be investigated whether this can cause disorientation if both scenes are in the same location and the viewing direction changes with the cut, similar to the crossing-the-line problem [102] in traditional movies.

Haptic Cues
Kaul et al. [68] developed HapticHead for guidance in virtual and augmented reality. Chang et al. [38] introduced FacePush, a system for haptic signals using HMDs. Their system generates forces on the face of the viewer and was tested for two VR experiences (boxing, diving) and for CVR guiding. In contrast to HapticHead or a vibrotactile headband [69], FacePush indicates the advised direction of rotation (left/right) and not the absolute direction of the RoI. Integrating haptic cues into a story as diegetic cues requires haptic stimuli on parts of the body other than the head. Drones can provide such haptic stimuli [103,104].
Summarizing all these methods, we found several properties of guiding techniques investigated in the literature: subtle, off-screen, forced, diegetic, haptic, and some others. Not all of them are comparable to each other, since they highlight different aspects of the guiding method. It is important to classify these attributes for finding the most relevant and qualified characteristics for guiding methods in CVR. We will do this in the next chapter.

Taxonomy
To find the appropriate techniques for guiding in CVR, we inspected methods for various media: images, movies, virtual and augmented reality (Chapter 3). Inspired by the large number of papers about guiding methods and several taxonomies in Virtual Reality [21,105,106], we analysed these methods and classified their properties, also taking into account that they might be combined across papers. For example, even if one paper emphasised the subtleness of a method, that method might also address visual, auditive or haptic senses in future work, or might be investigated for on- or off-screen targets. In this process, we took into account whether a dimension is needed in CVR.
With our classification, we found seven orthogonal dimensions. Nielsen et al. [21] described three dimensions for attention guiding. One of our dimensions (diegesis) is consistent, two others correspond to their taxonomy (directness and freedom).
Our taxonomy describes the most important attributes which we discovered in the literature and which are relevant in our own work without claim of completeness. It is conceivable that in the future new components should be added depending on the focus of research. Table 3 presents our taxonomy of important dimensions. They will be explained in the following subsections.

Diegetic and Non-Diegetic
Research results show that diegetic methods perform well in Cinematic VR [18,21,22]. For visual diegetic methods, the cue has to be in the field of view, e.g., movements, light, colors. In most cases, the location of the cue (e.g., the color of the target) will be identical to that of the target. One exception is: a protagonist looks or points into a certain direction.
However, one can imagine story parts where no suitable cues in the story world exist. If it is nevertheless necessary to guide the attention to a detail, non-diegetic methods can be applied. Depending on the use case, the method either has to be designed to be noticed easily or to avoid disturbance. Some advantages and disadvantages of diegetic and non-diegetic methods are listed in Table 4.

Visual, Auditive and Haptic
Visual methods are an obvious choice for attention guiding in CVR. Movements, lights and characters are well known for drawing attention in traditional movies [107]. However, these cues can only be used if they are in the field of view. If the viewer is looking in another direction, the cues will not be discovered. For motivating the user to change the viewing direction, sound coming from the direction of the Point of Interest (PoI) is worth considering, since it can be used outside the field of view. Even if the source of a sound is not visible, it is possible to hear it, including the direction of the sound. In real life, a source of noise can get someone to change the viewing direction. The same is true for CVR [22]. Haptic cues can also cause this behavior, and it is worthwhile to discuss them as a guiding method. Some advantages and disadvantages of visual, auditive and haptic methods are listed in Table 5.

On- and Off-screen
Depending on the viewing direction, a PoI can be in the FoV of the viewer or outside of it. To guide the attention to an object on the screen, methods can be discussed which are already investigated for images or traditional movies. We call this on-screen guiding. However, in CVR it can happen that the viewer first has to change the viewing direction for seeing the PoI on the screen. For this case, off-screen methods are needed.
Which of the two methods is used does not depend on the author but on the viewing direction. The author has to decide if both are needed. Visual methods such as saliency modulation of the RoI can only work if the region is in the FoV. If the viewer should not miss it, an off-screen technique has to be added.
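Whether a PoI currently requires on-screen or off-screen guiding can be decided from the angle between the viewing direction and the PoI. The following sketch assumes that directions are given as (yaw, pitch) in degrees and treats the FoV as a circular cone, which is a simplification of a real HMD viewport. The function names and the 90-degree default are hypothetical.

```python
import math

def direction_vector(yaw_deg, pitch_deg):
    """Unit vector for a direction on the sphere (yaw, pitch in degrees)."""
    yaw, pitch = math.radians(yaw_deg), math.radians(pitch_deg)
    return (math.cos(pitch) * math.sin(yaw),
            math.sin(pitch),
            math.cos(pitch) * math.cos(yaw))

def angular_distance_deg(view, poi):
    """Great-circle angle in degrees between two (yaw, pitch) directions."""
    a, b = direction_vector(*view), direction_vector(*poi)
    dot = sum(x * y for x, y in zip(a, b))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding errors
    return math.degrees(math.acos(dot))

def guiding_case(view, poi, fov_deg=90.0):
    """Return "on-screen" if the PoI lies inside the (circular) FoV."""
    inside = angular_distance_deg(view, poi) <= fov_deg / 2.0
    return "on-screen" if inside else "off-screen"
```

For example, `guiding_case((0, 0), (120, 10))` classifies a PoI far to the side of the viewer as an off-screen case, so a method such as saliency modulation alone would not reach the viewer.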

World- and Screen-referenced
Cues in VR can be differentiated between screen-referenced and world-referenced indicators [108,109]. Screen-referenced items are connected to the display and move along with it in case the viewer is turning the head. World-referenced items are connected to the virtual world, in our case to the movie. They stay fixed at their place in the movie world, even if the viewer turns the head. The term "screen-referenced" corresponds to the notion "in-view" used in augmented reality and "world-referenced" matches "in-situ" (e.g., Reference [110]).
Although diegetic cues are always world-referenced, the converse does not hold. A cue added on top of the movie for guiding the viewer, which cannot be seen by the movie characters, is non-diegetic. Screen-referenced cues are always non-diegetic since they cannot be part of the story world (movie). They are well suited for menus. Some advantages and disadvantages of world-referenced and screen-referenced methods are listed in Table 6.

Direct and Indirect Cues
There are two main types of cues: direct and indirect cues [111]. Direct cues are located at the target, e.g., outlines, colors or lights. Indirect cues are based on symbolic information and first have to be interpreted, for example an arrow. The cues do not have to be visual. For example, a sudden bang can work as an auditive direct cue, and a voice that says what can be seen can work as an indirect cue.
Direct cues are mostly stimulus-driven and are based on the characteristics of the scene (exogenous, memory-free), e.g., an abrupt light or sudden movements. They work in a bottom-up manner. For that, regions of interest have to be sufficiently different from the surroundings. Direct cues act fast and are transient and spontaneous [24,112].
Indirect cues involve a conscious effort (endogenous, memory-bound), e.g., interpreting a sign. They work in a top-down manner through cognitive properties such as knowledge, expectations and tasks. Indirect methods are slow, sustained and voluntary [24,112,113]. Some advantages and disadvantages of direct and indirect cues are listed in Table 7.

Subtle and Overt
If the user is not aware of a method, the method is called subtle. In contrast, overt techniques will be noticed by the user [106]. There are several subtle guiding techniques that are based on the physiology of the eye, and the term Subtle Gaze Direction (SGD) is already established for these methods. However, the term subtle is not used consistently in the literature. The term subliminal is also common for stimuli below the threshold of conscious perception [65]. As already mentioned, such thresholds (e.g., CFF) usually depend on several factors and vary between people. Thus, a cue can be subliminal for one person but supraliminal for another.
Subtlety can also be achieved in other ways. Examples are diegetic methods, where elements of the movie guide the gaze. The user notices the cue but is normally not aware of its guiding function. Even though the subtlety of techniques can be defined as a continuum, we agree with Suma et al. [106] in choosing a dichotomous categorization (subtle vs. overt), whereby subliminal is included in subtle.
Some advantages and disadvantages of subtle and overt methods are listed in Table 8.
Table 8 (flattened in extraction; row and column assignment not recoverable). Surviving entries: [59]; easily noticeable [106]; can increase recall rates; not always effective [13]; can be disrupting; suitable for wide story structures; suitable for learning tasks [12,13].

Forced by System, Forced by Reflex and Voluntary
Most of the discussed methods are voluntary: viewers can freely decide whether they follow any guiding cues or explore the scene on their own. However, forced methods can also be applied [20,21,37,70]. There are different ways of forced guiding. On the one hand, the viewer can be rotated, as in SwiVRChair [37]. This has the advantage that the viewer can feel the rotary motion. On the other hand, the VR world/movie can be rotated [20]. These methods force the user to change the viewing direction by technical means. A change of viewing direction can also be forced by reflex, using methods based on the physiological models described in Chapter 2: stimuli can provoke the viewer to change the viewing direction quickly and reflexively. Some advantages and disadvantages of forced and voluntary methods are listed in Table 9.

Usage of the Taxonomy for CVR Guiding Methods from Literature
The previous section described the dimensions we identified. Table 10 shows methods from the literature for guiding in CVR and their attributes in the introduced taxonomy. It shows that fewer subtle methods have been studied so far. We could find only one haptic guiding method for CVR, likely due to the state of technology. Most literature about guiding in CVR concentrates on off-screen guiding, since this is one of the challenges of this medium. It is expected that methods from traditional movies work if the RoI is in the field of view. However, some effort is needed to find the best way to implement them in CVR.
Abbreviations used in Table 10: "dieg." = "diegetic"; "scr." = "screen"; "forced sys" = "forced by system". For "forced sys" methods, no sense is assigned (/), since this kind of method cannot be influenced by the user's senses.
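To make the use of the taxonomy more concrete, the classification of a method can be sketched as a small data structure. This is only an illustrative sketch: the class `GuidingMethod`, the helper `fs` and the function `select` are hypothetical names, the attribute values are those given in the classification lines of the method sections in this paper, and the direct/indirect dimension is omitted because those classification lines do not state it.

```python
from dataclasses import dataclass


def fs(*values):
    """Shorthand for building an immutable set of attribute values."""
    return frozenset(values)


@dataclass(frozen=True)
class GuidingMethod:
    # Each dimension holds a set, since a method may cover several values
    # (e.g., image modulation can be subtle or overt).
    name: str
    diegesis: frozenset
    sense: frozenset
    target: frozenset
    reference: frozenset
    awareness: frozenset
    freedom: frozenset


METHODS = [
    GuidingMethod("diegetic methods", fs("diegetic"), fs("visual", "auditive"),
                  fs("on-screen", "off-screen"), fs("world"),
                  fs("subtle"), fs("voluntary")),
    GuidingMethod("image modulation", fs("non-diegetic"), fs("visual"),
                  fs("on-screen"), fs("world"),
                  fs("subtle", "overt"), fs("voluntary")),
    GuidingMethod("picture-in-picture", fs("non-diegetic"), fs("visual"),
                  fs("off-screen"), fs("screen"),
                  fs("overt"), fs("voluntary")),
    GuidingMethod("radar", fs("non-diegetic"), fs("visual"),
                  fs("off-screen"), fs("screen"),
                  fs("overt"), fs("voluntary")),
]


def select(methods, **required):
    """Return the names of methods that offer every required attribute value."""
    return [m.name for m in methods
            if all(value in getattr(m, dim) for dim, value in required.items())]
```

For example, `select(METHODS, target="off-screen", awareness="subtle")` returns only the diegetic methods, which mirrors the observation that subtle off-screen guiding is rare among the methods discussed.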

Methods for CVR Adapted from Guiding in Traditional Movies and Images (2D)
In this chapter, we present well-known guiding methods used in traditional movies or images and classify them according to the taxonomy of Chapter 4. These methods can be used as on-screen methods in CVR. Using the taxonomy, differences and similarities between methods can be found and unique characteristics identified.

Diegetic Methods
Diegetic, Visual/Auditive, On-Screen/Off-Screen, World-Referenced, Subtle, Voluntary
In traditional filmmaking, movements, sounds or lights included in the story can guide the viewers' attention [107]. Such diegetic techniques can guide the gaze also in Cinematic VR if they are in the field of view (on-screen). Even if most diegetic methods are on-screen techniques, there are some exceptions, where such methods can be applied to off-screen guiding:
• Diegetic visual cues: If a person is looking in a direction out of the screen, the viewer will mostly follow it [18]. The same is true for moving objects [22].
• Diegetic auditive cues: Sounds motivate the user to search for the source of the sound and therefore to change the viewing direction [22].
Diegetic cues are subtle and the viewer is free to follow them. Due to the nature of diegetic methods, they are always world-referenced. Non-diegetic methods can be either world-referenced or screen-referenced. Diegetic cues are mostly direct; however, exceptions are conceivable, for example a person speaking about an object in the room.

Image Modulation
Non-Diegetic, Visual, On-Screen, World-Referenced, Subtle/Overt, Voluntary
Image modulations, such as changing color, saliency or saturation, are mostly non-diegetic, visual, on-screen, world-referenced, and voluntary. Whether the modulation is subtle or overt depends on the degree of modification. Salience modulation, as well as blurring, are effects which are used in traditional movies for guiding viewers' attention. Danieau et al. [19] applied them to CVR and compared four video effects: (1) fading-to-black for the area out of interest, (2) desaturation (like SMT), (3) blurring, and (4) deformation by displaying a wavelike effect on the side of the viewer's field of view. In an informal user study, blur and deformation were not successful in guiding. Comparing fading-to-black, desaturation, no guiding, and forced rotation in the main study, they found a trade-off between the efficiency and noticeability of the effects. They were either disturbing (fading-to-black) or ineffective (desaturation). In some of our user studies, we had similar experiences [13]. Deformations were either not subtle or not working. We think that for blurring methods, the resolution of movies and displays is not high enough for viewers to notice a relevant difference between the blurred and non-blurred areas.

Overlays
Non-Diegetic, Visual, On-Screen/Off-Screen, World/Screen-Referenced, Overt, Voluntary
Overlays, such as arrows, are indirect indicators. They require interpretation to find the right direction. Such methods are well known on flat screens, but are also available in VR environments. Lin et al. [20] compared an arrow with a forced rotation (autopilot). Both methods are very obvious and were evaluated for a sports video and a city tour. Forced rotation is suitable in cases where the viewer needs to see a detail in time, whereas an arrow indicates something or gives hints.

Subtle Gaze Direction
Non-Diegetic, Visual, On-Screen/Off-Screen, World-Referenced, Subtle, Voluntary
Subtle methods for gaze direction (SGD) were investigated for static images on flat displays [12,60]. Such methods can improve the success in search tasks [60] and reduce the error rate in remembering regions and their locations [12]. To extend these methods to CVR or VR, there are several issues to consider:
• A method developed for images works in a static environment. The remaining part of the picture does not change. This is not the case for videos.
• A method developed for clear test environments sometimes might not work for complex images or videos with a lot of objects competing for attention.
• A method developed for a monitor has to be extended for the case where the target object is not in the FoV.
• A method using flickering must take into account the frame rate of the movie and the HMD.
We tested subtle gaze directing for CVR [13] and achieved results similar to those of Danieau et al. for video effects [19]: searching for the right parameters resulted in a technique that either was not subtle or did not work well. SGD, which works well for still images, is difficult to adapt to CVR. That may be because of the available hardware. Depending on the type of SGD used, high display frequencies or a wide field of view are necessary. For exploiting the different sensory perception of fovea and periphery of the eye, the FoV of an HMD does not seem to be large enough. To adapt high-frequency subtle flickering methods, the display frequency of the HMD needs to be higher. In addition, movements in the movie might render subtle cues ineffective. However, we found a higher recall rate with (non-subtle) flickering.
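The dependency of flickering methods on frame rate can be illustrated with a short calculation. The sketch below rests on a simplifying assumption that is not taken from the cited studies: one flicker cycle needs at least one "on" frame and one "off" frame, so the highest displayable flicker frequency is half the effective frame rate. The function names and the flag `baked_into_video` (cue encoded in the movie frames rather than rendered by the player) are hypothetical.

```python
def max_flicker_hz(movie_fps, hmd_refresh_hz, baked_into_video=True):
    """Upper bound for a displayable on/off flicker frequency in Hz.

    Assumption: a cue baked into the video is limited by the slower of
    movie frame rate and HMD refresh rate; a cue rendered by the player
    on top of the movie is limited by the HMD refresh rate only.
    """
    if baked_into_video:
        effective_rate = min(movie_fps, hmd_refresh_hz)
    else:
        effective_rate = hmd_refresh_hz
    # One full cycle consists of at least one "on" and one "off" frame.
    return effective_rate / 2.0


def can_display_flicker(target_hz, movie_fps, hmd_refresh_hz,
                        baked_into_video=True):
    """Check whether a desired flicker frequency can be shown at all."""
    return target_hz <= max_flicker_hz(movie_fps, hmd_refresh_hz,
                                       baked_into_video)
```

Under these assumptions, a 30 fps movie on a 90 Hz HMD allows a baked-in flicker of at most 15 Hz, while a cue rendered by the player could reach 45 Hz. This is one way to see why high-frequency subtle flickering is hard to realize with common movie frame rates.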

Methods for CVR Adapted from VR and AR (3D)
Following the above taxonomy, we present several known methods from VR and AR and classify them. For each method, we discuss if and how it can be adapted to CVR.  [20,51,53,54]. They work well but can be disturbing. Augmented reality methods such as the attention funnel [81,114] or ParaFrustum [82] could be suitable for instructional or educational applications. Both are realized by drawing augmented elements that start at the viewer's eyes and lead to the region of interest. The methods are overt and usable for only one PoI. Since the overlay partially covers the RoI, they are less suitable for CVR movie experiences.

Stylistic Rendering
Non-Diegetic, Visual, On-Screen, World-Referenced, Subtle/Overt, Voluntary
Stylistic rendering methods [42] for 3D models are similar to the image modulation methods described in 5.2 and can be adapted to CVR. Finding the perfect rendering style for each target can be a creative part of CVR filmmaking. However, to be noticed, such an effect has to be in the field of view (on-screen).

Picture-in-Picture Displays
Non-Diegetic, Visual, Off-Screen, Screen-Referenced, Overt, Voluntary
All methods described so far indicate the direction of the RoI. In contrast, showing the RoI in a small inset window on the screen (Figure 1b, example from our work) offers the advantage that the viewer knows what to expect and thus can decide whether to change the viewing direction. One disadvantage of this method is that the window covers a part of the content. The other drawback, the missing information about the position of the RoI, can be solved by placing the display on the side nearest to the RoI.
Lin et al. [20] evaluated this method for omnidirectional movies on mobile phones. The method outperformed arrow-based guidance in most aspects, even though it occupied more space.
Figure 1. Two overlay methods for CVR (images taken from our work): (a) RoI shown by an arrow; (b) RoI shown by a display (PiP).

Radar
Non-Diegetic, Visual, Off-Screen, Screen-Referenced, Overt, Voluntary
Methods used in collision avoidance systems of sailplanes could be used to show the RoI. Such systems show from which direction another sailplane is approaching. We implemented a method to indicate the direction of the PoI (Figure 2a). The bar at the bottom shows whether the PoI is on the right or on the left side. The bar on the right shows whether the PoI is higher or lower than the viewer's own viewing direction. Another example can be seen in Figure 2b, where the direction is shown by a circle and the height by a bar.
Figure 2. Collision avoidance methods of sailplanes transferred to indicate the direction of the PoI: (a) the bar at the bottom shows whether the PoI is to the right or left, and the bar on the right shows whether the PoI is higher or lower than the viewer's own viewing direction; (b) the direction is shown by a circle and the height by a bar (radar circle method). In both cases, the PoI is on the left side behind the viewer, a bit below the viewer's own viewing direction.
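The values such a radar-style indicator has to display can be sketched as follows. This is a minimal illustration and not the implementation used for Figure 2: it assumes yaw and pitch angles in degrees with yaw increasing to the right and pitch increasing upwards, and the function name `radar_indicator` and the normalization to the range -1 to 1 are hypothetical choices.

```python
def radar_indicator(view_yaw, view_pitch, poi_yaw, poi_pitch):
    """Compute the values shown by a horizontal and a vertical indicator bar."""
    # Signed horizontal offset, wrapped to [-180, 180) so that the
    # indicator always points along the shorter way round.
    d_yaw = (poi_yaw - view_yaw + 180.0) % 360.0 - 180.0
    d_pitch = poi_pitch - view_pitch
    return {
        # Position on the bottom bar: -1 (far left) to 1 (far right).
        "horizontal": d_yaw / 180.0,
        # Position on the side bar: -1 (far below) to 1 (far above), clamped.
        "vertical": max(-1.0, min(1.0, d_pitch / 90.0)),
        "side": "right" if d_yaw > 0 else "left" if d_yaw < 0 else "ahead",
        "height": "higher" if d_pitch > 0 else "lower" if d_pitch < 0 else "level",
        # More than 90 degrees away horizontally means behind the viewer.
        "behind": abs(d_yaw) > 90.0,
    }
```

As a usage example matching the situation in Figure 2, a viewer looking straight ahead (yaw 0, pitch 0) with a PoI at yaw 225 and pitch -10 obtains a PoI on the left side, behind the viewer and slightly lower than the viewing direction.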

Practical Considerations when Applying the Taxonomy
The introduced taxonomy supports researchers and practitioners in designing guiding methods for Cinematic VR. This chapter connects the dimensions with the design questions of the filmmaker. When developing a CVR experience, filmmakers know their material and can think about the desired effect of a guiding method and about its attributes. To make decisions about the most appropriate guiding technique, various aspects are relevant, each of which corresponds to a question the filmmaker has to answer. Answering these questions is the first step in finding the right technique.
Tempo: In traditional movies, the filmmaker can determine the pace by cutting and showing image sections for a short or long time. In CVR, the user explores the scene by changing the viewing direction at their own tempo. The filmmaker can influence this process by choosing the right guiding method. To affect the pace of a movie, one can choose between slow- and fast-acting methods. For example, forced methods work very fast, but they can destroy the experience. However, CVR movies are imaginable in which a forced rotation is part of the experience.
Effectiveness: Whether effectiveness is more important than presence depends on the purpose of the movie. More obvious techniques are often more effective, but they can destroy the movie experience. For a relaxing movie event, it might be less important to ensure that viewers always follow the guiding. In that case, diegetic, voluntary methods should be preferred. In contrast, for a sports event, it can be essential to see the details at the right moment. Here, forced or stimulus-driven methods could be a good choice. For instructional movies, an arrow fits well. It is an overt technique using an indirect cue that is hard to overlook, but viewers remain free to follow it or not.
Recall Rate: If the CVR movie is used for learning applications, the recall rate can be important, and more obvious methods can be applied, such as overt, non-diegetic ones. In other genres, too, it can be relevant that the viewer remembers details of the story, but there the indicator should not be so obvious. Based on attention theory, voluntary processes are memory-bound, keep the attention longer, and thus increase the recall rate. Modulation techniques (e.g., stylistic rendering, SMT) are also applicable since they influence memory [45].
Number of RoIs: The above methods have mostly been evaluated for a single RoI. Especially for nonlinear storytelling, more than one RoI may be required at the same time. Not all of the mentioned methods are able to handle this case. Although some methods are able to manage more than one RoI, this can lead to overcrowding the display and overtaxing the viewer. More research is needed to find methods and adjustments that handle more than one RoI simultaneously.
Covering: Diegetic or modulation methods do not cover movie content. When using indicators, such as arrows or halos, parts of the movie are not visible. This can be disturbing. However, purposes are conceivable where the indicator is more important than complete visibility, e.g., for instruction videos. Choosing the right parameters (size or color) can make the cue more obvious/effective or subtler. World-referenced overlays (e.g., arrows) stay at the same place in the movie and cover an area permanently. Screen-referenced overlays (such as signs at the display edge) change the covered area if the head is moving.
Clutter: To assess whether a method will be suitable, the complexity of the image must be considered. Cluttered images with a lot of details require clear and obvious techniques. If the image is clear, more subtle methods can work. The same is true for audio: if there are a lot of sounds in the movie, it can be difficult to follow a spatial audio signal that should guide to an RoI.
Experience: It is not always necessary to make guiding imperceptible. It can also be part of the experience, in the same way as scene transitions influence the movie. The technique can affect the style, the pace and the atmosphere of a movie.
Overall, the taxonomy above provides support in the process of finding the most suitable technique to address these practical questions.

Conclusions
Since in Cinematic VR the viewer can freely choose the viewing direction, the selection of the visible image section is no longer defined by the filmmaker, but by the viewer. This can cause problems if the viewer misses an important detail of the story. Also, the viewing experience can suffer because the viewer is afraid to miss something. Additionally, an important means of influencing style and pace no longer rests exclusively in the hands of the filmmaker. On the other hand, CVR provides a lot of new opportunities. With the added spatial component, non-linear, interactive stories that are intuitive for the viewer can be realized. Guiding the viewer in such experiences is not only a requirement, but also a chance for the filmmaker to influence the style and pace in novel ways. It may be used like transitions in traditional movies: the filmmaker chooses the best-fitting technique for each Region of Interest. Changing between methods within a movie could also be useful, e.g., for changing the pace of certain movie sections.
Based on previous literature, we described a taxonomy for guiding methods. Our focus was on CVR, yet most dimensions are transferable to augmented and virtual reality. Classifying guiding methods according to the taxonomy assists researchers and practitioners in finding the right technique for different requirements. We listed the advantages and disadvantages of attributes along an identified set of key dimensions and illustrated each dimension with concrete examples of guiding methods.
This taxonomy can help to understand the various characteristics of guiding techniques, to find new methods which have not yet been analysed, and to support filmmakers in finding the right methods for their projects.