Interactions in Augmented and Mixed Reality: An Overview

Abstract: “Interaction” represents a critical term in the augmented and mixed reality ecosystem. Today, in mixed reality environments and applications, interaction occupies the joint space between any combination of humans, physical environment, and computers. Although interaction methods and techniques have been extensively examined in recent decades in the field of human-computer interaction, they still should be reidentified in the context of immersive realities. The latest technological advancements in sensors, processing power and technologies, including the internet of things and the 5G GSM network, led to innovative and advanced input methods and enforced computer environmental perception. For example, ubiquitous sensors under a high-speed GSM network may enhance mobile users’ interactions with physical or virtual objects. As technological advancements emerge, researchers create umbrella terms to define their work, such as multimodal, tangible, and collaborative interactions. However, although they serve their purpose, various naming trends overlap in terminology, diverge in definitions, and lack modality and conceptual framework classifications. This paper presents a modality-based interaction-oriented diagram for researchers to position their work and defines taxonomy ground rules to expand and adjust this diagram when novel interaction approaches emerge.


Introduction
Significant efforts have been spent, in both basic and applied research, to highlight the importance of human-computer interaction (HCI) on the end-user experience in augmented reality (AR) and mixed reality (MR) environments [1,2]. To a large extent, research focuses on the user's capability to perform tasks and interact with the virtual world, assisted by various functions and control systems. User-centered system design (UCSD), first described by Kling [3] and later by Norman [4], generally focuses on the user's understanding of a system. It examines what the user expects to happen, and how to perform a task or recover from an error, presenting HCI as a communicative and collaborative process between humans and machines. Exploiting immersive realities and the UCSD has radically changed the way humans perform everyday tasks or perceive historical and cultural information. Mixed and augmented reality finally occupy significant space in our daily routine. The achievement of several historical milestones, from routing [5] to entertainment [6], and from social media [7] to engineering and remote collaboration [8], showcases the promising future of AR and MR.
Technological achievements in AR and MR environments have made possible interactive visualizations of previously unexplored virtual and real-world combinations. According to Milgram et al. [9], the MR environment is where the real and the virtual coexist. Coutrix et al. [10] described a mixed object as a real object with a virtual equivalent. Recently, Evangelidis et al. [11] defined the MR ecosystem, strictly separating it from AR, by introducing geospatial modalities and implementing the concept of mixed objects, thus achieving spatial and context awareness among realities. Over the past 25 years, approximately since the introduction of the well-known reality-virtuality continuum [9], published research work and applications have profoundly changed the way humans perceive and interact with historical [12][13][14][15][16], future [17][18][19][20][21], and imaginary [22][23][24] reality scenarios. However, the latest research findings and innovations in review papers regarding interaction methods are not classified under a well-defined framework, thus leading to misconnections and ambiguities. For example, having the taste, smell, and haptic modality enclosed by the sensor-based modality, a system that utilizes all of them would still be, by definition, unimodal. Bunt [25] and Quek [26] both stated that the interaction between this world and humans is naturally multimodal [27]; therefore, overviewing AR and MR HCI in the light of how humans perceive reality, through sensations, might improve creative thinking and provoke novel interactions. Common categorizations are based on the field of application (tourism, architecture, medicine), the device of application (mobile, desktop) [28], or umbrella terms (multimodal, tangible, collaborative) [29], without focusing on the modality or the context of interaction.
As a result, an inaccurate representation of available interaction methods can affect the creative thinking of future researchers and act counterproductively concerning their efforts in the field of HCI. Previous attempts at listing or categorizing the components of HCI for augmented and mixed reality [30][31][32][33][34][35][36][37][38][39] reveal that a clear, in-depth taxonomy of interactions either does not exist or is not widely known to the scientific community.
Mixed reality is an ever-evolving field, and novel approaches and innovative applications could delineate new interaction methods. HCI is associated with established theories, such as the theory of action described in Norman's book The Design of Everyday Things [40], the theory of communication [41], the theory of modalities [42], and the theory of perception [43]. Interestingly, although the theory of modalities positions haptic together with audio and visual modalities, the categorizations mentioned above include it within the sensor-based modality, together with taste and smell. The taxonomy proposed herewith is much more of an overview and a first attempt to expose all modalities with their interactions in the first level. We expect that the taxonomy we propose will better organize existing interaction methods, present a complete view of what has been accomplished so far and define a set of ground rules regarding naming conventions. Pamparau and Vatavu, in a recent position paper [44], stressed several issues to the community related to user experience (UX) and HCI in AR and MR environments, one of which was to structure design knowledge for the UX of interactions. A well-defined classification framework needs to exist for this to happen, and the interaction challenges need to be known to test UX [45]. In our understanding, interactions in immersive environments reveal three fundamental challenges. First, users need to naturally interact with machines to perform main interaction tasks, such as selection, manipulation, navigation, and system control. The interaction method should be as intuitive as possible to produce natural interactions, as any disturbance of the user's attention may detract from the immersive experience [46]. Secondly, current technological limitations in positional accuracy in such hybrid environments may cause spatial misalignments [47,48] or dislocations [49]. 
Accurately determining the end user's position [50] is crucial for successfully visualizing an MR environment, and technical challenges regarding coverage emerge. Finally, for interactions to be as "real" as possible, there should be a semantic context connection among involved realities. Based on the above, this paper aims to analyze research in HCI for immersive realities and mobile environments and to give an overview of what has been done so far by presenting a classified representation as part of a modality-based and interaction-oriented diagram of the reviewed work. We present a new approach for classifying HCI for immersive realities by interrelating modalities (audio-based, visual-based, haptic-based, and sensor-based) with their context and methods. The main scope of this study is to present the distinct interaction methods and organize them in a well-defined and structured classification that provides more depth and accuracy concerning how modalities are being used, in what context, and with what method. This innovative classification model is the outcome of a thorough study of pertinent research, as well as the result of a methodological investigation of the optimal way to structure the categories so that the approaches employed in the surveyed papers would be presented in a consistent, precise, and more meaningful way. This paper is structured as follows: Section 2 presents a detailed explanation of the taxonomy ground rules and highlights the categories with their definitions based on which this paper organizes research findings. Then, Section 3 introduces a brief review of the state-of-the-art interactions and an in-depth review of the visual, audio, haptic, and sensor-based modalities. At this point, it is worth mentioning that, although we recognize the taste and smell modalities, they are not part of the current review.
Therefore, a detailed review for these modalities remains a research gap as far as our categorization approach is concerned. Finally, the last section concludes the paper.

Conceptual Framework Definition
A basic rule adopted is that there should exist one modality for each of the five human senses. That being established, the visual-based, audio-based, haptic-based, taste-based, and smell-based modalities have been created. Any other interaction unrelated to the modalities mentioned above is included in the sensor-based modality (Section 3.4). Each modality contains groups of contexts (Section 2.3), and ideally, the context categories of a single modality should not overlap. Although this is a debatable issue, the decision taken is that whenever an overlap occurs, researchers should analyze a new category semantically to justify its creation [51]. The taxonomy should always be expandable and adjustable when a new context group is identified. The context groups should be simple and broad enough so that someone can raise specific questions regarding common problems or testing methodologies. Issues to avoid when categorizing include a lack of focus, clarity, inspiration, or creativity; redundant ideas; and the inability to locate the ideal case, identify challenges, or induce lateral thinking.
Each context should define its own set of tests to identify the efficacy of the interaction methods it includes (Section 2.4). For example, new users of an interaction method that involves equipment may be satisfied with the overall experience when completing a group of tasks for a specific period, whereas those who have used the equipment for years may find it difficult to operate [52]. Therefore, long-term usage of equipment should be a prerequisite when testing such interactions. In addition, the keywords for naming methods should be checked for adoption by the research community. For example, the keyword gaze detection used in the previous classification [30] is replaced with eye gaze detection as it is more informative. Finally, some methods might be able to be positioned in more than one context group. For example, location-aware sound effects can be placed in both location-based and sound-based contexts. In this case, the researchers will have to define which group better characterizes their work, and if this is not possible, they should position their work in all relevant context groups. A representation of the taxonomy levels is presented in the Figure 1 diagram. Future researchers may expand this diagram to include the "techniques" level, determining all techniques used to utilize a specific interaction method.
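The ground rules above can be illustrated with a minimal sketch of the three taxonomy levels (modality, context, method) as a nested data structure. The class name `Taxonomy`, its methods, and the modality to which each example context is attached are assumptions made for illustration only; they are not part of the proposed diagram.

```python
from collections import defaultdict

# One modality per human sense, plus the sensor-based modality for the rest.
MODALITIES = (
    "visual-based", "audio-based", "haptic-based",
    "taste-based", "smell-based", "sensor-based",
)


class Taxonomy:
    """Hypothetical three-level structure: modality -> context -> methods."""

    def __init__(self):
        # Context groups are created on demand, so the tree stays expandable.
        self._tree = {m: defaultdict(set) for m in MODALITIES}

    def add_method(self, modality, context, method):
        """Place a method in a context group of a known modality."""
        if modality not in self._tree:
            raise ValueError(f"unknown modality: {modality}")
        self._tree[modality][context].add(method)

    def positions(self, method):
        """Return every (modality, context) pair that contains the method."""
        return sorted(
            (modality, context)
            for modality, contexts in self._tree.items()
            for context, methods in contexts.items()
            if method in methods
        )


taxonomy = Taxonomy()
taxonomy.add_method("visual-based", "gesture-based", "eye gaze detection")
# A method that fits two context groups is positioned in both of them.
# The modalities chosen for these two contexts are illustrative assumptions.
taxonomy.add_method("audio-based", "sound-based", "location-aware sound effects")
taxonomy.add_method("sensor-based", "location-based", "location-aware sound effects")

print(taxonomy.positions("location-aware sound effects"))
```

Because context groups are created lazily, a newly identified context can be added without restructuring the existing entries, which mirrors the requirement that the taxonomy remain expandable and adjustable.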

Interaction Tasks
Mixed reality environments contain four basic interaction tasks. As also stated in Bachmann et al.'s work in 2018 [53], these are selection, manipulation, system control, and navigation.


• Selection: Refers to the task of selecting an object to perform actions, such as retrieving information or storing it as an argument for another action [54].
• Manipulation: Provides the user with the capability of changing any of the object's attributes, e.g., scale or position [55].
• Navigation: Provides the user with the capability of navigating in an immersive environment by changing position or orientation [56].
• System control: Refers to the user's capability of performing changes in the system state, such as menu-based changes [57].

Modality
A user interface [57] is based on information inputs and outputs (IO) via bidirectional human-computer communication channels. As input or output, we consider any human actions that convey meaning for interaction to a computer or any intentional augmentation or alteration of the human perceptual modalities. Every independent channel is called a modality, and every system that uses only one of these channels for IO is called a unimodal system. Systems that incorporate more than one of the modalities above are called multimodal. We define six modalities that allow information IO: visual-based, audio-based, haptic-based, taste-based, smell-based, and sensor-based modalities. As the taste and smell-based modalities are not reviewed in this paper, we do not provide definitions for them. Therefore, we define the visual, audio, haptic, and sensor-based modalities as follows:
• Visual-based: The visual-based modality includes any state changes that a camera sensor can capture, convey meaning and can be used to describe the user's intention to interact or present visual feedback.
• Audio-based: The audio-based modality contains all actions and feedback that include sound perception and sound stimuli.
• Haptic-based: The haptic-based modality defines all interactions that can be perceived through the sense of touch or executed through graspable-tangible objects.
• Sensor-based: Finally, the sensor-based modality includes all interactions requiring any sensor to capture information regarding an action or transmit feedback back to the user, besides visual, auditory, haptic, taste, and smell inputs/outputs. An example of this modality includes the pressure detection method.
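The unimodal/multimodal distinction defined above can be expressed as a short sketch. The function name `classify_system` and the `LEGACY_GROUPING` table are hypothetical; the table only encodes the earlier grouping criticized in the Introduction, in which haptic, taste, and smell are enclosed by the sensor-based modality.

```python
MODALITIES = frozenset({
    "visual-based", "audio-based", "haptic-based",
    "taste-based", "smell-based", "sensor-based",
})

# Earlier grouping discussed in the Introduction (illustrative encoding):
# haptic, taste, and smell are folded into the sensor-based modality.
LEGACY_GROUPING = {
    "haptic-based": "sensor-based",
    "taste-based": "sensor-based",
    "smell-based": "sensor-based",
}


def classify_system(channels, grouping=None):
    """Return 'unimodal' or 'multimodal' for the IO channels a system uses.

    `grouping` optionally remaps modalities before counting, which makes it
    possible to compare two classification schemes on the same system.
    """
    grouping = grouping or {}
    unknown = set(channels) - MODALITIES
    if unknown:
        raise ValueError(f"unknown modalities: {sorted(unknown)}")
    used = {grouping.get(channel, channel) for channel in channels}
    if not used:
        raise ValueError("a system needs at least one IO modality")
    return "unimodal" if len(used) == 1 else "multimodal"


system = ["haptic-based", "taste-based", "smell-based"]
print(classify_system(system))                   # six-modality scheme
print(classify_system(system, LEGACY_GROUPING))  # earlier grouping
```

Under the six-modality scheme the example system counts three independent channels, whereas the earlier grouping collapses them into a single sensor-based channel.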

Context
Context defines the conceptual framework through keywords used by researchers to describe their work in publications. A context is a subcategory of a modality that abstractly expresses specific interactions without fully explaining them. The usually adopted pattern comprises a noun followed by the word "based" to describe the base context, such as gesture-based, speech-based, or touch-based. At this point, it is worth noticing that the word "based" follows both a modality and a context, resulting in a repetitive naming convention for an interaction method (covered in the following subsection). We have decided to keep it this way in the taxonomical table, as the research community widely adopts these terms, and in most cases, these are used separately. With that being said, "a method of interaction that belongs in the marker-based context of the visual-based modality" is a sentence that conveys meaning and immediately positions the method in our proposed table.

Method
An interaction method is a keyword combination that describes or includes a series of coordinated procedures used to accomplish an interaction task. A pattern that researchers and inventors usually adopt to describe their methods comprises two parts: the first is the base medium of interaction, and the second is a verb or a noun that defines the action to be performed. For example, eye gaze detection and body posture analysis are simple and easy to understand. However, "optical mouse sensor attached to finger" could be renamed as a finger motion tracking method in the motion-based context of the sensor-based modality. The techniques used to exploit a method should be avoided as part of the naming, as the resulting method names (e.g., YOLO hand gesture recognition and R-CNN hand gesture recognition) violate several ground rules previously defined, such as avoiding redundant ideas and inducing lateral thinking. Some methods presented in the final model-based diagram (e.g., fiducial marker recognition and infrared marker recognition) do overlap in concept. Nevertheless, they are included as they induce lateral thinking.

Research Results
Before analyzing each modality, a brief state-of-the-art review is presented to identify current trends. In a recent study, Rokhsaritalemi et al. stressed that "mixed reality is an emerging technology that deals with maximum user interaction in the real world compared to other similar technologies" [58]. The impact of augmented reality on a user's satisfaction in numerous applications in the fields of engineering [59], archaeology [60], medicine [61], or education [62] cannot be questioned. Nevertheless, separating the physical world from the virtual has a significant impact on the user's immersion level. This enforced the development of mixed reality environments and upgraded augmented reality interactions to become more natural and include more aspects of the physical world. Chen et al. in 2017 [63] proposed a framework to boost the user's immersion experience in augmented reality through material-aware interactions by training a neural network for material recognition. In 2018, Chen et al. [64] mentioned that semantic understandings of the scene are necessary for mixed reality interactions to become realistic. The importance of structural information of physical objects is inextricably connected with proper augmentation and placement, but natural interactions between virtual and real objects require semantic knowledge. We have noticed a lack of publications related to material-aware MR interactions, and it seems that more research needs to be done in this field. Context-aware and material-aware interactions can lead to realistic physically-based sound propagation and rendering. In 2018 [65], Serafin et al., in their work on sonic interactions in virtual reality, concluded that the auditory outcomes of sound synthesis are not yet indistinguishable from real ones. Sonic interactions involve the techniques of sound propagation [66] and binaural rendering for binaural hearing [67] to provide immersive action and environmental sounds. 
An example includes the sound of the footsteps of a virtual man walking in a grass field. Through context-aware and material-aware interactions, the sound propagation algorithm would "consider" the material of the grass and the open area to generate the sound. The same material would sound differently inside a cave.

Visual-Based Modality
The visual-based modality includes any state changes that a camera sensor can capture, convey meaning, and can be used to describe the user's intention to interact or present visual feedback. Figure 2 visualizes all the contexts and methods identified for the visual-based modality. A detailed review for each context is presented in the following subsections.

A mixed reality environment can be interacted with or visually perceived in two main ways. As previously stated, an MR environment is where the real and the virtual coexist. Thus, two of the main ways of visual coexistence are [68]:
• Optical see-through systems (OST): by displaying digital objects in a semi-transparent screen where real objects can be directly perceived through the glass.
• Video see-through systems (VST): by displaying digital objects on a screen together with real objects captured by a camera sensor, commonly used by smartphones in AR.

Gesture-Based
When interacting through the gesture-based context, computers get visual input and recognize body language without any other sensory information. Eye gaze detection is included in this context of interaction. It is characterized by two major issues: (a) avoidance of unintended actions and (b) limitations related to eye movement accuracy. The gaze and dwell interaction model [69], as described by Microsoft [70], is used for this method: the user looks at an object and keeps staring at it to select it. It has a high accessibility rate [71] as even severely constrained users can perform this interaction. In 2017, Piumsomboon et al. explored the advantages of this interaction by exploiting some of the basic functionalities of the human eye [72]. This method takes advantage of the eye inertial and the natural vestibulo-ocular reflex (the ability to lock a target regardless of head movements) and is used with head-mounted displays. The authors conclude that more research should be done analyzing the collected large-scale eye movement tracking data and improving user experience. Jacob [73] examined techniques and challenges related to eye movement interactions. He proposed an approach based on separating the actual data (eye movements) from noise and then estimating the user's intention of interaction. An interesting interaction method related to the eyes is pupil dilation detection. In [74], the authors used this method as a reliable indicator of cognitive workload.
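The gaze and dwell model described above can be sketched as a small state machine: a selection fires once the gaze has rested on the same object for a minimum time. The class name `DwellSelector`, the one-second threshold, and the sample stream are assumptions made for illustration; they are not taken from [69] or [70].

```python
class DwellSelector:
    """Hypothetical gaze-and-dwell selector driven by timestamped gaze samples."""

    def __init__(self, dwell_seconds=1.0):
        self.dwell_seconds = dwell_seconds  # assumed threshold, tune per application
        self._target = None   # object currently under the gaze, or None
        self._since = 0.0     # time at which the gaze landed on that object
        self._fired = False   # avoids repeated selections during one long stare

    def update(self, target, timestamp):
        """Feed one gaze sample; return the selected object or None."""
        if target != self._target:
            # Gaze moved to another object (or to empty space): restart the timer.
            self._target = target
            self._since = timestamp
            self._fired = False
            return None
        dwelled = timestamp - self._since
        if target is not None and not self._fired and dwelled >= self.dwell_seconds:
            self._fired = True
            return target
        return None


# Illustrative stream of (timestamp in seconds, gazed object) samples.
samples = [
    (0.0, "lamp"), (0.4, "lamp"),   # short glance, below the threshold
    (0.6, "door"), (1.0, "door"), (1.7, "door"), (2.0, "door"),
]
selector = DwellSelector(dwell_seconds=1.0)
selections = [
    hit for t, obj in samples
    if (hit := selector.update(obj, t)) is not None
]
print(selections)
```

The dwell threshold is the main lever against the first issue listed above: a longer dwell reduces unintended selections at the cost of slower interaction.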
In 2017, Samara et al. [75] performed task classification in HCI via the visual channel. The authors combined facial expression analysis and eye-gaze metrics for computer-based task classification. The outcome was that these two interaction methods combined resulted in higher classification accuracy. Facial-expression interactions exploit the face detection [76] method and can be used in controlling a user's virtual avatar facial expressions [77], in facial emotion analysis [78], and in emotion recognition [79]. Face recognition [80] also exploits the face detection method but is usually used to identify a person by facial characteristics. In 2017, Mehta et al. [81] conducted a review on human emotion recognition which could assist in teaching social intelligence to machines. Techniques commonly used to exploit emotion recognition and emotion analysis methods include, without being limited to, geometric feature-based processes [82] and machine learning [83].
In the context of emotion analysis and human psychology, another method interprets body language. Body movements in immersive realities, and how they enhance the sense of presence, are examined in depth in the work of Slater and Usoh [84]. In 2020, Lee [85] applied the Kinect Skeletal Tracking (KST) System in an augmented reality application to improve the social interaction of children with autism spectrum disorder (ASD). They used a Kinect camera to scan the therapist's body gestures and visualized them on a 3D virtual character. In another work [86], the authors used a webcam exploiting the user's body movement tracking method to interact with the AR system. The user had to do simple tasks like jumping, stretching, or boxing to "hit" the correct answers presented in the virtual world. Such methods utilizing physical exercise make the learning process more appealing to students of younger ages. Umeda et al. [87] exploited the body posture analysis method to locate a person's two hands perceived by a Firewire camera and superimpose artificial fire on them in real time. Algorithms of hand tracking and gesture recognition are used to detect the gestures through a camera sensor and perform functions accordingly. In 2016, Yousefi et al. [88] introduced a solution for real-time 3D hand gesture recognition. They used the embedded 2D camera of a mobile device, supporting 2D and 3D tracking, joint analysis and 10+ degrees of freedom. They accomplished these features by pre-processing the image to segment the hand from the background and matching the normalized binary vector outcome with gestures already stored in a database. Some of the issues to be further examined as regards the hand-gestures interaction method are (a) the efficiency of gesture recognition algorithms, (b) the efficiency in low-contrast environments, (c) high consumption of computing resources, (d) lack of haptic feedback, and (e) hand occlusion with the augmented scene.
However, the applications of hand gestures in real-life scenarios are unlimited due to the naturalness of the interaction. MixFab [89] is an example of how this interaction method can be helpful to non-expert users and allow them to perform tasks in a 3D environment with ease. The authors, Weichel et al., present this application prototype facilitating the manipulation of 3D objects by hand gestures and have them 3D printed without the need for any 3D modelling skills. Yang and Liao [90] utilized hand-gesture interactions to create interactive experiences for enhancing online language and cultural learning.
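The database-matching step mentioned above for hand gesture recognition (comparing a normalized binary vector against stored gestures) can be sketched as a nearest-neighbour lookup. The use of the Hamming distance, the rejection threshold, the 8-bit vectors, and the gesture names are all assumptions for illustration; the source does not state which similarity measure or vector length the cited work [88] uses.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def match_gesture(query, database, max_distance):
    """Return the name of the closest stored gesture, or None if none is close.

    `database` maps gesture names to binary template vectors. A query farther
    than `max_distance` bits from every template is rejected, which guards
    against reporting a gesture when the hand pose is unknown.
    """
    best_name, best_distance = None, None
    for name, template in database.items():
        distance = hamming(query, template)
        if best_distance is None or distance < best_distance:
            best_name, best_distance = name, distance
    if best_distance is None or best_distance > max_distance:
        return None
    return best_name


# Toy 8-bit templates standing in for segmented, normalized hand silhouettes.
gesture_db = {
    "open_palm": (1, 1, 1, 1, 0, 0, 1, 1),
    "fist":      (0, 0, 1, 1, 1, 1, 0, 0),
}

print(match_gesture((1, 1, 1, 0, 0, 0, 1, 1), gesture_db, max_distance=2))
print(match_gesture((1, 0, 1, 0, 1, 0, 1, 0), gesture_db, max_distance=2))
```

A linear scan like this one grows with the size of the database, which relates to issues (a) and (c) listed above; larger gesture sets would likely need an indexed search.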
The virtual controls handling method of interaction refers to any interactivity between the user and a virtual control panel. In [91], Porter et al. demonstrated a prototype system that projects virtual controls onto any surface, using finger-tracking techniques to understand the user's intention of interaction with the virtual control panel. One of the benefits of the virtual controls interaction method is that any control component (virtual button, display, etc.) is rearrangeable. Therefore, its ergonomic design can vary to satisfy different needs and users. Besides hand gestures, they used 3D tabletop registration to place the virtual objects at any location on their tabletop.

Surface-Based
Surface detection is an interaction method of the surface-based interaction context in which the geometry, position, or rotation of a surface is considered in the interaction. Telepresence, the experience of distant worlds from vantage points [92], is a field where surface detection interactions can be applied to create realistic representations of the remote world to surpass the virtual-real occlusion problem. This method is commonly based on algorithms that solve the simultaneous localization and mapping (SLAM) [93] problem, in which a map of an unknown environment is built and updated while the user's position within it is tracked [94]. It is also used in marketing applications that exploit the advantages of immersive environments, such as the IKEA app [95]. In [96], the authors state that one of the well-designed features such applications should contain, among others, is "match between system and the real world", which refers to scaling and positioning virtual objects in the real world properly. In engineering, a frequently used interaction method is surface analysis. This method is somewhat more sophisticated than surface detection. It is used to analyze geometries and predict user intention of assembling different compartments [97] or interact with flexible clay landscapes to create new terrain models with surface refinement [98]. Surface refinement, another interaction method of the surface-based context, describes the techniques that  [99], an application capable of visually manipulating the haptic softness perception. The authors proposed that by virtually exaggerating the deformation effect of a material, it is possible to alter the haptic perception of the object's softness.
Such interactions also have applications in realistic representations of mixed reality environments, such as a virtual object projecting shadow onto a natural surface or a virtual light source to illuminate an actual surface [100]. In [101], the authors present the interaction model and the techniques that lead to a successful application of instant indirect illumination for dynamic mixed reality scenes. For the placement of the 3D models in the MR environment, they used marker-based interactions.

Marker-Based
Marker-based interactions contain all interactions in immersive environments supported by marker tags, such as ARTag markers [102]. Onime et al. [103] performed a reclassification of markers for mixed reality environments based, among other things, on the level of realism, the level of immersion and visibility. However, not all markers are suitable for marker-based interaction. In [104], Mark Fiala mentions that Data Matrix, Maxicode, and Quick Response (QR) codes are well suited to conveying information when held in front of a camera but not to being localized. Thus, they are not suitable to be used as fiducial markers, as is needed for applications of immersive realities. In contrast, InterSense [105], reacTIVision [106], CyberCode [107], Visual Code [108], Binary Square Marker [109], Siemens Corporate Research (SCR) [110], and BinARyID [111] are several examples of fiducial marker systems that can be used in AR and MR applications. Mateos [112] proposed AprilTags3D, which improves the accuracy of fiducial marker recognition of AprilTags in field robotics with only an RGB sensor by adding a third dimension to the marker detector. In [113], Wang et al. used the infrared marker recognition method and proposed an AR system for industrial equipment maintenance. As the markers are infrared and invisible to the naked eye, they do not cause any visual disturbance to the user [114]. A camera capable of capturing infrared light can detect information from infrared markers regarding position and rotation and successfully superimpose virtual objects in the real environment.

Location-Based
In a recent paper, Evangelidis et al. [11] demonstrated an application prototype that constitutes a continuation of the research proposal development called Mergin'Mode [115]. Their demonstration used the QR code recognition method to locate a user in the real world and serve virtual objects presented in an MR environment. In addition, QR codes contain information regarding the user's orientation and which content should be delivered from several predefined virtual worlds created to promote cultural heritage. Each observation point (the stations where the QR codes are located) exposed specific interactions in selecting a virtual agent to obtain information. In [116], Stricker et al. used the image registration method to determine the user's position and serve virtual content. The technique they describe needs a sufficient number of calibrated reference images stored in a database and requires a continuous internet connection. However, this method can provide the user's position accurately, thus improving the interaction experience.

Audio-Based Modality
The audio-based modality contains all actions and feedback, which are included in sound perception and sound stimuli. Auditory stimuli are essential in human understanding of the environment as contextual information provides situational awareness. Through auditory inputs, one can obtain information beyond visual boundaries (places out of sight, behind walls, etc.). Thus audio-based interactions can improve the immersive experience, making it "feel natural" and closer to the human way of experiencing reality. Figure 3 presents all the contexts and methods identified for the audio-based modality. Afterwards, a detailed analysis follows for each context.

Sound-Based
Sound source recognition and sound visualization are two methods Shen et al. used in [117] to augment human vision in identifying sound sources for people with hearing disabilities. The system performs identification based on recognition algorithms that exploit the microphone's capabilities and on AR tags placed on various objects to position a virtual object that indicates the sound source. Rajguru et al. [118] reviewed research papers to determine challenges and opportunities in spatial soundscapes. Spatial soundscapes exploit the spatial sound perception method to enhance situational awareness. Audio superimposition is another method that is commonly used in audio augmented reality. In [119], Harma et al. combined virtual sounds with real sounds captured by a microphone and reproduced them with stereo earphones. However, users of this method reported difficulty in separating the virtual from the real sounds.
Sound synthesis, often used in real time for immersive realities, includes the techniques and algorithms that estimate the ground reaction force based on physical models exploited by a sound synthesis engine. Nordahl et al. [120] proposed a system that affords real-time sound synthesis of footsteps on different materials. To feed the sound synthesis engine, they used inputs regarding the surface material. For solid surfaces (metal and wood), they exploited impact and friction models to simulate the act of walking and the sound of creaking wood, respectively. For aggregate surfaces (gravel, sand, snow, etc.), they further enhanced the previous models by adding reverberation, convolving the footstep sounds in real time with impulse responses recorded in different indoor environments. In addition, sound synthesis combined with sound spatialization can create realistic environmental sounds. Verron et al. [121] presented a 3D immersive synthesizer dedicated to environmental sounds. Based on Gaver's work [122], three basic categories of environmental sounds stood out: liquid sounds, vibrating objects, and aerodynamic sounds. The proposed system reduces the computational cost per sound source compared with other implementations and constitutes a reliable tool for creating multiple sound sources in a 3D space.
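The reverberation step described above, convolving a dry footstep sound with a recorded impulse response, can be illustrated with a minimal sketch. The sample values below are hypothetical, and the direct-form loop is only an illustration of the operation; it is not the implementation of [120], and real-time systems typically rely on faster FFT-based convolution.

```python
def convolve(signal, impulse_response):
    """Direct-form discrete convolution: y[n] = sum_k x[k] * h[n - k]."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, x in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += x * h
    return out


# Hypothetical data: a short dry "footstep" click and a toy room response
# consisting of the direct sound plus one quieter echo two samples later.
dry_footstep = [1.0, 0.5, 0.0, 0.0]
room_impulse_response = [1.0, 0.0, 0.25]

# The "wet" signal is the dry sound followed by its attenuated echo.
wet_footstep = convolve(dry_footstep, room_impulse_response)
print(wet_footstep)
```

The output is longer than the input by the length of the impulse response minus one, which corresponds to the reverberation tail that persists after the footstep itself has ended.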

Speech-Based
Besides engineering, sound-based interactions can produce benefits in education by exploiting speech recognition. Hashim et al. [123] developed an AR application to enhance vocabulary learning in early education. The authors integrated visual scripts for orthography and audio for phonology. They concluded that such applications provide high levels of user satisfaction and can significantly affect pronunciation, as students repeat words and phrases until they receive correct feedback. Billinghurst et al. [124] modified the VOMAR application [125], adding the Ariadne [126] spoken dialogue system with Microsoft's Speech API [127] as the speech recognition engine, to allow people to rapidly put together interior designs by arranging virtual furniture in empty rooms. Speech recognition is commonly used for computers to understand auditory commands and execute tasks. Among others, the authors used the "select", "move", and "place" commands in their application. Once the question "what are they saying" has been answered, the next step is to answer the question "who is speaking". Speaker recognition is the interaction method by which machines can derive information about the person who is speaking, or even identify that person when the available information supports it. Hanifa et al.'s [128] exceptional review covered several research papers regarding the speaker recognition method. They further classified this method into identification, verification, detection, segmentation, clustering, and diarization, and raised issues regarding variability, insufficient data, background noise, and adversarial attacks. Chollet et al. [129] reviewed technological developments in augmented reality worlds, emphasizing speech and gesture interfaces. They stressed that speaker verification could be used for authentication purposes before starting a dialogue, for example, regarding a bank transfer in a virtual world.
Lin et al. [130] designed and built an emotion recognition system for smart glasses to help improve human-to-human communication through speech emotion analysis. The target groups that could benefit the most from such applications include doctors and autistic adults. The authors collected speech sentiment and speech intonation data and analyzed them using Microsoft's Azure API and their own intonation model. The system communicates the detected emotions to the users via audio, visual, or haptic feedback in real time. Text to speech synthesis is an interaction method that synthesizes auditory outputs based on texts. Mirzaei et al. [131] combined speech recognition, speech to text synthesis, and text to speech synthesis methods to support deaf people. The system they presented captures the narrator's speech, converts it into text, and displays it on the user's AR display. Additionally, the system converts typed text into speech to talk back to the narrator if the user wants to respond.

Music-Based
Musical interactions refer to all methods related to sounds arranged in time that compose the melody, harmony, rhythm, or timbre elements. Altosaar et al. [132] present a technique for interacting with virtual objects to produce pre-recorded musical sounds. As the musician maneuvers VR controllers and collides with the virtual objects, musical feedback is played. In addition, the musician can interact with full-body movements to produce music, a method that can enhance musical expression. Bauer and Bouchara [133], exploiting the music visualization method, presented a work in progress where a user can visualize the parts of a music clip (intro, middle, and outro) and individual audio components, such as kick, synthesizer, or violins. Then, the user can manipulate these components, altering the final music outcome.

Location-Based
Bederson [134] proposed a technique that exploits the location-aware audio superimposition method to enhance the social aspects of museum visits. An infrared transmitter detects the visitor's location, which signals a computer to play or stop playing a pre-recorded description audio message. Paterson et al. [135] stressed that immersion, besides interaction, involves the "creation of presence", which is the feeling of being in a particular place. They have created a location-aware augmented reality game where the user navigates a field looking for ghosts. Based on the user location, sound effects were triggered indicating the paranormal activity level, thus exploiting the location-aware sound effects method. Lyons et al. [136] presented an RF-based location system to play digital sounds corresponding to the user's location. They have created a fantasy AR game where the user enters a game space (convention hall, gallery, etc.) with RF beacons set up. Through superimposed environmental sounds, sound effects, and narrator's guidance, the story of the game evolves. In another research paper [137], the authors described the technical aspects of a smartphone application that helps blind people experience their surrounding environment. They utilized location-aware sound effects and non-spatialized sound details, such as the full name of their location. It is worth mentioning that navigation accuracy posed an important challenge in both previous papers.
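As a rough illustration of the location-aware sound effects method, the sketch below selects which clips to play from the user's position. It assumes the position is already available as latitude and longitude (the systems above relied on infrared or RF beacons instead), and the zone coordinates, radii, and clip names are all hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Hypothetical audio zones: each clip plays while the user is inside the radius.
AUDIO_ZONES = [
    {"clip": "exhibit_intro.ogg", "lat": 40.0000, "lon": 22.0000, "radius_m": 15.0},
    {"clip": "ghost_whisper.ogg", "lat": 40.0010, "lon": 22.0000, "radius_m": 25.0},
]


def clips_in_range(lat, lon, zones=AUDIO_ZONES):
    """Return the clips whose trigger zone contains the given position."""
    return [z["clip"] for z in zones
            if haversine_m(lat, lon, z["lat"], z["lon"]) <= z["radius_m"]]


print(clips_in_range(40.0000, 22.0000))
```

Because the trigger depends only on distance to the zone center, positioning error translates directly into clips starting too early, too late, or not at all, which is consistent with the navigation accuracy challenge noted above.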

Haptic-Based Modality
The haptic-based modality defines all interactions that can be perceived through the sense of touch or executed through graspable-tangible objects. Figure 4 visualizes the contexts and methods that we identified for the haptic-based modality. In the following subsections, we explain each individual context.

Touch-Based
Yang et al. [138] presented an authoring tool system for AR environments using the single-touch and multi-touch interaction methods of the touch-based context of interaction. Single-touch refers to any touch interaction that requires only one finger, while multi-touch uses more than one finger. These methods make the 'select' and 'manipulate' tasks available for virtual objects through the touch and drag interaction model. Jung et al. [139] proposed another on-site AR authoring tool for manipulating virtual objects, utilizing the multi-touch method for mobile devices. They presented two interaction tasks, selection and manipulation (select, translate, and rotate), which can be executed simultaneously, and stated that this method is convenient for non-expert users. Kasahara et al. [140] extended the touch interaction by manipulating (moving and rotating) the device. Later, Yannier et al. [141] examined the effect of mixed reality on learning experiences. In [142], the authors created a multi-touch input solution for head-mounted mixed reality systems, making any surface capable of interaction similar to touchscreens. The problems they address relate to the user's constant motion, which constrains the use of a simple static background model. This approach also provides a solution to the lack of haptic feedback in many of the other interactions available for HMDs. Zhang et al. [143] examined the haptic touch-based interactions that can take place on the skin. They proposed ActiTouch, a novel touch segmentation technique that uses the human body as a radio frequency waveguide to enable touchscreen-like interactions.

Marker-Based
Jiawei et al. [144] suggested an interactive pen tool that transfers actions in real space into the virtual three-dimensional CAD system space. To achieve that, they used fiducial marker recognition for markers placed on top of a simple pen to capture its position and orientation. Then, the system could track any movements and draw virtual lines and shapes in the CAD system space. In [145], Yue et al. presented WireDraw, a system for drawing 3D wire objects using a 3D pen extruder. They also used fiducial marker recognition for identification purposes in the mixed environment and superimposed virtual objects as indicators to help the user design high-quality 3D objects. Yet another pen-based interaction is demonstrated by Yun and Woo [146], this time using a space-occupancy-check algorithm. First, they used a depth map to capture the 3D information of the real world, and then the algorithm checked whether any point in the depth map collides with the geometry of the virtual object. The haptic-based modality also utilizes the RFID marker recognition method. For example, Back et al. [147] described an interaction that can augment the experience of reading a book with sound effects. To achieve that, they used RFID tags embedded into each page and an electric field sensor located in the bookbinding to sense the proximity of the reader's hands and control audio parameters.
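The space-occupancy check mentioned above for [146] can be sketched as a test of whether any depth-map point falls inside the volume of a virtual object. The source does not describe the geometry test in detail, so the axis-aligned bounding box used here is an assumption made for illustration, and all coordinates are hypothetical.

```python
def point_in_box(point, box_min, box_max):
    """True if a 3D point lies inside an axis-aligned box (bounds inclusive)."""
    return all(lo <= c <= hi for c, lo, hi in zip(point, box_min, box_max))


def occupies(depth_points, box_min, box_max):
    """True if any reconstructed depth point collides with the virtual object's box."""
    return any(point_in_box(p, box_min, box_max) for p in depth_points)


# Hypothetical 3D points back-projected from a depth map, in meters.
depth_points = [(0.0, 0.0, 1.0), (0.5, 0.5, 0.5), (0.9, 0.1, 0.7)]

# Hypothetical bounding box of a virtual object.
virtual_box_min = (0.4, 0.4, 0.4)
virtual_box_max = (0.6, 0.6, 0.6)

print(occupies(depth_points, virtual_box_min, virtual_box_max))
```

A bounding box keeps the check cheap per point; a system that needs tighter contact detection would likely test against the object's mesh or a voxel grid instead.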

Controls-Based
Leitner et al. [148] presented IncreTable, a mixed reality tabletop game that exploits the digital controls handling interaction method. The users can perform interactions by utilizing digital pens with an embedded camera that tracks Anoto patterns printed on a special backlit foil. Hashimoto et al. [149] employed a joystick-based interaction to help the user drive a remote robot. The virtual 3D axes (x, y, z) were displayed in color on top of the real robot, which the user observed through a camera. Through the color indications of the robot's local axes system, the user could decide how to move the joystick to complete a task. Chuah et al. [150] used MR applications to train students in medical examinations of vulnerable patients (e.g., children with speaking disabilities). They created two interfaces for interaction: a natural interface that utilizes a tablet for drawing shapes and a mouse-based interface for the same purpose. A final survey showed that the participants preferred the natural interface over the mouse-based interaction. Finally, Jiang et al. [151] proposed HiKeyb, a typing system with a style of mixed reality. They used a depth camera, a head-mounted display (HMD), and a QWERTY keyboard to enhance the immersive experience of typing in MR.
A practical method that allows haptic interactions is capacitive sensing. This method uses the measurable distortion that an object with electrical characteristics (such as capacitance) creates within an electric field oscillating at low frequency [152]. Poupyrev et al. [153] proposed Touché, a novel swept frequency capacitive sensing technique that can detect a touch event and recognize complex configurations of the human hands and body with extremely high accuracy. Kienzle et al. [154] proposed ElectroRing, a 3D-printed wearable ring similar to an active capacitive sensing system that can detect touch and pinch actions. Finally, Han et al. presented HydroRing [155], yet another wearable device, which can provide tactile sensations of pressure, vibration, and temperature on the fingertip, enabling mixed-reality haptic interactions using liquid travelling through a thin, flexible latex tube.

Feedback-Based
Besides input functionality, the haptic-based modality also includes a feedback-based context of interaction. An exceptional review was published by Shepherd et al. [156]. The authors present a study on elastomeric haptic devices and stress that haptics has seen much slower technological advances than visual or auditory technologies because the skin is densely packed with mechanoreceptors distributed over a large area with complex topography. Their review includes haptic wearables, haptic input devices (dataglove, VR tracker, VR controller), haptic feedback output methods (direct force feedback, force substitution-skin deformation, shape morphing-surface, virtual shape rendering with lateral force control), and haptic perception. Talasaz et al. [157] explored the effect of force feedback or direct force feedback on the performance of knot tightening in robotics-assisted surgery. They stressed that presenting haptic force feedback to the user performing such tasks increased the effectiveness, although this is a debatable subject [158]. Another example of the use of the direct force feedback method is PneumoVolley [159]. The authors, Schon et al., presented a wearable cap prototype that is capable of providing pressure feedback to simulate the interaction between a virtual ball and the user's head through pneumatic actuation. Schorr and Okamura [160] presented a device wearable on the fingertip, capable of transmitting haptic stimuli exploiting the skin deformation method as a force substitution. In greater depth, a review of wearable haptics and their application in AR is presented by Meli et al. [161]. Yang et al. [162] used the shape morphing method. They mentioned several shape-morphing surfaces, such as (a) the shape display inFORM, (b) a bio-inspired pneumatic shape-morphing device based on mesostructured polymeric elastomer plates capable of fast and complex transformations, and (c) rheological test results of MR fluid affected by a magnetic field. They stated that morphing devices are not yet mature, but they have explosive potential.

Sensor-Based Modality
The sensor-based modality includes all interactions requiring any type of sensor to capture information regarding an action or transmit feedback to the user, besides visual, auditory, haptic, taste, and smell inputs/outputs. Figure 5 presents all the contexts and methods recognized for the sensor-based modality. A detailed review for each context is presented in the following paragraphs.

Pressure-Based
Kim and Cooperstock [163] used the pressure detection method and proposed a wearable mobile foot-based surface simulator whose haptic feedback varies as a function of the applied foot pressure. Their purpose was to simulate the surface of a frozen pond and include "crack" sound effects based on the applied foot pressure. Qian et al. [164], exploiting the center of pressure trajectory method, combined floor pressure data for both feet to improve recognition of visually ambiguous gestures. The outcome was a system reliable in recognizing gestures from similar body shapes but with different floor pressure distributions.
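The center of pressure underlying the trajectory method is the pressure-weighted mean of the sensor positions, computed once per frame. The sketch below assumes each frame is a list of (x, y, pressure) readings from a floor sensor grid; the sensor layout and values are hypothetical and do not come from [164].

```python
def center_of_pressure(frame):
    """Pressure-weighted mean position of one frame of (x, y, pressure) readings.

    Returns None when no pressure is applied, since the mean is undefined.
    """
    total = sum(p for _, _, p in frame)
    if total == 0:
        return None
    cx = sum(sx * p for sx, _, p in frame) / total
    cy = sum(sy * p for _, sy, p in frame) / total
    return (cx, cy)


def cop_trajectory(frames):
    """Center of pressure for each frame in order, skipping unloaded frames."""
    points = (center_of_pressure(f) for f in frames)
    return [pt for pt in points if pt is not None]


# Hypothetical frames: weight shifts from the left sensor (x=0) to the right (x=2).
frames = [
    [(0.0, 0.0, 1.0), (2.0, 0.0, 1.0)],  # evenly balanced
    [(0.0, 0.0, 1.0), (2.0, 0.0, 3.0)],  # leaning right
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],  # nobody on the floor
]

print(cop_trajectory(frames))
```

Two body poses that look alike to a camera can still produce different trajectories here, because shifting weight moves the center of pressure even when the silhouette barely changes.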

Motion-Based
Minh et al. [165] designed a low-cost smart glove equipped with a gyroscope/accelerometer and vibration motors. Using finger motion tracking and hand motion tracking, hand and finger motions can be detected through the glove, including angular movements of the arm and joints. Using the same concept, Zhu et al. [166] proposed another smart glove capable of multi-dimensional motion recognition of gestures to achieve haptic feedback. The glove is equipped with piezoelectric mechanical stimulators that provide feedback when interacting with a virtual object. In another research paper [167], the authors used Microsoft's HoloLens HMD, exploiting the head movement tracking and head gaze detection methods. They introduced a novel mixed reality system for nondestructive evaluation (NDE) training and, after a user study, concluded that such systems are preferred for NDE training. In [168], Gul et al. presented a Kalman filter for head-motion prediction for cloud-based volumetric video streaming. In practice, server-side rendering systems can provide high-resolution 3D content on any device with an acceptable internet connection speed, but they suffer from interaction latency. Although the research results were promising, the authors stressed that more research is necessary to examine several shortcomings, including predicting spherical quantities and more accurate predictions of head orientations.
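To make the idea of Kalman-filter head-motion prediction concrete, the following minimal sketch tracks a single head angle (yaw) under a constant angular velocity model and extrapolates it over a latency horizon. The state model, the noise values, and the class name `YawKalman` are assumptions chosen for illustration; the filter design used in [168] is not described in this review.

```python
from dataclasses import dataclass


@dataclass
class YawKalman:
    """Kalman filter over the state [angle, rate] with scalar angle measurements."""
    dt: float            # time between measurements, seconds
    q: float = 1e-3      # process noise variance (assumed value)
    r: float = 1e-2      # measurement noise variance (assumed value)
    angle: float = 0.0   # estimated yaw, degrees
    rate: float = 0.0    # estimated yaw rate, degrees per second
    p00: float = 1.0     # state covariance matrix entries
    p01: float = 0.0
    p10: float = 0.0
    p11: float = 1.0

    def predict(self):
        """Propagate the state and covariance: x = F x, P = F P F^T + Q."""
        dt = self.dt
        self.angle += dt * self.rate
        p00 = self.p00 + dt * (self.p10 + self.p01) + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11

    def update(self, measured_angle):
        """Correct the state with a new angle measurement (H = [1, 0])."""
        innovation = measured_angle - self.angle
        s = self.p00 + self.r              # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s  # Kalman gain
        self.angle += k0 * innovation
        self.rate += k1 * innovation
        # P = (I - K H) P
        p00 = (1 - k0) * self.p00
        p01 = (1 - k0) * self.p01
        p10 = self.p10 - k1 * self.p00
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11

    def look_ahead(self, horizon):
        """Extrapolate the yaw `horizon` seconds ahead to hide rendering latency."""
        return self.angle + horizon * self.rate


# Hypothetical input: the head turns at a steady 10 degrees per second.
kf = YawKalman(dt=0.1)
true_angle = 0.0
for step in range(200):
    true_angle = 10.0 * 0.1 * (step + 1)
    kf.predict()
    kf.update(true_angle)

predicted = kf.look_ahead(0.5)  # expected head yaw half a second from now
print(round(kf.rate, 2), round(predicted, 2))
```

A single angle sidesteps the difficulty the authors point out: full head orientation lives on a sphere, where a linear filter over independent angles is only an approximation.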

Location-Based
Another context in the sensor-based modality that provides interaction for AR and MR environments is the location-based context. Radio frequency identification (RFID) uses passive, active, or semi-passive electronic tags that store a small amount of data, usually an ID or a link, and need readers to obtain the data from the tags. Readers communicate with the tags through RF [169]. Tags are deployed across the environment, and readers are carried by or attached to the positioning subjects. Benyon et al. [170], in their work "Presence and digital tourism", mention that GPS/GNSS coordinates and near-field communication (NFC) can be used to create location-based triggered events for user interaction. Schier [171] designed a novel model for evaluating educational AR games. Using the location-aware triggered events method, the participants can interact with virtual historical figures and items, which GPS triggers to appear on their personal digital assistant (PDA). Other ways of acquiring the user's location to achieve interaction regarding navigation or to present information include, but are not limited to, dead reckoning and WiFi [172,173], visible light communication, and cellular networks (GSM, LTE, 5G) [174,175].

IoT-Based
Since the term IoT was first used in the late 1990s by Ashton [176], exceptional research work has been published. Through IoT-based interactions, it is easier to provide humans with situational and context awareness, enhance decision-making in everyday tasks, control any type of system, and offer novel interactions to disabled individuals. Atsali et al. [177] used open-source web technologies, such as X3DoM, to integrate 3D graphics on the web. Their paper described a methodology that connects IoT data with the virtual world and the benefits of using web-based human-machine interfaces. They used the Autodesk 3ds Max Design software to develop a 3D scene, including a four-apartment building. An autonomous, self-sufficient IoT mixed reality system was installed to exploit the infrastructure control and data monitoring methods for the water management infrastructure. In another work, Natephra and Motamedi [178] installed markers and an IoT system in an apartment for monitoring indoor comfort-related parameters, such as air temperature, humidity, light level, and thermal comfort. As the mobile device scans a marker, it can acquire real-time data transmitted by the sensors and visualize them as augmented reality content. Phupattanasilp et al. [179] introduced AR-IoT, a system that superimposes IoT data directly onto real-world objects and enhances object interaction. Their purpose was to achieve novel approaches to monitoring crop-related data, such as coordinates, plant color, soil moisture content, and nutrient information. Other applications of IoT-based data monitoring methods include fuel cell monitoring [180], campus maintenance [181], and environmental data monitoring utilizing serious gaming [182]. Using the IoT-based context, we can also interact with virtual agents to "humanize" interactions with objects or living organisms. Morris et al. [183] developed an avatar to express a plant's "emotion states" for plant-related data monitoring. The IoT system calculates arousal and valence based on soil, light, and moisture levels. In turn, the virtual avatar is designed to express these states on behalf of the plant. For example, the avatar will grow large and angry if the plant is left unwatered or be happy when the plant's soil moisture is at the proper levels.

Discussion
This paper describes a modality-based interaction-oriented taxonomy (Figure 6) aiming to organize existing interaction methods and present a complete view of the accomplishments to date after a thorough review of more than 200 relevant papers. A significant challenge this venture undertakes to address is the lack of a well-defined and structured schema representing human-computer interaction in the context of mixed and augmented reality environments. For example, representations commonly included in research studies classify the visual-based modality by method (e.g., gesture recognition) and the sensor-based modality by the device (e.g., pen-based or mouse-based) [33]. Other representations, generally accepted by the research community, organize interactions by research areas, thus avoiding defining an in-depth classification [30]. These research areas are described using methods (e.g., facial expression analysis), umbrella terms (e.g., musical interaction), devices (e.g., mouse and keyboard), and keywords that arbitrarily define a research area such as "taste/smell sensors". Our classification approach is based on several established theories, such as the theory of modalities and the theory of perception, following basic taxonomy rules and aspiring to eliminate inconsistencies in previous classifications.
Figure 6. A complete representation of our proposed classification using a modality-based interaction-oriented diagram that visualizes all the identified modalities, contexts, and methods.
Nevertheless, it should be stressed that the scope of the presented work is neither to elaborate on or evaluate the gathered interaction methods in terms of comparable parameters, such as efficiency, popularity, applicability, or impact on human beings, nor to present challenges and limitations. Besides, such a task would require a significant effort only to identify existing implementations and applications. Rather, the proposed work attempts to identify a sufficient number of interaction methods and organize them exploiting well-defined taxonomy rules. These rules will, in turn, place them in the field of the context applied, such as gesture-based, marker-based, location-based, and others, and the associated modality, i.e., visual/audio/haptic/sensor-based. In other words, the scope of this review paper is only to identify distinct interaction methods and present them in a well-defined structured classification.
When defining the taxonomy rules, one of the principal decisions was that, although an interaction method may be present in multiple contexts, we consider only those contexts that result in interactivity in an immersive environment (interaction-oriented classification). For example, in mixed reality projects, QR codes are employed to identify a user's location and then perform an interaction based on that location. Thus, although a QR code is usually placed in the marker-based context, in terms of interactivity it is strictly located in the location-based context, since the location, and not the QR code image itself, is what triggers the interaction. Examples where a fiducial marker system actively participates in the interaction include the ARTag and AprilTag systems. Such systems enable marker-based augmented reality, where 3D virtual objects are positioned over the identified markers, and the rotation and angle of view are continuously derived from the markers through perspective transformations. Therefore, fiducial marker systems of this kind can be included in the marker-based context.
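The interaction-oriented rule above can be expressed as a short sketch in which the context is derived from what triggers the interaction rather than from the artifact that is sensed. The names `Interaction`, `CONTEXT_BY_TRIGGER`, and `classify`, as well as the trigger labels, are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    artifact: str  # what is physically sensed, e.g. "QR code" or "AprilTag"
    trigger: str   # what actually drives the interaction


# Hypothetical mapping from trigger to context; labels follow the text above.
CONTEXT_BY_TRIGGER = {
    "location": "location-based",
    "marker pose": "marker-based",
}


def classify(interaction: Interaction) -> str:
    """Assign a context from the trigger, ignoring the sensed artifact."""
    try:
        return CONTEXT_BY_TRIGGER[interaction.trigger]
    except KeyError:
        raise ValueError(
            f"no context defined for trigger {interaction.trigger!r}"
        ) from None
```

Under these assumptions, `classify(Interaction("QR code", "location"))` returns `"location-based"`, while `classify(Interaction("AprilTag", "marker pose"))` returns `"marker-based"`, mirroring the two cases discussed in the text.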
We faced several challenges throughout our research process, such as breaking down abstract umbrella terms (e.g., musical interactions) and semantically identifying new context categories (e.g., music-based). This process revealed their respective methods (e.g., musical feedback and music visualization), thus clarifying the actual contribution of previous work and highlighting research gaps. It is noteworthy that, when the recognized modalities are arranged with their respective contexts in a graph, some modalities share contexts (Figure 7). An interesting future study may present modalities, contexts, and methods as RDF triples, representing the proposed taxonomy semantically. As this paper makes the first attempt at classifying human perceptual modalities, further research is required to establish a robust representation. Furthermore, this taxonomy may be readjusted to include a wider field of modalities, such as the smell-based and taste-based modalities not covered in this paper. The proposed contexts and methods may also be enriched, and the final taxonomy can be expanded to cover the techniques used in each method. For example, the hand gesture recognition method may include techniques that exploit the YOLO (you only look once) algorithm or R-CNN (region-based convolutional neural network). Expanding the classification to include the techniques used for each method can later connect methods with testing frameworks and methodologies. Well-defined testing frameworks can provide a schema based on which users can endorse or reject an interaction model and, therefore, provide ratings generated through a shared process.
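As a rough indication of what the RDF-based representation suggested above might look like, the sketch below stores the taxonomy as subject-predicate-object triples and serializes them in an N-Triples-like form. The namespace, the predicate names (`hasContext`, `hasMethod`, `usesTechnique`), and the assignment of the location-based context to both the visual-based and sensor-based modalities are assumptions made for illustration; they are not definitions taken from this paper.

```python
from collections import defaultdict

BASE = "http://example.org/taxonomy#"  # hypothetical namespace

# Illustrative triples; predicate names and placements are assumptions.
TRIPLES = [
    ("visual-based", "hasContext", "location-based"),
    ("sensor-based", "hasContext", "location-based"),
    ("visual-based", "hasContext", "gesture-based"),
    ("gesture-based", "hasMethod", "hand-gesture-recognition"),
    ("hand-gesture-recognition", "usesTechnique", "YOLO"),
]


def to_ntriples(triples):
    """Serialize the triples, one statement per line, with full IRIs."""
    return "\n".join(
        f"<{BASE}{s}> <{BASE}{p}> <{BASE}{o}> ." for s, p, o in triples
    )


def shared_contexts(triples):
    """Map each context attached to more than one modality to its modalities."""
    owners = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "hasContext":
            owners[obj].add(subject)
    return {
        context: sorted(modalities)
        for context, modalities in owners.items()
        if len(modalities) > 1
    }
```

With this sample data, `shared_contexts(TRIPLES)` reports that the location-based context is attached to two modalities, which is the kind of overlap a graph view makes visible. A technique level, such as the `usesTechnique` statement, can be added without changing the existing triples, which is one reason a triple-based form may suit a taxonomy that is expected to grow.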