User Interactions for Augmented Reality Smart Glasses: A Comparative Evaluation of Visual Contexts and Interaction Gestures

Abstract: Smart glasses for wearable augmented reality (AR) are widely used in various applications, such as training and task assistance. However, as the field of view (FOV) in the current AR smart glasses is narrow, it is difficult to visualize all the information on the AR display. Besides, only simple interactions are supported. This paper presents a comparative and substantial evaluation of user interactions for wearable AR concerning visual contexts and gesture interactions using AR smart glasses. Based on the evaluation, it suggests new guidelines for visual augmentation focused on task assistance. Three different types of visual contexts for wearable AR were implemented and evaluated: stereo rendering and direct augmentation, and non-stereo rendering and indirect augmentation with/without video background. Also, gesture interactions, such as multi-touch interaction and hand gesture-based interaction, were implemented and evaluated. We performed quantitative and qualitative analyses, including performance measurement and questionnaire evaluation. The experimental assessment proves that both FOV and visual registration between virtual and physical artifacts are important, and they can complement each other. Hand gesture-based interaction can be more intuitive and useful. Therefore, by analyzing the advantages and disadvantages of the visual context and gesture interaction in wearable AR, this study suggests more effective and user-centric guidance for task assistance.


Introduction
Augmented reality (AR) has been widely utilized in various fields because it can visually superimpose different information, such as text, images, videos, and 3D virtual objects, onto corresponding physical artifacts. In the early stage of AR, it was mainly applied to entertainment, such as games, but recently it has been expanded and applied in various other fields, such as assembly, maintenance, training, remote collaboration, and smart manufacturing [1][2][3][4][5].
In different manufacturing fields, an industrial worker can perform a variety of tasks, such as monitoring, maintenance, assembly, and inspection of equipment and products [6]. In performing these tasks, the worker must review the dynamically changing status of equipment and products in real time and take appropriate actions [7]. AR can support task assistance by overlaying machine status information onto a related physical artifact and by providing step-by-step instructions. Thus, the worker can perform their tasks more effectively. However, the worker must use a separate handheld device, such as a smartphone or tablet, to acquire task assistance. In particular, novice workers are required to have other training and instruction manuals when performing such tasks in an actual manufacturing environment. Thus, the worker cannot observe the equipment, products, and handheld device at the same time while performing tasks because their attention would be distracted.
AR smart glasses, such as Epson Moverio [10] and Microsoft HoloLens [11], have been developed and commercialized to provide users with wearable AR applications in mobile environments. In wearable AR, it is essential to help the worker understand and interact with the AR-based visual instructions more intuitively and with less cognitive burden. For example, when the worker must follow different types of annotations augmented on the display of the smart glasses (right of Figure 1), it is necessary to provide user-friendly visual augmentation forms on the optical see-through display of the AR smart glasses.
However, AR smart glasses with a see-through display have a limited display size and viewing angle [6]. For example, Epson Moverio BT-200 has a 0.42 inch wide LCD panel, a dual display with 960 × 540 resolution, and a field of view (FOV) of approximately 23 degrees [10]. The FOV in current smart glasses is not large enough to display all the necessary information, which limits user-centric task assistance. Furthermore, user interaction in the wearable AR environment is neither natural nor well suited to manufacturing services, since it cannot support two-handed operations and interaction is only supported by a multi-touchpad (e.g., Epson Moverio). Recently, two-handed operations have been supported in other smart glasses, such as MS HoloLens, but they can only support simple interactions, such as clicking and selection.
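To give a feel for how restrictive a 23-degree FOV is, the following minimal sketch computes the width of the region that such a display can cover at a given working distance. It treats the 23-degree figure as a single angular extent (the text does not say whether it is horizontal or diagonal), and the function names and the 0.5 m working distance are assumptions made only for illustration.

```python
import math


def visible_extent(fov_deg: float, distance_m: float) -> float:
    """Width (in meters) covered by an angular FOV at a given distance.

    Uses the standard relation: extent = 2 * d * tan(FOV / 2).
    """
    return 2.0 * distance_m * math.tan(math.radians(fov_deg) / 2.0)


def fits_in_view(object_width_m: float, fov_deg: float, distance_m: float) -> bool:
    """True if an object of the given width fits inside the augmentable region."""
    return object_width_m <= visible_extent(fov_deg, distance_m)


if __name__ == "__main__":
    # Assumed values: 23-degree FOV (from the text) and a hypothetical
    # arm's-length working distance of 0.5 m.
    extent = visible_extent(23.0, 0.5)
    print(f"Augmentable width at 0.5 m: {extent:.3f} m")
    print("0.4 m wide panel fits:", fits_in_view(0.4, 23.0, 0.5))
```

Under these assumptions the augmentable region at arm's length is only about 0.2 m wide, so a larger piece of equipment cannot be annotated in full without head movement.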
It is crucial to analyze user interactions concerning not only visual contexts but also interaction metaphors to help the worker to effectively utilize AR smart glasses with limited capabilities for task assistance [12]. Note that usability is the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use. On the other hand, user experience is a person's perceptions and responses that result from the use or anticipated use of a product, system or service, and thus it includes all the users' emotions, beliefs, preferences, perceptions, etc. [12]. Therefore, this paper proposes a comparative and substantial evaluation of user interactions for wearable AR concerning the visual context and interaction gesture using AR smart glasses. It also suggests new guidelines for visual augmentation focused on task assistance. We performed quantitative and qualitative analyses, including performance measurement and questionnaire evaluation. By analyzing the advantages and disadvantages of wearable AR for visualization and interaction, we could provide more effective guidance for user-centric task assistance.
In this research, the visual augmentation, according to the visual context in the see-through display of AR smart glasses, is classified as follows.

1. Stereo rendering and direct augmentation
2. Non-stereo rendering and indirect augmentation with video background
3. Non-stereo rendering and indirect augmentation without video background

The stereo rendering and direct augmentation method superimposes virtual information directly onto its physical artifact through the see-through display of smart glasses (Figures 2d and 3a). The stereo rendering can provide stereo rendering-based immersive visualization, and the overlaid virtual information is almost matched or registered with its corresponding physical artifact. For this reason, a registration process is required to match the virtual information with the physical artifact. The non-stereo rendering and indirect augmentation with video background overlays virtual information indirectly onto the physical artifact in the image taken by the smart glasses' AR camera and then provides the augmented image to the user through the see-through display of AR smart glasses (Figures 2e and 3b). In this case, the virtual information does not match with the physical artifact concerning the user's viewpoint. The non-stereo rendering and indirect augmentation without video background method removes the camera image from the non-stereo rendering and indirect augmentation with video background and presents only the virtual information on the display (Figures 2f and 3c). This method also does not support matching virtual information with the physical artifact.
For most AR smart glasses, such as Epson Moverio (Figure 2a), the stereo rendering and direct augmentation approach has a narrower FOV than the non-stereo rendering and indirect augmentation through the see-through display. In return, it can provide visual augmentation that is consistent with physical objects. Therefore, the different visual contexts have their own advantages and disadvantages concerning FOV and visual registration, which can influence task performance and usability. However, few studies have evaluated this issue substantially.
Besides, since most AR smart glasses can only support simple interactions using touch gestures and hand gestures, it is challenging to provide more user-centric interactions using full hand gestures. One drawback of full hand gestures is that another sensor, such as a depth sensor, must be attached to the AR smart glasses, which may hinder the worker in conducting tasks and add cognitive load. It is also essential to calibrate the coordinate system of the wearable AR display with that of the hand tracking sensor. Furthermore, it is still necessary to evaluate the usability of interactions concerning task performance and the cognitive load of the industrial worker in an actual manufacturing environment.
As the most popular gesture interactions in wearable AR are touch-based and gesture-based [6], we have classified them as follows.

1. Hand gesture-based interaction
2. Multi-touch interaction
Through the hand gesture-based interaction, the user wearing AR smart glasses can manipulate virtual objects through hand gestures directly (Figure 4a). Besides, it supports bi-manual operations as the worker can use both hands. However, to support hand gesture-based interaction, an additional depth sensor attached to the front of smart glasses is required, and coordinate systems between the smart glasses and the depth sensor must be calibrated. On the other hand, the multi-touch interaction uses a touchpad connected to the AR smart glasses (Figure 4b). The usage is similar to the touch interface of a smart device, such as the smartphone or tablet. Thus, its interaction is familiar to the worker. However, it cannot support bi-manual operations as the worker must touch the pad with one hand.

This study has the following contributions.
• We have conducted a comparative and substantial evaluation of user interactions for AR smart glasses concerning visual contexts and gesture interactions, which suggests new directions for task assistance in wearable AR.
• We have performed quantitative and qualitative experiments to analyze task performance and usability to provide user-centric visualization and interaction.
• Through the experiment, we have found that FOV and visual registration complement each other so that it is necessary to increase the visual focus of the information by calculating degree-of-interest.
• The analysis result was tested by a real use case of industrial workers concerning cognitive load by performing tasks using wearable AR.
This paper is organized as follows. Section 2 presents related work. Section 3 defines the visualization and interaction methods in wearable AR. Section 4 describes the first user study to compare the methods for visual contexts and gesture interactions through virtual object manipulations. Experimental results were evaluated through quantitative and qualitative analyses. Section 5 presents the second user study to assess the methods by comparing different FOVs and evaluating a 3D matching operation between physical and virtual objects. Section 6 describes a real use case of industrial workers who have experienced the proposed approach. Section 7 discusses user studies for user-centric task assistance with AR smart glasses and suggests new challenges. Section 8 concludes the paper and suggests future works.

Related Work
In recent years, wearable AR using smart glasses has been applied to various applications and evaluated for user-centric task assistance [1,8,13–15]. Previous research works in wearable AR can be classified into two areas: (1) visual context-based visualization and (2) gesture interactions in the physical environment.

Visual Context-Based Visualization
Several research works have been conducted to provide visualization methods concerning visual contexts for practical task assistance in different environments [1,8,9,13]. Most smart AR glasses support see-through augmentation. However, regarding visual contexts, they can provide either direct or indirect augmentation on physical artifacts. Direct augmentation requires visual registration between physical and virtual objects. Although it can provide a more immersive feeling, its FOV is narrower than the indirect augmentation. Therefore, there is a trade-off between these augmentations concerning visual contexts. However, there is little research to analyze visual contexts in wearable AR for task assistance.
Previous studies proposed methods for task assistance in wearable AR. Zhu et al. [9] proposed AR-Mentor, which is an AR system for assisting in maintenance and repair tasks. Their method combined AR smart glasses with six degrees of freedom (DOF) pose tracking and virtual personal assistance to instruct a novice in performing maintenance tasks. Funk et al. [16] proposed HoloCollab, which is a system using AR smart glasses to create an augmented training experience for both trainer and trainee. Some other works have attempted to combine an optical see-through display with projector-based spatial augmented reality [17], and have discussed how a working system should be developed by integrating smart glasses with a question-answering module [18].
The visualization methods were also compared and evaluated in various devices [19,20], including tablets, head mounted displays (HMDs), and projectors. Zeng et al. [1] investigated what characteristics of eye-wearable technology impact user performance in machine maintenance by comparing eyewear-central and eyewear-peripheral visualizations depending on the position of the display. They found that the eyewear-central display resulted in faster task completion time than the eyewear-peripheral display. Besides, the influence of the graphical context on task performance was also evaluated [8,13,14]. Robertson et al. [13,14] assessed the effects of registration error in a Lego block placement task and the effectiveness of graphical context for task assistance. Their work showed that adding contexts to a scene due to registration error could help a user perform given tasks more effectively by reducing the number of mistakes and task completion time. Khuong et al. [8] evaluated the effectiveness of an AR-based context-aware assembly support system using two different visual overlays. Their experimental results showed that the visual overlay that presented target status next to real objects outperformed the direct overlaid visualization on the actual objects.

Gesture Interactions
The user can perform either a 2D-based task, such as selecting buttons and typing texts, or a 3D-based task, such as 3D orientation and translation [21]. Therefore, offering AR smart glasses with better input metaphors makes the interaction more intuitive and efficient, which enables the user to handle more complicated and visually demanding tasks [15]. In particular, this paper describes two widely used approaches: gesture-based interaction and touch-based interaction.
Gesture-based interaction tracks the position and orientation of the user's fingers or hands using the camera or inertial measurement unit (IMU) of the AR smart glasses and supports interactions with virtual objects in the AR environments [21][22][23][24][25][26][27][28][29][30]. Ha et al. [21] proposed WeARHand, which allows the user to manipulate virtual 3D objects with a bare hand in a wearable AR environment. Dudley et al. [29] proposed a system that enabled users to type on a virtual keyboard using a head-mounted AR device and hand localization derived from body-fixed sensors. Meanwhile, Hsieh et al. [28] distilled a set of design principles for mid-air gestural interaction for smart glasses, one of which showed the potential of using an independent hand tracking device, such as a haptic glove, to form an integrated wearable system with smart glasses. Hand gesture-based interaction can support natural gestures, such as pointing and grabbing, to interact with virtual objects [15].
In other research works, touch-based interaction has been widely used in various smart and mobile devices [31][32][33][34][35][36]. Ahn et al. [33] proposed using a smartwatch as an indirect input device to make an effective smart glasses interaction method for tasks, such as text entry. Whitmire et al. [34] proposed DigiTouch, which is a touch-sensitive glove that enables thumb-to-finger interaction for eyes-free input on wearable systems. DigiTouch uses thin, partially conductive fabric strips along the fingers and a conductive patch on the thumb pad. Wang et al. [35] presented PalmType, which uses palms as interactive surfaces to enable an intuitive and efficient keyboard interface for smart glasses. Lissermann et al. [36] proposed the augmentation of accessories worn behind the ear with a device that unobtrusively instrumented the ear as an interactive surface.
Some researchers have also evaluated interaction methods using AR smart glasses [37,38]. Terrenghi et al. [38] found that the modes of manipulation of physical versus digital media were fundamentally different. However, previous methods have not been fully evaluated by regarding both the visual context and the interaction gesture, although both aspects may affect each other.
Although previous research works presented various approaches to the visualization and interaction in wearable AR environments, most of them have focused on developing domain-specific applications rather than conducting comparative evaluations for user-centric task assistance concerning different types of visual augmentation. Regarding the visual context, there has been little research on evaluating task performance through the comparison between FOV and visual registration in wearable AR. In terms of the gesture interaction, further study is still needed for task assistance with combined gesture interactions in manufacturing operations. Note that some operations were effective in 2D-based tasks, such as selecting buttons, but others were useful for 3D-based tasks, such as 3D rotation and translation of virtual objects [15].
Furthermore, although several research works have suggested new interactions, there is further room for improvement in wearable AR. Therefore, in this study, we analyzed not only the usability of the visual context but also interaction methods using AR smart glasses. Quantitative and qualitative analyses were carried out by comparing and evaluating task performance. This paper also presents a new challenge by analyzing experimental results.

Visual Context and Gesture Interaction
This section defines visual contexts and gesture interactions for comparative evaluation in wearable AR.

Visual Augmentation Depending on Visual Contexts
We classified three different types of visual augmentation depending on how virtual objects are superimposed on the optical see-through display of AR smart glasses. As shown in Figure 5, there are two differences in the visual augmentation, depending on the visual context. One difference is the amount of information visualized in the display of the smart glasses. The other is how exactly the physical artifact and its corresponding virtual object are matched through the display or the camera.

The stereo rendering and direct augmentation is a visualization method that supports almost seamless matching between physical and virtual objects through the see-through display. Thus, a high degree of immersion can be provided to the user, and both the physical object and the virtual object can be viewed without mismatch. However, as the FOV of the see-through display for human eyes is narrower, as shown in Figure 5, even if the virtual object can be matched with the physical object, the amount of information that can be visualized at once is smaller than that of the non-stereo rendering and indirect augmentation method.
On the other hand, the non-stereo rendering and indirect augmentation with the video background is a method of augmenting a virtual object on a captured camera image that is rendered on the see-through display. Usually, the camera FOV is larger than that of the display of the AR smart glasses. However, the visual AR information does not match with a physical object, because the matching is accomplished from the camera view, not from the user's perspective. This visual augmentation provides more information to the user simultaneously even if accurate matching between the physical and the virtual objects is not achieved through the display of the smart glasses. However, inconsistency between the physical and virtual objects can cause a cognitive burden on the user in the process of performing tasks.
The non-stereo rendering and indirect augmentation without video background removes the camera background image from the second method and renders only virtual information. Thus, unlike the second method, the background image of the camera cannot be seen.
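The mismatch in the indirect methods can be illustrated with a small parallax calculation. The sketch below is only a simplified geometric model and is not taken from the evaluated system: it assumes the camera is displaced sideways from the user's eye by a fixed baseline (a hypothetical 3 cm), that both look in the same direction, and that the object lies straight ahead of the eye. The function name `parallax_deg` is an illustrative choice.

```python
import math


def parallax_deg(baseline_m: float, distance_m: float) -> float:
    """Angular offset (degrees) between the direction to an object as seen
    from the eye and as seen from a camera displaced by `baseline_m`.

    Simplified model: the object is straight ahead of the eye at
    `distance_m`, so the camera sees it at an angle of atan(baseline / distance).
    """
    return math.degrees(math.atan2(baseline_m, distance_m))


if __name__ == "__main__":
    baseline = 0.03  # hypothetical eye-to-camera offset in meters
    for distance in (0.3, 0.5, 1.0, 2.0):
        offset = parallax_deg(baseline, distance)
        print(f"object at {distance:.1f} m -> offset {offset:.2f} degrees")
```

In this model the offset grows as the object gets closer, which suggests why registration from the camera view alone is least reliable for hands-on tasks performed at short range.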

Gesture Interaction
Considering currently available AR smart glasses, interactions with virtual objects in the wearable AR environment can be classified into hand gesture-based interaction and multi-touch interaction using a touchpad (Figure 4). In particular, as shown in Figure 6a, the hand gesture interaction is similar to how the user performs a task using a hand in a real environment. Through different hand gestures, as shown in Figure 6b, it is possible to achieve interaction, such as selecting (picking or grabbing), moving (dragging), zooming (pinching), or rotating a virtual object that is visualized in wearable AR.
captured camera image that is rendered on the see-through display. Usually, the camera FOV is larger than that of the display of the AR smart glasses. However, the visual AR information does not match with a physical object. The matching is accomplished from the camera view, not from the user's perspective. This visual augmentation provides more information to the user simultaneously even if accurate matching between the physical and the virtual objects is not achieved through the display of the smart glasses. However, inconsistency between the physical and virtual objects can cause a cognitive burden on the user in the process of performing tasks. The non-stereo rendering and indirect augmentation without video background deletes camera background image in the second method and renders only virtual information. Thus, unlike the second method, the background image of the camera cannot be seen.

Gesture Interaction
Considering currently available AR smart glasses, interactions with virtual objects in the wearable AR environment can be classified into hand gesture-based interaction and multi-touch interaction, using a touchpad ( Figure 4). In particular, as shown in Figure 6a, the hand gesture interaction is similar to how the user performs a task using a hand in a real environment. Through different hand gestures, as shown in Figure 6b, it is possible to achieve interaction, such as selecting (picking or grabbing), moving (dragging), zooming (pinching), or rotating a virtual object, that is visualized in wearable AR. To support hand gesture-based interactions with virtual objects, the position and posture of the user's hand must be consistently tracked and recognized. Furthermore, the traced real hand and the virtual hand in wearable AR must be synchronized on the display of AR smart glasses. For this purpose, we used a depth sensor, which tracks the real hand and acquires the positions and postures of the palm and finger joints in real-time. By matching the generated virtual hand with the actual hand, the user can directly and intuitively interact with the virtual object in the wearable AR space.
On the other hand, multi-touch interaction is easily performed on a touchpad connected to the smart glasses, as shown in Figure 7. The touch interaction is intuitive and natural. However, only one hand is available during the interaction. First, the user sees a cursor on the display of the smart glasses and moves it via the touchpad to select the object. Then, they can interact with the virtual object through rotation and translation. However, as multi-DOF interaction, such as the simultaneous translation and rotation possible with the hand gesture method, cannot be performed, the user must perform the operations sequentially, switching from one mode to the other through a menu selection. In addition, during the interaction in wearable AR, the touchpad must be held in one hand, which limits the use of both hands.
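The mode switching described for the touchpad can be illustrated with a minimal sketch. It is a simplified planar model, not the implementation used in the study: the class name `TouchManipulator`, the two mode names, and the gain `DEG_PER_UNIT` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical gain: degrees of rotation per unit of horizontal drag.
DEG_PER_UNIT = 90.0


@dataclass
class TouchManipulator:
    """Planar stand-in for a touchpad-driven object manipulator.

    Only one mode is active at a time, so a drag either translates
    or rotates the object, never both.
    """
    mode: str = "translate"
    position: list = field(default_factory=lambda: [0.0, 0.0])
    angle_deg: float = 0.0

    def select_mode(self, mode: str) -> None:
        # Stands in for the menu selection that switches the mode.
        if mode not in ("translate", "rotate"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode

    def drag(self, dx: float, dy: float) -> None:
        # The same drag input is interpreted according to the active mode.
        if self.mode == "translate":
            self.position[0] += dx
            self.position[1] += dy
        else:
            self.angle_deg = (self.angle_deg + dx * DEG_PER_UNIT) % 360.0
```

A hand gesture interface, by contrast, can update position and orientation from a single tracked hand pose, which is why no mode selection is needed there.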

Overview of the Experiment
We conducted a comparative evaluation of user interactions concerning the visual context and gesture interaction. We assumed that FOV and visual registration would be influential factors for supporting user-centric task assistance, and the hand gesture would also provide more natural and intuitive interactions in wearable AR. The hand gesture is more natural in the physical environment, such as a manufacturing environment, and it can also support bi-manual operations. For evaluating gesture interactions, the visual context is based on the non-stereo rendering and indirect augmentation because most AR smart glasses can support non-stereo rendering and indirect augmentation rather than stereo rendering and direct augmentation [6,10,15].

Experimental Setup
Epson's Moverio BT-200 was used as the AR smart glasses [6,10]. The transparent screen is located at the front of the user's view (eyewear-central), and a separate touchpad connected to the smart glasses is provided to enable multi-touch interaction. The wearable AR system was implemented using Unity3D [39]. We also used the Vuforia software development kit (SDK) [40] to track natural feature-based AR markers, which was used to overlay virtual objects onto the physical environment.
We used a Leap Motion sensor [41] as a depth sensor to track the user's actual hands to support the hand gesture interaction. As shown in Figure 8, the Leap Motion sensor was attached to the front of the safety helmet of the worker or the smart glasses, while the AR camera was embedded in the AR smart glasses. The absolute orientation algorithm [42,43] was used for the calibration of the two coordinate systems. In particular, as the origin, transformation, and scale are different between the coordinate system of the camera of the smart glasses and that of the depth sensor, a calibration operation for matching the two coordinate systems must be performed, as shown in Figure 9.
Let $X_{Leap} = \{x_1, x_2, \cdots, x_n\}$, a set of points in the Leap Motion coordinate system, and $Y_{AR} = \{y_1, y_2, \cdots, y_n\}$, a set of points in the wearable AR coordinate system, be corresponding points in 3D space. The absolute orientation algorithm finds the similarity transformation parameters (rotation $R$, translation $t$, and scaling $c$) that give the least mean squared error between these two point sets. The mean squared error $e^2$ is defined as follows.

$$e^2(R, t, c) = \frac{1}{n} \sum_{i=1}^{n} \left\| y_i - (c R x_i + t) \right\|^2$$

In summary, to minimize $e^2$, we have to find $R$, $t$, and $c$, together with the resulting minimum value $e^2_{\min}$, such that

$$R = U S V^T, \quad t = \mu_{AR} - c R \mu_{Leap}, \quad c = \frac{1}{\sigma_{Leap}^2} \operatorname{trace}(DS), \quad e^2_{\min} = \sigma_{AR}^2 - \frac{\operatorname{trace}(DS)^2}{\sigma_{Leap}^2}$$

where $\operatorname{trace}(DS)$ is the sum of the elements on the main diagonal of $DS$, $\Sigma_{xy}$ is a covariance matrix of $X_{Leap}$ and $Y_{AR}$, $\mu_{Leap}$ and $\mu_{AR}$ are mean vectors of $X_{Leap}$ and $Y_{AR}$, and $\sigma_{Leap}^2$ and $\sigma_{AR}^2$ are variances of $X_{Leap}$ and $Y_{AR}$, respectively. Let a singular value decomposition of $\Sigma_{xy}$ be $U D V^T$, with $D = \operatorname{diag}(d_i)$, $d_1 \ge d_2 \ge \cdots \ge d_m \ge 0$, and

$$S = \begin{cases} I & \text{if } \det(\Sigma_{xy}) \ge 0 \\ \operatorname{diag}(1, 1, \cdots, 1, -1) & \text{if } \det(\Sigma_{xy}) < 0 \end{cases}$$

The detailed explanation and the proof are described in Reference [43].
Through the calibration between the AR camera and the Leap Motion sensor, the real hand can be matched with a virtual hand in the wearable AR space, as shown in Figure 10.
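As an illustration of this calibration step, the following NumPy sketch implements a standard closed-form least-squares solution for a similarity transformation between two corresponding 3D point sets (often attributed to Umeyama). It is a sketch under stated assumptions, not the code used in the study: the function name `absolute_orientation` and the synthetic demo data are hypothetical, and the covariance is assumed to have full rank.

```python
import numpy as np


def absolute_orientation(x_leap, y_ar):
    """Estimate (c, R, t) minimizing mean ||y_i - (c R x_i + t)||^2.

    x_leap, y_ar: arrays of shape (n, 3) holding corresponding points.
    """
    x = np.asarray(x_leap, dtype=float)
    y = np.asarray(y_ar, dtype=float)
    mu_x = x.mean(axis=0)                    # mean vector of X_Leap
    mu_y = y.mean(axis=0)                    # mean vector of Y_AR
    xc = x - mu_x
    yc = y - mu_y
    var_x = (xc ** 2).sum(axis=1).mean()     # variance of X_Leap
    cov = yc.T @ xc / len(x)                 # 3x3 covariance of the two sets
    U, d, Vt = np.linalg.svd(cov)            # cov = U diag(d) V^T
    S = np.eye(3)
    if np.linalg.det(cov) < 0:               # guard against a reflection
        S[2, 2] = -1.0
    R = U @ S @ Vt
    c = (d * np.diag(S)).sum() / var_x       # trace(DS) / variance of X_Leap
    t = mu_y - c * R @ mu_x
    return c, R, t


# Synthetic check with made-up values (not data from the study).
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(10, 3))
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
c_true = 2.5
t_true = np.array([0.1, -0.2, 0.3])
y_demo = c_true * x_demo @ R_true.T + t_true   # y_i = c R x_i + t
c_est, R_est, t_est = absolute_orientation(x_demo, y_demo)
```

With noise-free correspondences, the estimated parameters should reproduce the synthetic transformation up to floating-point error; with real sensor data, the residual error indicates the calibration quality.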

Figure 10. Actual hand detection and calibration with the virtual hand in the wearable AR.

Experimental Design
For the comparative evaluation of visual contexts and interaction gestures in the wearable AR environment, we designed two user studies. Three experimental tasks, (1) selection, (2) dragging and dropping, and (3) 3D matching, were designed to compare and evaluate the usability and task performance concerning the visual context and interaction gesture in wearable AR. Two of the three tasks, dragging and dropping and 3D matching, were conducted to evaluate the visual context. Two tasks, selection and 3D matching, were conducted to assess the interaction gesture. When a subject looked at a specific area with one or more generic feature-based AR markers through the camera of the AR smart glasses, virtual objects were augmented on or above the AR marker, and the subject was then required to perform the given task.
Twenty university students (fourteen males and six females; mean age 25.9, SD = 1.25) were recruited, and the experiment was conducted in a controlled laboratory environment [44]. Four of them had experience in AR applications, two had experience in using Leap Motion, and the others had no such experience. Before the experiment, subjects were instructed on how to perform the three tasks, along with a description of the visualization and interaction methods, and then practiced until they were familiar with each method. After practice, subjects performed the full-scale tasks in a random order to avoid the learning effect. In other words, the order of the three tasks and the order of the user studies were randomly assigned to each subject.
Through the experiment, quantitative and qualitative assessments of visual contexts and interaction gestures were conducted. For the quantitative evaluation, the completion time of each task was measured. After the completion of all the tasks, surveys and interviews were also conducted for qualitative evaluation. The questionnaire consisted of questions that were evaluated on a 7-point Likert scale (1: strongly disagree, 7: strongly agree). In particular, we used all the questions from the System Usability Scale (SUS) [45] without any modification. SUS is well known as an objective questionnaire for usability evaluation. The questions are not listed in this paper. Figure 11 shows a snapshot of the worker who performed a matching task with the Universal Robot-3 (UR3) robot. UR3 is a well-known intelligent and collaborative robot widely used in manufacturing environments.
Figure 11. A worker performing an experimental task.
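For readers who want a single usability figure per subject, the conventional SUS composite score can be adapted to the 7-point scale used here. The sketch below is an assumption for illustration only: the study analyzes the ratings question by question and does not state that a composite score was computed, and the function name `sus_score` and the rescaling to a 7-point scale are not taken from the paper. In SUS, odd-numbered items are positively worded and even-numbered items are negatively worded.

```python
def sus_score(responses, scale_max=7):
    """Composite SUS-style score on a 0-100 range.

    responses: ten ratings (Q1..Q10), each between 1 and scale_max.
    Odd items contribute (rating - 1); even items contribute
    (scale_max - rating), so higher always means better usability.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly ten responses")
    total = 0
    for index, rating in enumerate(responses, start=1):
        if not 1 <= rating <= scale_max:
            raise ValueError(f"rating out of range for Q{index}: {rating}")
        if index % 2 == 1:
            total += rating - 1          # positively worded item
        else:
            total += scale_max - rating  # negatively worded item
    # Each item contributes at most (scale_max - 1) points.
    return 100.0 * total / (10 * (scale_max - 1))
```

With `scale_max=5`, the function reduces to the usual SUS formula (sum of item contributions multiplied by 2.5).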

Selection
Selection or picking is one of the fundamental user interaction tasks, used to perform a specific operation such as pushing a button or an icon. Thus, we analyzed the task performance on the selection of virtual icons according to different visual contexts. Through this task, the subject can change the position and orientation of the robot directly. As shown in Figure 12, a virtual control panel for manipulating the UR3 robot is augmented on the see-through display of the AR smart glasses. At the top of the virtual control panel, an instruction indicates which icon the subject must select, either with a hand gesture or a touch gesture. The subject must understand the instruction and select the correct icon. For example, if 'Force +' or 'Speed MIN' is displayed, the subject has to select the corresponding button on the see-through display. In the case of hand gesture-based interaction, the subject can choose the virtual icon directly with their finger. In the case of touch interaction, the subject moves the cursor with the touchpad and selects the correct button by touching the pad. The task continues until the subject has selected all the required buttons.

Dragging and Dropping
The detailed steps of the dragging and dropping task are shown in Figure 13. The dragging and dropping task is a 2D operation on a plane in 3D space. Thus, this task evaluates the performance of 2D-based manipulation in the 3D space. The goal of this task is to move a visual object to the desired position by dragging and dropping a specified virtual icon or slide bar in the virtual control panel. As shown in Figure 13, the virtual control panel displays the type of slide bar and the attribute value to be set. The subject then has to understand the required task and slide the bar to the desired value using a hand gesture. For example, for the UR3 robot, the worker can pull a slide bar to change a joint angle of the robot or the speed of the robot's movement. The task is completed when the subject has moved the slide bar to the desired location.
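The slide bar interaction can be sketched as a mapping from the tracked hand position along the bar to an attribute value, followed by a tolerance check against the target. The function names, the track coordinates, and the tolerance below are hypothetical; the paper does not specify how the slide bar value or the completion condition was computed.

```python
def slider_value(hand_x, track_start, track_end, v_min, v_max):
    """Map a hand position along the slider track to an attribute value."""
    ratio = (hand_x - track_start) / (track_end - track_start)
    ratio = max(0.0, min(1.0, ratio))    # clamp to the ends of the track
    return v_min + ratio * (v_max - v_min)


def slider_task_done(value, target, tolerance):
    """The task counts as complete when the value is close to the target."""
    return abs(value - target) <= tolerance
```

For example, a joint angle slider covering -180 to 180 degrees would return 0 degrees when the hand is at the middle of the track.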

3D Matching Between a Virtual Object and its Corresponding Physical Object
The 3D matching task requires sophisticated user interaction, such as simultaneous 3D translation and rotation, to perform an exact matching between control and target objects in the wearable AR space. In this task, the control object is virtual, and the target object is real. In particular, to quantitatively evaluate the matching between them, an invisible virtual model of the physical target object is created and synchronized at the same position of the physical object. Therefore, the 3D matching task can be performed between a virtual control object and the invisible virtual model of a physical target object, as shown in Figure 14. As one of the fundamental and complicated manipulation tasks in wearable AR, this task was conducted to evaluate both visual contexts and interaction gestures. Note that 3D matching requires both 3D translation and 3D rotation so that it can cover selection and dragging and dropping tasks. However, it is essential to evaluate 2D-based tasks in wearable AR as 2D-based tasks are conducted most frequently.
The subject has to recognize the position and orientation of the physical object and has to find the virtual object. After understanding the configuration, the subject selects the virtual control object with a hand gesture or touch gesture, then translates and rotates it, and finally matches it with the target object. In the hand gesture interaction, the control object is selected by performing a pinch gesture (Figure 6b) with the index finger and thumb. Whenever the subject's hand is moved and rotated, the control object is also moved and rotated. In the case of touch interaction, the subject performs the translation and rotation operations separately, since the touch interaction cannot support both operations simultaneously. For this reason, a translation mode and a rotation mode are provided. The task is completed when the control and target objects are matched geometrically.
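One plausible way to decide that the control object and the invisible model of the target are "matched geometrically" is to threshold both the positional distance and the rotation angle between their poses. The sketch below assumes poses given as a 3D position and a unit quaternion in (w, x, y, z) order; the function names and the thresholds `max_dist` and `max_angle_deg` are hypothetical, since the paper does not report the criterion it used.

```python
import math


def rotation_angle_deg(q_a, q_b):
    """Smallest rotation angle, in degrees, between two unit quaternions."""
    # abs() treats q and -q as the same rotation.
    dot = abs(sum(a * b for a, b in zip(q_a, q_b)))
    dot = min(1.0, dot)                  # guard against rounding above 1
    return math.degrees(2.0 * math.acos(dot))


def is_matched(pos_c, quat_c, pos_t, quat_t,
               max_dist=0.01, max_angle_deg=5.0):
    """True when the control pose is within both tolerances of the target."""
    close_enough = math.dist(pos_c, pos_t) <= max_dist
    aligned = rotation_angle_deg(quat_c, quat_t) <= max_angle_deg
    return close_enough and aligned
```

Checking both terms reflects the fact that the task requires 3D translation and 3D rotation together, so a correct position with a wrong orientation does not complete the task.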

Experimental Results
First, for the analysis of task performance depending on visual contexts, an ANOVA was performed for three visual augmentations. Note that most of the data satisfied ANOVA assumptions. However, some of them did not satisfy ANOVA assumptions (the assumption of homogeneity of variances) due to the limited number of experimental data. For this reason, we used Welch's ANOVA tests rather than general ANOVA tests through the study [46]. As shown in Figure 15, there was a significant difference (p < 0.05) for the 3D matching task, which is complicated and challenging. On the other hand, there was no statistical difference in the dragging and dropping task, which is rather a simple task. The non-stereo rendering and indirect augmentation with video background outperformed the stereo rendering and direct augmentation. The main reason for these results is that the 3D matching task requires much attention and sophisticated manipulation of the control object, so the FOV played an essential role in task performance. On the other hand, the FOV did not influence the performance of the dragging and dropping task very significantly. The qualitative analysis was conducted based on the validated ten questions from the SUS questionnaire [45]. Summarized ten questions in the SUS questionnaire include (1) frequent system use, (2) unnecessarily complex, (3) easy to use, (4) need technical support, (5) functions well integrated, (6) too much inconsistency, (7) easy to learn, (8) very cumbersome, (9) confident using the system, and (10) need to learn many things beforehand. However, original questions without any modification were given to the subject and used for the questionnaire evaluation. Readers are referred to see the questions in Reference 45. As shown in Figure 16, there were statistical differences in all questions except Q2 and Q4. 
Overall, the non-stereo rendering and indirect augmentation with video background showed better task performance than the stereo rendering and direct augmentation and the non-stereo rendering and indirect augmentation without video background. The non-stereo rendering and indirect augmentation without video background showed the worst task performance. The qualitative analysis was conducted based on the validated ten questions from the SUS questionnaire [45]. Summarized ten questions in the SUS questionnaire include (1) frequent system use, (2) unnecessarily complex, (3) easy to use, (4) need technical support, (5) functions well integrated, (6) too much inconsistency, (7) easy to learn, (8) very cumbersome, (9) confident using the system, and (10) need to learn many things beforehand. However, original questions without any modification were given to the subject and used for the questionnaire evaluation. Readers are referred to see the questions in Reference [45].
As shown in Figure 16, there were statistical differences in all questions except Q2 and Q4. Overall, the non-stereo rendering and indirect augmentation with video background showed better qualitative evaluation than the stereo rendering and direct augmentation and the non-stereo rendering and indirect augmentation without video background. The non-stereo rendering and indirect augmentation without video background showed the worst usability evaluation. The qualitative analysis was conducted based on the validated ten questions from the SUS questionnaire [45]. Summarized ten questions in the SUS questionnaire include (1) frequent system use, (2) unnecessarily complex, (3) easy to use, (4) need technical support, (5) functions well integrated, (6) too much inconsistency, (7) easy to learn, (8) very cumbersome, (9) confident using the system, and (10) need to learn many things beforehand. However, original questions without any modification were given to the subject and used for the questionnaire evaluation. Readers are referred to see the questions in Reference 45. As shown in Figure 16, there were statistical differences in all questions except Q2 and Q4. Overall, the non-stereo rendering and indirect augmentation with video background showed better task performance than the stereo rendering and direct augmentation and the non-stereo rendering and indirect augmentation without video background. The non-stereo rendering and indirect augmentation without video background showed the worst task performance. For the post hoc analysis concerning the visual context, in Q1 and Q9, there was no difference between the stereo rendering and direct augmentation and the non-stereo rendering and indirect augmentation, but there was a statistical difference between the non-stereo rendering and indirect augmentation without video background and the others concerning the visual context (p < 0.05). 
In particular, in Q3 and Q8, there was a significant difference between the three methods. The nonstereo rendering and indirect augmentation showed better easy-to-use and less cumbersome-to-use ratings. In Q6, Q7, and Q10, there was no difference between the non-stereo rendering and indirect augmentation and the stereo rendering and direct augmentation. We found that, although the stereo For the post hoc analysis concerning the visual context, in Q1 and Q9, there was no difference between the stereo rendering and direct augmentation and the non-stereo rendering and indirect augmentation, but there was a statistical difference between the non-stereo rendering and indirect augmentation without video background and the others concerning the visual context (p < 0.05). In particular, in Q3 and Q8, there was a significant difference between the three methods. The non-stereo rendering and indirect augmentation showed better easy-to-use and less cumbersome-to-use ratings. In Q6, Q7, and Q10, there was no difference between the non-stereo rendering and indirect augmentation and the stereo rendering and direct augmentation. We found that, although the stereo rendering and direct augmentation could match the virtual object with the real object and provide immersive visualization, its task performance was worse than the non-stereo rendering and indirect augmentation due to the narrower FOV. We found that the amount of information provided to the subject was a more influential factor than FOV in currently available AR smart glasses, although some mismatch problems did exist. Nevertheless, all the subjects agreed that visual augmentation using AR smart glasses is a promising tool for task assistance.
For the quantitative analysis, according to interaction gestures in wearable AR, a t-test was performed for task completion time (comparison between two methods). There was a statistical difference (p < 0.05) between interaction gestures in the 3D matching task, but there was no statistical difference in the selection task, as shown in Figure 17. In particular, in the case of hand gesture interaction, the 3D matching was six times faster than the multi-touch interaction using the touchpad.
AR smart glasses is a promising tool for task assistance.
For the quantitative analysis, according to interaction gestures in wearable AR, a t-test was performed for task completion time (comparison between two methods). There was a statistical difference (p < 0.05) between interaction gestures in the 3D matching task, but there was no statistical difference in the selection task, as shown in Figure 17. In particular, in the case of hand gesture interaction, the 3D matching was six times faster than the multi-touch interaction using the touchpad. For the qualitative analysis of the interaction method, the hand gesture interaction method showed better results than the multi-touch interaction method in all questions except Q4 (needtechnical-support), as shown in Figure 18. According to interviews, most participants mentioned that they could easily learn the method and thus could perform tasks more easily and effectively.  For the qualitative analysis of the interaction method, the hand gesture interaction method showed better results than the multi-touch interaction method in all questions except Q4 (need-technical-support), as shown in Figure 18. According to interviews, most participants mentioned that they could easily learn the method and thus could perform tasks more easily and effectively.
We performed quantitative and qualitative analyses concerning visual contexts and gesture interactions. The most complicated and difficult of the three tasks was 3D matching, which was conducted to evaluate both visual contexts and gesture interactions. For currently available AR smart glasses, such as the Epson Moverio, the non-stereo rendering and indirect augmentation method with video background using the hand gesture outperformed the other methods. One of the main reasons is that the AR smart glasses cannot provide an FOV wide enough to show all the visual information using the stereo rendering and direct augmentation. It would be more valuable to analyze the relation between gesture interactions and visual contexts. However, we found inherent limitations in implementing them effectively on the Epson Moverio. For this reason, further study is still needed to evaluate the relation between them using different AR smart glasses.

Comparative Evaluation of Different FOVs
We performed an additional comparative evaluation for the 3D matching task using AR smart glasses with different FOVs. This task was similar to the task performed in Section 4. The difference is that, in Section 4, the FOV was determined by the visual context on the same see-through display, whereas in Section 5, different AR smart glasses were used. According to its hardware specification, HoloLens provides a wider FOV.
Note that, in this experiment, the control object is physical, and the target object is virtual. In other words, the user holds a physical control object instead of a virtual control object and tries to move and rotate it for 3D matching with a virtual target object. One of the main reasons to perform this experiment is that workers must sometimes hold physical tools and parts to complete assembly and inspection operations in manufacturing environments. Visual guidance and virtual objects are useful for conducting the required task effectively. In this comparative evaluation, the non-stereo rendering and indirect augmentation without video background was excluded since it showed worse performance than the others in the previous assessment. As mentioned in Section 4.4, this method showed the worst quantitative and qualitative evaluations. In particular, the task performance concerning the different sizes of FOV was measured for the stereo rendering and direct augmentation.

Experimental Setup and Design
This evaluation was conducted in the same environment as the first experiment. In addition to Epson Moverio BT-200, an MS HoloLens with a larger FOV was used to evaluate the stereo rendering and direct augmentation.
Another fifteen college students (twelve male and three female students) with an average age of 26.3 (SD = 1.5) participated in this user study. No subjects had previously experienced wearable AR. Before the experiment, the participants were also told about the visual interaction methods and related hardware and software according to the visual context. After that, tasks were performed in a random order for the three different methods.
The 3D matching task between the physical control and virtual target objects was conducted to evaluate the task performance, as shown in Figure 19. This experiment was designed to analyze how well the user could recognize the position and rotation of the target object (virtual object) and how accurately they could match the target object using the control object (physical object). The control and target objects have unique shapes, as shown in Figures 19 and 20. In particular, the shape of the physical target object used in this study was the same as that used in a previous research work [47], which studied the user perception of matching two objects.
When the user holds the physical control object and looks at it through the optical see-through display of the AR smart glasses, the virtual model is augmented on the physical control object, and the virtual target object is also visually augmented in the wearable AR space. For each experiment, the target object is located in an arbitrary location with a different orientation. After the user recognizes the position and orientation of the target object through the smart glasses, they translate and rotate the physical object to match it with the target object. The condition for successful matching is based on the translation distance and orientation distance between the two objects. When both differences are within the given tolerance, the task is completed.
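The matching condition described above can be illustrated with a minimal sketch. The paper does not state how the two distances are computed or which tolerances were used, so the quaternion-based angle, the function names, and the default tolerance values below are assumptions made only for illustration.

```python
import math


def translation_distance(p, q):
    """Euclidean distance between two 3D positions."""
    return math.dist(p, q)


def orientation_distance_deg(q1, q2):
    """Smallest rotation angle (degrees) between two unit quaternions (w, x, y, z)."""
    # abs() treats q and -q as the same rotation.
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return math.degrees(2.0 * math.acos(dot))


def is_matched(control_pos, control_rot, target_pos, target_rot,
               pos_tol=0.02, rot_tol_deg=10.0):
    """True when both distances are within tolerance.

    pos_tol (metres) and rot_tol_deg (degrees) are hypothetical values,
    not the tolerances used in the study.
    """
    return (translation_distance(control_pos, target_pos) <= pos_tol
            and orientation_distance_deg(control_rot, target_rot) <= rot_tol_deg)


if __name__ == "__main__":
    identity = (1.0, 0.0, 0.0, 0.0)
    print(is_matched((0.0, 0.0, 0.0), identity, (0.01, 0.0, 0.0), identity))
```

Requiring both conditions at once mirrors the text: the task completes only when the translation and orientation differences are simultaneously within tolerance.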

Experimental Results
A comparative evaluation of visual interaction according to the visual context using different smart glasses was performed. A Welch's ANOVA was performed on the task completion time. The analysis verified that there was a significant difference (p < 0.05) in task completion time according to the visual context and FOV. According to the post hoc analysis, there was no statistical difference between the stereo rendering and direct augmentation method with a small FOV and the non-stereo rendering and indirect augmentation with video background method, but the stereo rendering and direct augmentation method with a larger FOV differed significantly from both (p < 0.05), as shown in Figure 21.
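For readers unfamiliar with Welch's ANOVA, the following sketch computes its F statistic and degrees of freedom for groups with unequal variances. It is not the authors' analysis script; the function name is an assumption, and the completion times in the usage example are made-up placeholders, not data from this study. The p-value would then be read from an F distribution with the returned degrees of freedom, for example with a statistics package.

```python
from statistics import mean, variance


def welch_anova(groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA.

    groups: list of lists of observations (each with at least two values
    and non-zero sample variance).
    """
    k = len(groups)
    n = [len(g) for g in groups]
    m = [mean(g) for g in groups]
    # Weight each group by n_i / s_i^2 so noisy groups count less.
    w = [ni / variance(g) for ni, g in zip(n, groups)]
    w_total = sum(w)
    weighted_mean = sum(wi * mi for wi, mi in zip(w, m)) / w_total
    numerator = sum(wi * (mi - weighted_mean) ** 2
                    for wi, mi in zip(w, m)) / (k - 1)
    lam = sum((1 - wi / w_total) ** 2 / (ni - 1) for wi, ni in zip(w, n))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam
    f_stat = numerator / denominator
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, df1, df2


if __name__ == "__main__":
    # Placeholder completion times in seconds for three conditions.
    times = [[41.0, 38.5, 44.2, 40.1], [39.0, 42.3, 37.8, 43.5], [25.2, 27.9, 24.1, 26.6]]
    print(welch_anova(times))
```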

3D Matching Between Virtual Control and Physical Target Objects (Revisited)
The completion time using the stereo rendering and direct augmentation in the first experiment (Figure 15) was longer than that using the non-stereo rendering and indirect augmentation with a small FOV. However, there was no significant difference between the two augmentations in the second evaluation (Figure 21). We assumed that the different result was caused by whether the control object was real or virtual. To confirm this assumption, an additional experiment was performed with ten participants (male: six, female: four, mean age: 29.3 years old, SD: 4.9) [44]. The additional experiment was performed with 3D matching in the same environment as the second experiment. However, unlike the second experiment, a virtual object was used as the control object to match the physical object, as in the first experiment. A quantitative analysis of completion time was performed. We found that there was a statistically significant difference between the two augmentations, as shown in Figure 22. The stereo rendering and direct augmentation method had approximately double the completion time of the non-stereo rendering and indirect augmentation method (p < 0.05). Note that we obtained similar results for task performance with the virtual control object for 3D matching in Section 4. The conclusion based on the results of the two experiments is as follows: when a physical control object is used, the user does not need to look at it continuously during the 3D matching process because the physical object can be manipulated intuitively. On the other hand, the user must look at a virtual control object continuously to perform the 3D matching task.

Real Use Case Analysis for Industrial Workers
The experiments in Sections 4 and 5 were performed in a controlled laboratory environment. However, because there might be a difference between the controlled environment and the actual manufacturing environment, an additional qualitative evaluation was performed on industrial workers of a small and medium-sized company that assembles air conditioners and heaters, as shown in Figure 23.
Figure 23. Industrial workers experiencing task assistance through AR smart glasses.
For this real use case analysis, seven industrial workers (average age: 29.7 years old, SD: 3.59) participated in the experiment, and none of them had any prior knowledge of smart glasses. First, we explained the purpose of the experiment to the workers and presented the concept, methods, and functionality of devices, such as smart glasses, visual context, and interaction methods. Then, the workers experienced the task assistance manual required for the actual assembly through smart glasses. In addition, we showed related AR manufacturing services through videos and animations.
After the experiment, the workers were asked to fill out the questionnaire. We analyzed the cognitive load in performing manufacturing tasks since the real industrial field is different from a controlled environment, and wearable AR should not impose an additional cognitive burden. We conducted another qualitative evaluation based on the validated questions from the NASA task load index (NASA-TLX) (1: very low, 7: very high) [48]. Questions in the NASA-TLX questionnaire include (1) required-mental and perceptual activity, (2) required-physical activity, (3) time pressure felt, (4) hard to work, (5) successful in accomplishing goals, and (6) stressed. We used these questions without any modifications; details are given in Reference [48]. According to the questionnaire analysis (Figure 24), subjects mentioned that they could perform given tasks with less cognitive load with the help of wearable AR. They were satisfied with the visualization and interaction of manufacturing information using AR smart glasses. Some of them commented that wearable AR was very effective when used for training or task assistance with beginners.
However, although AR smart glasses were useful for task assistance, the workers also mentioned that currently available AR smart glasses were heavy to wear, and the screen was not visible in bright places. Also, for the visual context, they preferred the stereo rendering and direct augmentation to the non-stereo rendering and indirect augmentation as there was no mismatch between the physical and virtual objects. However, they suggested that the limited amount of information visible at any given time should be improved. For the interaction method, they commented that it would be easier to perform the touch interaction when performing simple tasks. However, they mentioned that using two hands was effective in the actual manufacturing environment.

Discussion on User Studies and Suggestions for User-Centric Task Assistance
The experimental results show that the task performance of the user interaction is influenced by both the visual context and the gesture interaction.

Evaluation for the Visual Context
For the 3D matching task concerning the visual context in the first user study of Section 4, the stereo rendering and direct augmentation method took approximately twice as long to complete the task as the non-stereo rendering and indirect augmentation. The reason is that the non-stereo rendering and indirect augmentation can provide more information than the stereo rendering and direct augmentation because the FOV through the camera is broader than that through the optical display. Due to this limitation, in the stereo rendering and direct augmentation, the user had to move their head frequently to check the location of the target object to select, translate, and rotate the virtual control object.
In the second user study in Section 5, the stereo rendering and direct augmentation with a wider FOV (still similar to or narrower than that of a camera) showed the best task performance. In this evaluation, the larger FOV made it possible for the user to recognize the position and orientation of the target object easily, so the matching task with the control object was performed more effectively. In particular, when the FOV of the stereo rendering and direct augmentation and that of the non-stereo rendering and indirect augmentation were similar, the stereo rendering and direct augmentation was more effective because of the visual matching between virtual and physical artifacts. On the other hand, the non-stereo rendering and indirect augmentation allowed the physical objects to be visually overlapped only through the display of the smart glasses. Thus, the video image was mismatched with physical objects, confusing the user while they performed their tasks.
Depending on the role of the control and target objects for the 3D matching task in the first and second experiments, there was a difference in task performance. The reason for this result is related to whether the control object with which the user interacts is virtual or real. For the 3D matching in the first user study, the control object was virtual. Thus, when the user moved and rotated the control object, the control object always had to be in the user's field of view. However, as the control object in the second evaluation was real, it was not necessary to place the control object in the user's field of view in the AR smart glasses. That is, when the user recognized the location and orientation of the target object while holding the control object, the task could be conducted without looking at the control object.
According to the qualitative analysis and the interview for the visual context-based methods, the stereo rendering and direct augmentation method provides a promising direction for wearable AR as it can support almost exact matching between the virtual and physical objects in wearable AR. However, some participants mentioned that the stereo rendering and direct augmentation also had a limitation in providing enough information due to the narrow FOV, which prevented them from performing the given tasks naturally. In the non-stereo rendering and indirect augmentation with the video background, physical objects were duplicated through the display of the smart glasses. Interestingly, some users performed tasks just by looking only at the display of the smart glasses in the video-based augmentation. On the other hand, the non-stereo rendering and indirect augmentation without background was not suitable for the task assistance as there was much difference between the virtual object and the real environment. We also found similar results from industrial workers in a real industrial environment. Therefore, the analysis and evaluation results will be effectively utilized in designing and implementing manufacturing services using AR smart glasses.

Evaluation for the Gesture Interaction
For the 3D virtual object matching task concerning different gesture interactions, the hand gesture interaction demonstrated faster task completion time than the multi-touch interaction. The faster task completion is due to the concurrent control of the interaction (selection, translation, rotation, etc.) and the intuitiveness of using a real hand to perform the task. Since the hand gesture-based interaction follows the same metaphor as performing typical tasks in the real environment, the user can perform the translation and rotation simultaneously with their hand, without a separate mode change through menu selection. On the other hand, in the case of the multi-touch interaction, whenever a control object is selected through the touchpad of the smart glasses or the transformation needs to be changed, a lot of interaction with the gizmo is required. Furthermore, the user must change the mode to select and manipulate the control object whenever a different transformation is required.
Interviews also showed that interaction with hand gestures was more natural and effective. However, it was uncomfortable to wear another sensor for detecting hand gestures. On the other hand, the multi-touch interaction made sophisticated tasks difficult because the mode had to be switched sequentially and frequently (for example, in the order selection → object translation → object rotation) to complete a specific interaction. As a result, the satisfaction level was lower than that of the hand gesture.
There was no statistical difference between different gesture interactions in 2D interactions, such as selection. Interestingly, when performing the 2D-based tasks, some users moved their heads to specific locations and then clicked menus rather than moving the cursor with the touchpad.

Challenges for Supporting User-Centric Task Assistance
Based on the evaluation results, we discuss some considerations and challenges for user-centric task assistance using AR smart glasses.

Visual Layout for Cognitive Information Visualization
It has been shown that due to the narrow FOV of currently available AR smart glasses, the necessary information cannot be sufficiently visualized on the display of smart glasses, and the augmented information may be out of the FOV of the screen. However, for smart glasses to support a wider FOV, their cost must increase.
For this reason, for currently available AR smart glasses, it is necessary to consider a dynamic visual layout for the overlaid information to make it easier for the user to recognize and understand the information. One of the important considerations for this purpose is to analyze the relationship and priority between visual AR information. Usually, the worker requires a different level of visual information to perform given tasks. Thus, it is essential to provide the specific information that is more important for performing the tasks based on the relationship and priority of the AR information. A visual layout such as the Fisheye view [49,50] calculates the degree of interest of the augmented information and gives information with a higher priority a stronger visual focus. Thus, by providing a view focused on the more appropriate and important information for the worker's situation, it is possible to provide more effective task support to workers.
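As an illustration of a degree-of-interest layout, the sketch below follows the classic fisheye formulation in which the degree of interest is an item's a priori importance minus its distance from the current focus. The mapping from degree of interest to a display scale, the item fields, and all names and default values are assumptions for illustration only and are not part of the evaluated system.

```python
def degree_of_interest(priority, distance):
    """Fisheye-style degree of interest: a priori importance minus distance from the focus."""
    return priority - distance


def layout_scales(items, focus_step, min_scale=0.4, max_scale=1.0):
    """Map each AR information item to a display scale.

    items: list of (name, priority, step), where `step` is the task step the
    item belongs to (a hypothetical field) and `focus_step` is the step the
    worker is currently performing.
    """
    doi = {name: degree_of_interest(priority, abs(step - focus_step))
           for name, priority, step in items}
    lowest, highest = min(doi.values()), max(doi.values())
    span = highest - lowest
    if span == 0:
        # All items are equally interesting, so show them all at full size.
        return {name: max_scale for name in doi}
    # Linearly rescale the degree of interest into [min_scale, max_scale].
    return {name: min_scale + (max_scale - min_scale) * (value - lowest) / span
            for name, value in doi.items()}


if __name__ == "__main__":
    items = [("torque spec", 3, 2), ("next part", 2, 3), ("safety note", 1, 5)]
    for name, scale in layout_scales(items, focus_step=2).items():
        print(name, round(scale, 2))
```

Items that are both high priority and close to the current task step keep a large scale, while distant, low-priority items shrink rather than disappear, which is one way to cope with a narrow FOV.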

Hybrid User Interaction
Visual components augmented in a wearable AR environment are composed of various forms, such as a menu, a widget, and a 3D model. According to the experimental evaluation in this research, the multi-touch interaction is feasible for manipulating 2D virtual objects, e.g., for selection and dragging. On the other hand, the hand gesture interaction is suitable for manipulating 3D models, e.g., for translating and rotating. Therefore, a hybrid user interface can simultaneously support multi-touch interaction to perform various 2D tasks and hand gesture interaction to perform more complex 3D tasks. Besides, as smart glasses evolve, research on interfaces for interaction will be steadily conducted, and more effective interactions will be possible for the hybrid user interface.

Enhanced Hand Gesture Interaction
In a wearable AR environment, the user interacts with physical objects, as well as virtual objects, through AR interactions in the physical environment. However, for a virtual control object, the position and orientation of the virtual object are tracked through the camera, and the virtual object should be carefully manipulated with the hand gesture interaction, while the virtual object is augmented within the FOV. For this reason, to manipulate a virtual control object, the user must always look at it. On the other hand, a physical control object does not need to be viewed during the interaction. Therefore, there is a difference between the interaction with the real object and the interaction with the virtual object.
For this reason, we require a new method for supporting more effective manipulation of virtual objects compared with that of physical objects. One way to do this is to track the movement of the user's hand with additional sensors. The translation and rotation of virtual objects can be inferred even if the user does not look at the virtual object through the movement of their hand. For example, after capturing a virtual object through a hand gesture interaction, the moving speed and acceleration information of the user's hand with IMU is tracked to calculate the movement of the virtual object [51]. In this way, it is possible to interact with a virtual object more easily and more like the interaction with a real object.
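The idea of inferring object movement from hand motion can be sketched as simple dead reckoning. The example below assumes that IMU accelerations are already gravity-compensated and expressed in the world frame, and it ignores sensor drift and rotation. The function name and sample values are hypothetical, and the method of Reference [51] may differ.

```python
def integrate_hand_motion(accelerations, dt, v0=(0.0, 0.0, 0.0), p0=(0.0, 0.0, 0.0)):
    """Estimate hand displacement from acceleration samples.

    accelerations: iterable of (ax, ay, az) in m/s^2, gravity already removed.
    dt: sampling interval in seconds.
    Returns (position, velocity) after the last sample.
    """
    velocity = list(v0)
    position = list(p0)
    for sample in accelerations:
        for axis in range(3):
            # Semi-implicit Euler: update velocity first, then position
            # using the updated velocity.
            velocity[axis] += sample[axis] * dt
            position[axis] += velocity[axis] * dt
    return tuple(position), tuple(velocity)


if __name__ == "__main__":
    # A grabbed virtual object would be offset by the estimated hand displacement.
    samples = [(1.0, 0.0, 0.0)] * 10
    displacement, _ = integrate_hand_motion(samples, dt=0.1)
    print(displacement)
```

Because integration errors accumulate quickly, a practical system would likely correct this estimate whenever the object re-enters the camera's view.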

Conclusions
As many AR smart glasses are currently available [6], it is essential to make visual augmentation more natural and intuitive in terms of visual contexts and gesture interaction. However, there has been little research work on investigating this issue in depth. In this study, user interactions based on visual contexts and gesture interactions were classified and evaluated using AR smart glasses. We performed quantitative and qualitative analyses, including performance measurement and questionnaire evaluation. The results verified that there was a significant difference in task performance according to visual contexts and gesture interactions as the task complexity increased. They also showed that both FOV and visual registration between virtual and physical artifacts were important.
We have analyzed not only the interaction between virtual objects but also the interaction between physical and virtual objects. Depending on the role of the control and target objects for the 3D matching task, there was a difference in task performance. The reason for this result is related to whether the control object with which the user interacts is virtual or real.
We have also found that both the FOV and the gesture interaction influence task performance, and they can complement each other. However, currently available smart glasses have narrower FOV than is necessary, and the visual matching assistance is not sufficient for use in real-world tasks. Although interviews showed that interaction with hand gestures was more natural and effective, it was uncomfortable to wear another sensor for detecting hand gestures. Nevertheless, we have demonstrated the potential utilization of AR smart glasses. To overcome still existing limitations on the visual context and gesture interaction, it is necessary to provide a dynamic visual layout concerning narrow FOV and to support hybrid user interfaces for more accurate and natural interaction.
Actual manufacturing systems may have different influential factors compared with the controlled laboratory, although a real use case study was performed qualitatively with industrial workers. Therefore, further study is required to analyze the visual context and interaction methods in different real manufacturing environments. First, we plan to apply the proposed approach to a smart factory testbed that produces motors and to evaluate its performance quantitatively and qualitatively. Secondly, it is necessary to evaluate the relation between gesture interactions and visual contexts using different AR smart glasses. Thirdly, we need to improve the accuracy of the hand gesture and visual registration in wearable AR. Fourthly, we need further investigation of different types of AR smart glasses for analyzing visual contexts and gesture interactions concerning hardware specification and price. Finally, it would be valuable to extend the visualization and interaction methods to address some of the challenges described previously.