Effects on Co-Presence of a Virtual Human: A Comparison of Display and Interaction Types

Abstract: Recently, artificial intelligence (AI)-enabled virtual humans have been widely used in various fields in our everyday lives, such as for museum exhibitions and as information guides. Given the continued technological innovations in extended reality (XR), immersive display devices and interaction methods are evolving to provide a feeling of togetherness with a virtual human, termed co-presence. With regard to such technical developments, one main concern is how to improve the experience through the sense of co-presence as felt by participants. However, virtual human systems still offer limited guidelines on effective methods, and there is a lack of research on how to visualize and interact with virtual humans. In this paper, we report a novel method to support a strong sense of co-presence with a virtual human, and we investigate the effects on co-presence through a comparison of display and interaction types. We conducted the experiment according to a specified scenario between the participant and the virtual human, and our experimental study showed that subjects who participated in an immersive 3D display with non-verbal interaction felt the greatest co-presence. Our results are expected to provide guidelines for constructing AI-based interactive virtual humans.


Introduction
In the days ahead, the influence of virtual reality (VR) and augmented reality (AR), commonly referred to in brief as extended reality (XR), is expected to grow rapidly. XR has received significant attention as a key technology for education, training, gaming, advertising, and shopping given its ability to add effective information to our everyday lives [1,2]. This technology can create various situations with useful datasets and interactions to improve a user's experience [3]. In particular, it is actively used for situational training in place of actual people [4]. More recently, virtual humans (computer-generated characters) in XR have been developed to act as guides that convey meaningful real-world information, combined with artificial intelligence (AI) technology. For instance, interactive virtual humans can empower users to collaborate in three-dimensional immersive tele-conferencing, during which one can communicate with a teleported remote other as if present [5,6]. Additionally, after reconstructing a virtual human from a captured real human, it can be used for the purpose of exchanging responses with a participant, such as a virtual human who answers questions [7]. Moreover, with the development of collaborative virtual environments (CVEs), online games, and the increasing popularity of metaverses, virtual humans are becoming even more important, and this technology allows participants to interact with virtual humans as if they are actually present in the same room [8,9]. Despite such provisions, virtual human systems still offer limited guidelines on effective methods, and there is a lack of research on how to visualize and interact with virtual humans. In particular, it is necessary to increase the sense of co-presence, i.e., the feeling of being together, via effective visualization and interaction methods with perceived virtual humans in XR environments.
In this line of thinking, herein, we propose a method to increase and improve the co-presence of virtual humans that starts with a comparison of different visual and interaction types. For example, imagine a situation in which a user is wearing a helmet-type immersive device; we need to provide a suitable interaction method, such as dialogue and gestures, for greater co-presence. Ideally, all interaction types would be provided for the XR virtual human; in practice, given constraints on device setups and on computing and development resources, we need to know which interaction method is more effective. Our paper especially focuses on display and interaction elements for co-presence (e.g., 2D vs. 3D, verbal vs. non-verbal). We note that the main XR technologies consist of visualization, interaction, and simulation. Visualization and interaction are related to the setup of the XR virtual human within these technologies [10]. Thus, for an effective setup of a virtual human, we performed comparative experiments on different display and interaction types using a virtual human, and our experimental study with subjects participating in an immersive 3D display using a head-mounted display (HMD) showed that gesture-based non-verbal interaction is most effective. Additionally, for a 2D XR environmental situation, it was found that dialogue-based verbal interaction is more important. Figure 1 shows typical interaction situations with a virtual human. The participant labeled "Real Human" in Figure 1 can communicate and interact via inputs such as voice and gestures with the perceived virtual human in the given display system, and the virtual human can provide appropriate response feedback such as facial expressions, answers, and interactive animation behaviors. In this case, a sensing device consisting of an integrated depth camera is mainly used to recognize the participant's voice and gestures.
Note that the depth camera has pixel values that correspond to distance, allowing the extraction and estimation of the human's body skeleton for gesture recognition; the system also contains controllers for voice interactions [11]. In this case, we need to know which combinations can increase co-presence; this is the focus of this paper.
Figure 1. Display and interaction types with the virtual human and an example of a user's interaction with a virtual human: our figure presents a scene in which a user's gesture-based non-verbal interaction is given to the virtual human in an immersive 3D XR environment using a head-mounted display.
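The depth camera's per-pixel distance values can be back-projected into 3D points for skeleton estimation. The following minimal sketch illustrates the idea under a standard pinhole camera model; the intrinsic parameters (fx, fy, cx, cy) are illustrative assumptions, not the actual device calibration.

```python
def depth_to_point(u, v, depth_m, fx, fy, cx, cy):
    """Back-project a depth pixel (u, v) with depth in metres into a
    3D camera-space point using a pinhole model.

    (fx, fy) are focal lengths in pixels; (cx, cy) is the principal
    point. All values here are hypothetical, for illustration only.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)
```

A pixel at the principal point maps straight onto the optical axis, e.g. `depth_to_point(320, 240, 2.0, 500.0, 500.0, 320.0, 240.0)` yields the point `(0.0, 0.0, 2.0)`.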
The remainder of our paper is organized as follows. Section 2 reviews works related to our research. Section 3 provides a system overview, and Section 4 describes the detailed implementation. Section 5 provides the experimental setup and procedure for evaluating co-presence in the system configuration. Section 6 reports the main results and provides a discussion. Finally, Section 7 summarizes the paper and concludes with directions for future work.

Interactive Virtual Human
Life-sized virtual humans made to resemble actual humans have been widely used as an important media technology in our everyday lives [12]. A virtual human can use language in an interactive way with appropriate gestures and can show emotional reactions to verbal and non-verbal stimuli [13]. Additionally, the virtual human can provide face-to-face communication and interaction with a remote teleported avatar [14]. A useful example is to apply virtual humans in training scenarios such as interviews and medical team training [15,16]. Another notable example using interactive virtual humans is museum application scenarios [17]. When a visitor enters the exhibit space, the virtual human appears on a stage in the museum room and can explain key necessary information, in the same way as a museum guide. Then, the visitor can ask questions and obtain answers from the virtual human. More recently, many researchers have been concerned with finding an intuitive way for virtual humans to control physical objects in augmented reality (AR), combining the real world and virtual humans [18,19]. It remains to be determined how these research results should be combined to achieve the highest level of co-presence with the virtual human.

Display Types for the Virtual Human
In more recent work, researchers have introduced XR displays to visualize a life-sized virtual human, such as head-worn displays, multi-projection systems for immersive environments, and auto-stereoscopic 3D displays [20,21]. As a future teleconference service scenario, Escolano et al. introduced Holoportation, a system using multiple depth cameras and a see-through head-mounted display that allows precise 3D virtual human models to be transmitted to a remote site in real time [22]. Shin and Jo suggested a mixed-reality human (MRH) system in which the virtual human is combined with physicality as the real part [23]. In response to recent requirements, Hologram showcased a holographic virtual stage and a holographic virtual character to make family life more convenient and the environment more intelligent [24]. Our work was designed to further improve upon these pioneering works by comparing different types of displays in relation to the effects of interaction with a virtual human.

Interaction Types for the Virtual Human
A virtual human is a form of computer system output that strives to engage with participants through natural language interfaces (verbal interaction) and through non-verbal interaction, which can be performed with facial expressions and gestures [25,26]; the virtual human provides a response to the real person who interacts with it. Our work investigates such issues, where it is desirable that the virtual human mimic the behavior of a real person.
Image-based full-body capture technology refers to three-dimensional reconstruction of the appearance of the body, hands, and face from image data, and this can also capture human movements and changes. With the recent development of deep learning technology and the rapid rise of XR, full-body capture technology can easily and quickly be applied to dynamic virtual humans [27]. In particular, with the development of deep neural network (DNN) algorithms, it has reached the level of estimating three-dimensional body poses and shapes for interaction [28,29]. However, there are still further research topics to be investigated for virtual human interaction, such as effective motion and feedback. In this paper, our focus is on finding an effective configuration among the various interaction types in terms of co-presence.

Co-Presence for the Virtual Human
There have been a few previous attempts to evaluate co-presence levels when interacting with a virtual human in various XR environments. As a representative research result, Mathis et al. explored the perception of virtual humans in virtual environments and presented the impact of the virtual human's fidelity on interaction [30]. In another study, Slater et al. conducted an experiment in which participants responded appropriately to a negative or positive virtual audience, and the result showed that virtual humans are more effective when perceived as humans [31]. Zanbaka et al. reported an experiment in which participants' task performance was inhibited by the co-presence (being together) of virtual humans [32]. Additionally, Jo et al. presented the effects of the type of virtual human and the background representation on the participants' co-presence with respect to the design of future virtual humans [33]. Co-presence is accordingly defined as the feeling that the other participants in the virtual environment exist [34].
To measure the user's feeling of being together with the virtual human, called co-presence, previous studies usually conducted surveys via questionnaire [34,35]. More recently, physiological measurements of parameters such as heart rate and skin conductance have been used to investigate the effects on co-presence under different XR conditions [36,37]. However, most previous works have focused on the shape of the virtual human, such as gaze direction and animation qualities, and no comprehensive work has been done on the effect of different display and interaction types. Thus, our work investigated such issues regarding visual and interaction components. Table 1 shows several examples of test conditions for our experiment to determine their effects on co-presence with the virtual human (e.g., various types of display, such as a head-mounted display, a large display to visualize a life-scaled virtual human, and multiple projectors, as well as verbal or non-verbal interaction methods). Usually, participants in a VR/AR system with a virtual human experience stereoscopic visualization; more recently, large 2D displays have been used to overcome the inconvenience of wearing a head-mounted display (HMD) [23]. As another issue, when interacting with a virtual human, just as people communicate with each other in everyday life, participants can use their voice as well as gestures and facial expressions. Interaction methods are divided into conversation-based verbal methods and non-verbal characteristics that do not depend on dialogue. In our system, the virtual human was designed to provide verbal communication or non-verbal interaction in the given display systems, such as the 2D mono or 3D stereoscopic types, and the virtual human could automatically give the participant appropriate response feedback, such as facial expressions, answers, and interactive animation behaviors, according to the participant's input.
Figure 2 shows our interaction handling process between the user and the virtual human, in which various verbal and non-verbal inputs are recognized to establish the virtual human's response [19]. For example, to determine the user's gesture, the user's specific joint positions are extracted from the depth camera and processed. Subsequently, the system recognizes whether the user performed a gesture at a specific point. At this time, if the score of the highest-scoring motion among the pre-trained candidate motions is higher than a pre-defined threshold (T), we determine that the user performed the corresponding gesture. Cases involving other motions or an absence of motion are set to "None", and a case with a score lower than T is recognized as "None" as well.
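The thresholding step above can be sketched as follows; the gesture names, scores, and the threshold value T = 0.8 are hypothetical placeholders, not the system's actual configuration.

```python
def classify_gesture(scores, threshold=0.8):
    """Return the best-scoring pre-trained gesture label, or "None"
    when no candidate clears the pre-defined threshold T.

    `scores` maps candidate gesture names to recognition scores in
    [0, 1]; both names and the threshold are illustrative.
    """
    if not scores:
        return "None"  # no motion detected at all
    best_gesture = max(scores, key=scores.get)
    if scores[best_gesture] < threshold:
        return "None"  # low confidence: treat as no gesture
    return best_gesture
```

For example, `classify_gesture({"wave": 0.92, "clap": 0.40})` returns `"wave"`, while `classify_gesture({"wave": 0.55})` falls below T and returns `"None"`.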



Implementation
To evaluate co-presence with virtual humans, we first developed an authoring toolkit to edit the virtual human's responses to the user's interactions [38]. The virtual human needs to generate an interaction response, and it should be possible to control the lip-sync motion with the voice to create conversational situations, as well as gesture-based animation to create appropriate behaviors matching those of a real human. Thus, our authoring toolkit, which provides the virtual human controller, allows us to create the virtual human's responses (e.g., voice-based verbal conversation, facial expressions, and gesture-based non-verbal feedback) and realize a context-aware virtual human for our everyday lives [39]. In other words, it provides a quick pipeline for building interactive virtual humans that support verbal and non-verbal situations (see Figure 3).
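A minimal sketch of such an input-to-response mapping, in the spirit of the authoring toolkit, is shown below; every entry (input labels, speech lines, animation and expression names) is illustrative, not the toolkit's actual data.

```python
# Hypothetical response table: (input type, recognized input) -> response.
RESPONSES = {
    ("verbal", "greeting"): {"speech": "Hello, nice to meet you.",
                             "animation": "wave", "expression": "smile"},
    ("non-verbal", "wave"): {"speech": None,
                             "animation": "wave", "expression": "smile"},
}

def respond(input_type, recognized):
    """Look up the virtual human's response to a recognized user input,
    falling back to an idle behavior for unknown inputs."""
    return RESPONSES.get((input_type, recognized),
                         {"speech": None, "animation": "idle",
                          "expression": "neutral"})
```

The fallback entry mirrors the "None" handling described earlier: an unrecognized input leaves the virtual human in an idle state rather than triggering a spurious reaction.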
In test conditions for immersive 3D displays, the subject wore a head-mounted display (HTC VIVE [40]) to view and interact with the virtual human. For 2D situations, we instead installed the virtual human on a large TV display (an 85-inch screen in portrait mode) to show the same life-sized virtual human that would be presented in a 3D immersive display. The motion datasets of the virtual human, such as the appearances of its facial expressions, its animated motions, and lip-synching with prerecorded sounds, were configured to resemble an actual person using Mixamo and Oculus LipSync in Unity3D [41,42].
In real-time user interaction situations with verbal and non-verbal inputs, the virtual human can recognize the user's context based on fuzzy logic to match the optimal responses to users, and the virtual human's behavior was set up using fuzzy inference [43]. Fuzzy logic is a form of many-valued logic in which the truth values of variables can be any real number between 0 and 1, inclusive. We used a depth camera and a multi-array microphone to capture the user's voice and body gestures under the same sound conditions, with adjustments made so that sound of the same volume and quality was output from the same speaker. To find the user's body motion, skeleton information was obtained based on specific joints of the user [44]. Figure 4 shows examples of a virtual human's lip movements when providing verbal responses. With audio input streams of the user's voice from the installed multi-array microphone, we predicted the lip movements and facial expressions that correspond to a particular voice and animated the virtual human's mouth accordingly [42].
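As a rough illustration of this kind of fuzzy inference, the sketch below fuzzifies a single input with triangular membership functions and picks the response with the highest degree of truth; the input variable, membership ranges, and response labels are all assumptions for illustration, not the system's actual rule base.

```python
def triangular(x, a, b, c):
    """Triangular membership function: degree of truth in [0, 1],
    peaking at b and falling to 0 at a and c."""
    if x <= a or x >= c:
        return 0.0
    if x == b:
        return 1.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def infer_response(gesture_confidence):
    """Map a gesture-confidence input in [0, 1] to a response label by
    taking the rule with the highest activation (hypothetical rules)."""
    rules = {
        "idle":    triangular(gesture_confidence, -0.5, 0.0, 0.5),
        "nod":     triangular(gesture_confidence,  0.2, 0.5, 0.8),
        "applaud": triangular(gesture_confidence,  0.5, 1.0, 1.5),
    }
    return max(rules, key=rules.get)
```

A confident gesture (e.g., confidence 0.9) activates the "applaud" rule most strongly, while a weak signal (0.1) leaves the virtual human idle; overlapping membership functions let intermediate inputs blend smoothly between rules.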
To express the whole-body motion of the virtual human, we selected candidates mapped to the suggested authoring toolkit that dynamically respond to changes in the actual human's motion configuration. In this case, 32 specific joint positions of the user were extracted from the depth camera, and each intended gesture was predicted based on the user's joint position information. Joint information was used to recognize whether the user performed a gesture at a specific point in time. Additionally, to express richer virtual human motion, the virtual human's gaze and eye orientation were automatically adjusted by recognizing the user's location and the locations of sounds in the surrounding environment. Figure 5 shows examples of the motions of the virtual human that we set up as candidates for our experiment. The results of the virtual human responses were expressed in a life-sized form through the given display devices.
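A simple joint-based check of this kind can be sketched as follows; the joint names, the coordinate convention (y up, metres), and the hand-raise rule are hypothetical stand-ins for the 32-joint skeleton reported by the depth camera.

```python
def hand_raised(joints):
    """Detect a hand-raise gesture from depth-camera joint positions:
    true when either wrist is above the head joint.

    `joints` maps hypothetical joint names to (x, y, z) positions in
    metres, with y pointing up.
    """
    head_y = joints["head"][1]
    return (joints["wrist_left"][1] > head_y or
            joints["wrist_right"][1] > head_y)
```

Real gesture classifiers compare whole joint trajectories over time against trained candidate motions, but even this single-frame rule conveys how per-joint positions turn into a discrete gesture decision.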
Figure 5. Examples of virtual human motion that we set up as candidates: the virtual human's motion types consisted of idle, applause, shy, surprised, and so on.
Figure 6 shows an example of the interaction results between the user and the virtual human. This figure shows a scene that provides responses to the user's non-verbal inputs, such as gestures and emotions. As noted earlier, we used the toolkit to match the interactive animation of the virtual human according to the user's input with inference via fuzzy logic. We leave it as future work to provide and match a greater range of motions from daily life. During the experiment, the virtual human was visualized on a personal computer running a 64-bit version of Windows 10.


Experiment to Determine Effects on Co-Presence
We now describe an experiment that examined how people respond to two different conditions, labeled here as display and interaction types. The experiment was designed as a two-factor (two levels each) between-subject measurement. The first factor was display type (a large 2D TV vs. a 3D immersive HMD), and the second factor was the interaction type (verbal vs. non-verbal). Thus, we examined four different configurations, as shown in Table 2; conditions with the duplicate characteristics indicated in Table 1 were excluded, and the 3D and non-verbal situations were selected accordingly. For example, in the 3D-verbal test condition, the subject wore a head-mounted display (HTC Vive) and was allowed only to talk with the virtual human. On the other hand, the 2D-non-verbal condition was designed to allow interaction gestures in front of a life-sized display that visualized the virtual human at the same size used in the 3D condition.

Table 2. The four experimental conditions: the configured factors consisted of display types (large 2D TV vs. 3D immersive HMD) and interaction types (verbal vs. non-verbal).

The four conditions were thus 2D-verbal, 2D-non-verbal, 3D-verbal, and 3D-non-verbal.
In the experiment, to assess the level of user-felt co-presence, the subjects attempted to have a conversation with the interactive virtual human. We adopted a simulated greeting scenario involving the typical conversations and gestures exchanged between two people when a student attends a class at a university, as proposed by Shin and Jo [45]. Because most of the participants in the experiment were college students, each participant played the role of the student, and the virtual human was presented as a teacher. The verbal questions asked or non-verbal gestures performed by the participant and directed toward the virtual human were drawn from given candidates in the common conversation, and the virtual human's verbal answers and non-verbal gestures were inferred via fuzzy logic using the previously introduced toolkit. Each participant was allowed to interact with the virtual human (see Figure 7 for the detailed process). In our experiment, we utilized 50 verbal sentences and seven non-verbal motions.
Forty paid subjects (25 men and 15 women) with a mean age of 27.8 years participated in the between-subject experiment. The participants were divided into four groups of ten each, and the groups were balanced based on prior experience with virtual reality (VR). The experiment was designed as a two-factor experiment (two display types × two interaction types), and every participant experienced only one condition; a within-subject design was not used, in order to avoid learning effects across conditions. Before carrying out the task for the given treatment, the participants were briefed on the overall purpose of the experiment for approximately five minutes, with the virtual teacher leading the conversation or providing a greeting gesture. Each participant then took part in a three-minute greeting interaction with the virtual teacher. Our task simulated the virtual human's interaction situations, allowing participants to give and receive responses to specific questions and behaviors [46]. Participants could receive verbal or non-verbal feedback from the virtual human, such as greetings at the start of a new semester. Our system also provided participants with response candidates for the simulated question and the given gesture of the virtual human, which prompted the related interaction from the subject.
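The even split of participants across the four between-subject conditions can be sketched as below; this simple random assignment is an illustration only and does not reproduce the study's additional balancing by prior VR experience.

```python
import random

def assign_groups(participants, conditions, seed=7):
    """Randomly assign participants evenly across between-subject
    conditions (hypothetical helper; seed fixed for reproducibility)."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    per_group = len(shuffled) // len(conditions)
    return {cond: shuffled[i * per_group:(i + 1) * per_group]
            for i, cond in enumerate(conditions)}
```

With 40 participants and the four conditions named above, each group receives exactly ten subjects, matching the design described in the text.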
If the participant did not act or say anything, the virtual human reacted by speaking, making gestures, and engaging in movements according to the control of an administrator behind a curtain, referred to here as Wizard-of-Oz testing [23]. Upon completion of the task, the participants' sense of co-presence was measured via a survey. Here, we defined co-presence as how much a subject perceived someone to be in the same space [34]. Additionally, we investigated immersive tendencies questionnaire (ITQ) scores for the virtual human under the given conditions using Witmer and Singer's immersive tendencies questionnaire [35]. The following questions were posed: (1) "When interacting with the virtual human, do you become as involved as you would in the actual situation?" and (2) "Did you feel as if you were present with the virtual human in the same space?" The subjects completed a quantitative questionnaire to assess the degrees of immersion and co-presence on a seven-point Likert scale. We also collected qualitative feedback.
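Scoring such seven-point Likert responses per condition reduces to simple averaging; the condition names and ratings below are hypothetical examples, not the study's data.

```python
def condition_means(ratings):
    """Mean seven-point Likert score per condition.

    `ratings` maps a condition name to the list of individual
    participant scores (1-7) collected for that condition.
    """
    return {cond: sum(vals) / len(vals) for cond, vals in ratings.items()}
```

These per-condition means are the quantities compared in the statistical tests of the next section.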

Results and Discussion
First, immersion was judged in situations with an actual person to establish the "ground truth". We also included a baseline case, projecting a video of the actual person onto the display system before starting the experiment in each condition. We compared the immersion scores generated by the different conditions through a one-way ANOVA. The results did not reveal statistically significant main effects regarding the level of immersion in the four conditions, F(2,15) = 32.89, p > 0.05 (see Figure 8). On the other hand, interaction with the virtual human was found to have positive effects on immersion (mean scores above four points in all cases).
Second, two-sample t-tests between the non-verbal and verbal conditions (i.e., 3D-non-verbal vs. 3D-verbal, and 2D-non-verbal vs. 2D-verbal) revealed significant main effects with p < 0.05 regarding the level of co-presence. A pairwise comparison showed a main effect in the statistical analysis (3D-non-verbal vs. 3D-verbal, t(16.35) = 7.48, p < 0.05; 2D-non-verbal vs. 2D-verbal, t(17.967) = −8.05, p < 0.05). Note that a t-test determines whether the means of two sets of data differ significantly from each other [47]. When we plotted the distribution of the sample values, we could infer that the data were approximately normally distributed. We then found a significant difference using the unequal-variance form of the test, which yields more reliable estimates in this situation. Additionally, we used the Wilcoxon signed-rank test as a non-parametric statistical hypothesis test, which is applicable when the distribution is not normal, and obtained the same result. In our results, the level of co-presence clearly differed between the non-verbal and verbal interactions for each configuration (See Figure 9).
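The fractional degrees of freedom reported above (e.g., t(17.967)) are characteristic of the unequal-variance (Welch's) t-test, whose degrees of freedom come from the Welch-Satterthwaite approximation. A minimal sketch of the statistic, assuming this is the variant used (the study's raw ratings are not available, so the function is shown with generic inputs):

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t-test (unequal variances): returns (t, df)."""
    va, vb = variance(a), variance(b)   # sample variances
    na, nb = len(a), len(b)
    se2_a, se2_b = va / na, vb / nb     # squared standard errors
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom,
    # which is generally non-integer (hence reports like t(17.967))
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1)
    )
    return t, df

# Toy example with two small samples (illustrative only)
t, df = welch_t([1, 2, 3, 4], [2, 4, 6, 8])
print(f"t = {t:.3f}, df = {df:.3f}")
```

Given two groups of ten ratings each, `welch_t` returns the t statistic and the fractional degrees of freedom that a report such as t(17.967) = −8.05 refers to; the p-value is then obtained from the t distribution with those degrees of freedom.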
Participants who used the immersive 3D display reported that non-verbal interaction was more effective, whereas in the 2D XR environment, dialogue-based verbal interaction improved the user's sense of being in the same space (co-presence). Through subjective answers in an additional questionnaire, we found that verbal interaction in 2D was more effective in terms of co-presence, as the participants felt more familiar with this setup given their possible experience with video conferencing systems. On the other hand, in the 3D immersive environment, the 3D interaction method (e.g., non-verbal gestures) scored higher than the verbal approach and is expected to be helpful in terms of co-presence. Thus, participants felt that verbal interaction was a familiar form of interaction in the 2D environment, while they commented that gesture-based non-verbal interaction was more suitable for the 3D environment. Moreover, two-sample t-tests between the 2D and 3D conditions (i.e., 2D-verbal vs. 3D-verbal, and 2D-non-verbal vs. 3D-non-verbal) also indicated significant main effects with p < 0.05. The pairwise comparison showed a main effect in the statistical analysis (2D-verbal vs. 3D-verbal, t(16.89) = 7.47, p < 0.05; 2D-non-verbal vs. 3D-non-verbal, t(17.789) = −8.06, p < 0.05). The same statistical analysis as described above was applied. The level of co-presence was significantly different for each configuration in 2D and 3D. As noted earlier, subjects in the immersive 3D display environment using a head-mounted display (HMD) reported that gesture-based non-verbal interaction was most effective. For the 2D environment, we found that dialogue-based verbal interaction was more important with respect to co-presence with the virtual human.
For a more accurate analysis, we also applied a qualitative analysis with feedback from the participants [31]. It was conducted in the form of three short questions and answers. The questions were as follows: "What were your criteria regarding immersion?", "What were your criteria regarding co-presence?", and "What do you think needs to be done to achieve higher co-presence with the virtual human?" In the results, there was a consensus that the overall 3D environment appeared to be highly immersive, but no major difference was noted across conditions, which is consistent with the statistical results. Interestingly, there were several answers to the question concerning co-presence criteria. A few participants mentioned that the 3D interaction method (e.g., gesture-based) felt familiar in the 3D environment, whereas the 2D interaction method, i.e., verbal-based interaction, helped with co-presence in the 2D environment. One participant reported that providing possible interactions all at once (e.g., multimodal interactions) rather than a single input would improve the feeling of co-presence. We leave it as future work to determine whether such multimodal interaction yields this improvement when interacting with a virtual human.
Additionally, some participants mentioned that limitations of the VR equipment prevented a higher sense of co-presence, suggesting improvements such as lighter headsets, more precise interaction, and a more immersive display. Thus, we plan to conduct a future experiment using an immersive projection display space, such as a CAVE with a wide field of view (FOV) [48].
With our results, we presented a preliminary implementation of interaction with a virtual human, partly validating the effectiveness of a future virtual human system. Moreover, we were able to present a setup guideline for the virtual human. Many aspects still need to be improved to provide full guidelines on how to increase co-presence with the virtual human. Additionally, to increase the internal reliability of our study, we hope to provide the same environmental configuration in 2D and 3D; in the present experiment, we reproduced the 2D surrounding environment as closely as possible with 3D modeling. To reduce threats to external validity, we will also need to address hardware factors such as the weight of the head-mounted display in the 3D condition. Furthermore, it will be necessary to evaluate and investigate co-presence by presenting various situations of interaction with a real person, such as a job interview.

Conclusions and Future Works
Recent advances in virtual human technologies will provide more helpful information to assist in real-world tasks and to increase the level of intelligence in our lives. Herein, we presented the effects of display and interaction types on the participants' feeling of co-presence, i.e., the feeling of being together with the perceived virtual human, for the design of virtual human interaction systems. We focused on the co-presence of the virtual human and used an approach centered on perceiving the user's intentions, such as when they used their voice and gestures. In our study, participants who used an immersive 3D display with a head-mounted display (HMD) reported that gesture-based non-verbal interaction was most effective. On the other hand, in the 2D XR environment used here, it was found that dialogue-based verbal interaction improved the user's sense of sharing and being in the same space. These findings have implications for the design of more effective virtual human systems that offer a high sense of co-presence.
In future work, we will continue to explore the effects of other related factors to improve co-presence with the virtual human. We also hope to apply interaction response methods that generate the behavior of the virtual human using AI learning algorithms to improve the quality of interaction, and to employ physiological measures to assess co-presence quantitatively in various situations.