Article
Peer-Review Record

Do Not Freak Me Out! The Impact of Lip Movement and Appearance on Knowledge Gain and Confidence

Multimodal Technol. Interact. 2024, 8(3), 22; https://doi.org/10.3390/mti8030022
by Amal Abdulrahman, Katherine Hopman and Deborah Richards *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 5 February 2024 / Revised: 23 February 2024 / Accepted: 28 February 2024 / Published: 5 March 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

In response to previous reports of the effects of virtual agent characteristics on user experience, the current study examined the effects of virtual agent realism and lip synching on how participants viewed the agents in terms of how eery they were, how they rated them on the Artificial Social Agent questionnaire, their confidence in their ability to apply what they were taught, and any actual changes in knowledge after interacting with the virtual agent. Results indicated that any virtual agent interaction appeared to increase confidence but did not significantly impact knowledge gain. Further, the less realistic agents, and those without lip synching were viewed more positively on average. The paper is relatively clear, the design sound, and the question interesting. Please see the following suggestions and necessary changes for the revision of this manuscript:

1.      Page 1, Lines 11-14: It should be at least briefly noted that there was no significant impact on objective knowledge in any conditions of the study.

2.      Pages 4-5, Lines 155-162: More concrete participant demographics should be included here (e.g. mean age, gender breakdown, etc.).

3.      Page 5, Line 207: Should this be “uncanny valley?”

4.      Page 7, Lines 249-250: Perhaps it should be clarified that the statistics provided in Table 1 are for gender and not cultural background or anything else.

5.      Page 9, Line 270: Is “favoring the cartoon-like Erica” correct here? It feels that it should be stating that either with or without lip-sync should be what is favored.

6.      Discussion: With “realism” being a relative term, is it possible the “realistic” Erica was not realistic enough to produce a strong Uncanny Valley effect? Should more photorealistic models be tested to see if this influences outcomes and impressions? I realize there is mention of “…further research with systematic variation in the levels of appearance…” but this seems directly relevant to the original intended design of the current study and should perhaps be more directly noted.

7.      Figure 4(b) and Page 10, Lines 327-329: The figure seems to indicate more positive ratings for the lip-synching condition, which is contrary to what is stated at this point in the text. Are the labels incorrect for Figure 4(b)?

Author Response

We would like to thank the editor and reviewers for their valuable comments. We provide our responses below.

Reviewer 1

In response to previous reports of the effects of virtual agent characteristics on user experience, the current study examined the effects of virtual agent realism and lip syncing on how participants viewed the agents in terms of how eery they were, how they rated them on the Artificial Social Agent questionnaire, their confidence in their ability to apply what they were taught, and any actual changes in knowledge after interacting with the virtual agent. Results indicated that any virtual agent interaction appeared to increase confidence but did not significantly impact knowledge gain. Further, the less realistic agents, and those without lip syncing were viewed more positively on average. The paper is relatively clear, the design sound, and the question interesting.

RESPONSE: Thank you for your time and kind comments.

Please see the following suggestions and necessary changes for the revision of this manuscript:

  1. Page 1, Lines 11-14: It should be at least briefly noted that there was no significant impact on objective knowledge in any conditions of the study.

RESPONSE: We have updated the sentence to include “(Page 1, Lines 12-13) …, all groups reported no significant increase in knowledge but significant increases in confidence in their knowledge and ability to …”

  2. Pages 4-5, Lines 155-162: More concrete participant demographics should be included here (e.g. mean age, gender breakdown, etc.).

RESPONSE: To indicate the demographics of the group we were targeting/recruiting from, we added the following to section 2.2 Recruitment “(Page 4, Lines 166-168) This pool comprises first-year psychology students who can receive course credit for research participation. The average age of this cohort is typically 21.7 (s.d. 6.747) years [], comprised of around 75% females from a range of cultural backgrounds.”

In the Results section in Section 3.1 Participants, we provided the detail of the actual gender breakdown in Table 1 by group and the cultural background percentages across all groups in the text. We did not do any analysis by gender or cultural group but provided this information as a description only. We did not capture the age of the participants, so we are unable to report that specifically for this study.  

  3. Page 5, Line 207: Should this be “uncanny valley?”

RESPONSE: Yes, thank you. We have corrected the mistake as advised.

  4. Page 7, Lines 249-250: Perhaps it should be clarified that the statistics provided in Table 1 are for gender and not cultural background or anything else.

RESPONSE: We have updated the Table 1 label to “Number of participants and gender distribution among the four experimental groups”

  5. Page 9, Line 270: Is “favoring the cartoon-like Erica” correct here? It feels that it should be stating that either with or without lip-sync should be what is favored.

RESPONSE: Thank you for picking this up. We have changed to “favoring Erica without lip-syncing”.

  6. Discussion: With “realism” being a relative term, is it possible the “realistic” Erica was not realistic enough to produce a strong Uncanny Valley effect? Should more photorealistic models be tested to see if this influences outcomes and impressions? I realize there is mention of “…further research with systematic variation in the levels of appearance…” but this seems directly relevant to the original intended design of the current study and should perhaps be more directly noted.

RESPONSE: We wish to acknowledge your point and have added the following to the discussion “(Page 10, Lines 324-331) As a further point concerning eeriness, it is possible that humanlike Erica was not realistic enough to produce a strong Uncanny Valley effect. However, we were not trying to induce the uncanny valley effect. As explained in the introduction, we had received comments from users of our health-related conversational agents (which includes Erica) concerning character freakiness, particularly relating to lip-syncing. Before deciding to give users the option to switch off lip-syncing, we wanted to check the impact of disabling lip-syncing on our intended outcomes, while also exploring whether a cartoon-like character would influence user experience.”

We have similarly struggled to find the right word to describe “realistic Erica” and have updated our description in the Study Design Section (Page 4, Lines 154-163) as follows:

“The independent variables manipulated were the VA's appearance (humanlike vs. cartoon-like) and lip-syncing (with vs. without). As our key goal was to validate the types of characters we were currently using in a range of different studies which had been made with FUSE, we selected one of the FUSE characters to which we had received comments of creepiness. We refer to this model as realistic, acknowledging that the model is not photo-realistic but rather has human-like features. We then used Ready Player Me to transform our original Erica into a cartoon-like Erica. The Fuse platform uses high-resolution textures and advanced shaders to create lifelike skin, hair, and clothing while Ready Player Me avatars have exaggerated features (e.g. bigger eyes) and simplified textures.”

  7. Figure 4(b) and Page 10, Lines 327-329: The figure seems to indicate more positive ratings for the lip-synching condition, which is contrary to what is stated at this point in the text. Are the labels incorrect for Figure 4(b)?

RESPONSE: Thanks for spotting the labelling mistake. It is fixed now.

Reviewer 2 Report

Comments and Suggestions for Authors

Lip syncing for virtual avatars has been a research topic for quite some time. There are several studies out there that focus on the purpose of this paper (or at least they investigate several parts also presented in this paper). So the strong points of this paper are the experimental methodology and the user study. Both are treated well in the manuscript. Other than this, here are a few issues:

1. I'm not particularly happy with the title. I'd try to make it smaller, while still focusing on the right keywords.

2. The state of the art is decent, although there are recent references missed by the authors (e.g. Peixoto, B., Melo, M., Cabral, L., & Bessa, M. (2021, November). Evaluation of animation and lip-sync of avatars, and user interaction in immersive virtual reality learning environments. In 2021 International Conference on Graphics and Interaction (ICGI) (pp. 1-7). IEEE.).

3. Fig. 3 seems stretched. Fig. 4 a) b) could be bigger.

4. Further research should be extended.

Comments on the Quality of English Language

English is fine.

Author Response

We would like to thank the editor and reviewers for their valuable comments. We provide our responses below.

Reviewer 2

Lip syncing for virtual avatars has been a research topic for quite some time. There are several studies out there that focus on the purpose of this paper (or at least they investigate several parts also presented in this paper). So the strong points of this paper are the experimental methodology and the user study. Both are treated well in the manuscript.

RESPONSE: Thank you for your time and kind comments.

 

Other than this, here are a few issues:

  1. I'm not particularly happy with the title. I'd try to make it smaller, while still focusing on the right keywords.

RESPONSE: Your comment has prompted us to have a lengthy discussion over the title. We preferred to keep the first part of the title “Don’t Freak Me Out!” because it describes the motivation for our work, which was driven by comments from users in our previous studies. However, by focusing on the right keywords as you suggested, we have modified the title, changing “lip synchronisation” to “lip movement” and “visual realism” to “Appearance”. The resulting new title is “Don’t Freak Me Out! The Impact of Lip Movement and Appearance on Knowledge Gain and Confidence”.

  2. The state of the art is decent, although there are recent references missed by the authors (e.g. Peixoto, B., Melo, M., Cabral, L., & Bessa, M. (2021, November). Evaluation of animation and lip-sync of avatars, and user interaction in immersive virtual reality learning environments. In 2021 International Conference on Graphics and Interaction (ICGI) (pp. 1-7). IEEE.).

RESPONSE: Thank you for the reference. We have added the following regarding this.

While researchers have investigated the effect of various VA modalities, including voice (synthetic versus human) and facial expressiveness [20] and the incongruence between the two [18], limited research has examined the effect of the presence or absence of lip syncing (i.e. lip movement versus no lip movement) on user experience and user outcomes (e.g. learning, self-efficacy or health outcomes). An exception is the work in [21], which found that alignment between animation and lip syncing in a VR environment (i.e. the conditions of animation with lip syncing and no animation with no lip syncing) led to slightly better knowledge retention than when the conditions were not aligned, prompting the authors to suggest that greater knowledge retention was potentially related to a lack of distraction in the no animation, no lip syncing condition.

Despite the above findings, lip-sync is a basic animation included in the design of embodied conversational agents (ECAs).

  3. Fig. 3 seems stretched. Fig. 4 a) b) could be bigger.

RESPONSE: These figures have been adjusted.

  4. Further research should be extended.

RESPONSE: We have extended the conclusion section to discuss implications for theory and practice, with further discussion of future research as follows:

Relating to the finding reported in this study, in our upcoming study using realistic Erica, we are providing the option for users to turn off lip movement. That study will involve up to 6 interactions over a 3-week period and we will turn lip-syncing on at the start of each session. We intend to use this data to determine how often people prefer to not have lip movement, whether they care enough to turn it off and to keep turning it off, and to identify if there are any profiles or patterns in who or when they choose to turn off lip movement.

Reviewer 3 Report

Comments and Suggestions for Authors

The study is well-structured and methodologically well-explored. Some information on the application of different methods is missing. The conclusions should also be clearer and more appealing.

Improvement suggestions:

1. Authors note “We developed this estimated 10-12 minute interactive experience using UNITY3D, Salsa LipSync….” I would expect to have more information regarding these tools.

2. Authors state “Participants are then randomly assigned to interact with one of the four versions of Erica.” How do you guarantee a uniform distribution of the students by the four versions if they are randomly assigned?

3. I would expect to have information regarding the structure and questions of the pre-study and post-study questionnaire.

4. Authors note the use of the eeriness questionnaire [40], and the Artificial Social Agent (ASA) questionnaires. The objectives of this approach are clear but not their structure.

5. Authors state “In this study, we asked the participants to report their cultural background, gender and age.” However this information is only provided later and in a descriptive approach. I would expect to have a table with this information.

6. Authors also note “Logfile data from interactions with Erica were collected for the purpose of confirming participants’ engagement and quantifying the duration of their interactions.” How do you measure participants’ engagement?

7. Explain better the concept of “full interaction”.

8. I would expect to have a clear vision regarding the theoretical and practical contributions.

Comments on the Quality of English Language

it is ok. 

Author Response

We would like to thank the editor and reviewers for their valuable comments. We provide our responses below.

Reviewer 3

The study is well-structured and methodologically well-explored. Some information on the application of different methods is missing. The conclusions should also be clearer and more appealing.

RESPONSE: Thank you for your time and kind comments.

Improvement suggestions:

  1. Authors note “We developed this estimated 10-12 minute interactive experience using UNITY3D, Salsa LipSync….” I would expect to have more information regarding these tools.

RESPONSE: We have added footnotes for these tools and clarified that UNITY3D is a game engine in the text (Page 5, Lines 202-203). We further added links in the footnotes to other game tools we used such as Adobe Fuse, and Ready Player Me.

  2. Authors state “Participants are then randomly assigned to interact with one of the four versions of Erica.” How do you guarantee a uniform distribution of the students by the four versions if they are randomly assigned?

RESPONSE: We have added the following to the Procedure Section “(Page 4, Lines 179-181) We used the 'distribute evenly' randomisation feature in the Qualtrics survey software to ensure equal numbers in each group. We didn’t use stratified allocation to groups as we were unable to control who selected our study.”

To clarify why the data used in the analysis are not evenly distributed across the groups, we added the phrase “resulting in unequal numbers of participants in each of the four conditions” to the results, as follows: “(Page 7, Lines 250-263) Although all participants completed the study with the post-interaction questionnaires, only 152 out of the 220 fully interacted with their assigned Erica. Those who did not complete a full interaction (n=68) were deemed ineligible for analysis, resulting in unequal numbers of participants in each of the four conditions. Additionally, 2 out of the 152 participants failed the attention check, resulting in 150 participants being deemed eligible for the analysis. The distribution of the participants across the four groups, along with their gender, is presented in Table 1.”
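As an illustration only, the idea of an evenly distributing randomiser can be sketched as follows. The response does not describe how the Qualtrics feature works internally, so the rule used here (send each arriving participant to a condition with the fewest assignments so far, breaking ties at random) is an assumption, and the condition labels and the function name `assign_evenly` are hypothetical.

```python
import random

# Hypothetical labels for the four versions of Erica (2 x 2 design).
CONDITIONS = [
    "humanlike, lip-sync",
    "humanlike, no lip-sync",
    "cartoon-like, lip-sync",
    "cartoon-like, no lip-sync",
]


def assign_evenly(n_participants, conditions, rng):
    """Assign each arriving participant to a least-filled condition.

    Ties between equally filled conditions are broken at random, so the
    order of assignment is unpredictable while group sizes never differ
    by more than one. This is an assumed rule, not Qualtrics' algorithm.
    """
    counts = {c: 0 for c in conditions}
    assignments = []
    for _ in range(n_participants):
        fewest = min(counts.values())
        candidates = [c for c in conditions if counts[c] == fewest]
        choice = rng.choice(candidates)
        counts[choice] += 1
        assignments.append(choice)
    return assignments, counts


rng = random.Random(0)  # fixed seed so the sketch is repeatable
assignments, group_sizes = assign_evenly(220, CONDITIONS, rng)
print(group_sizes)
```

Under this rule, 220 participants split into four groups of 55 at assignment time. Because exclusions are applied after assignment, the analysed groups can still end up unequal, which matches the explanation given in the response.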

  3. I would expect to have information regarding the structure and questions of the pre-study and post-study questionnaire.

RESPONSE: We have sought to improve the organization and flow of the methods section to clarify better the structure and questions. Figure 1 provides the overall structure of the study. We have added a reference to the figure in the procedure section, added references to the subsections indicating where the details are found and moved the description about the interaction to the dialogue subsection where it sits better.

 

  4. Authors note the use of the eeriness questionnaire [40], and the Artificial Social Agent (ASA) questionnaires. The objectives of this approach are clear but not their structure.

RESPONSE: We have added a sample question from the ASA questionnaire in that subsection in the methods section.

For the Eeriness questionnaire, we describe the scale and, as we indicate, the actual poles used in the survey are shown in Figure 3. As an improvement, we added to the paper: 1) the numbers presented to the users on the scales to indicate the negative, positive and neutral positions on the scales, and 2) the phrase “The items were presented in a random order to the participants to control for order bias.” at the end of Section 2.5.3.

For the ASA questionnaire (Section 2.5.4), we added the following sentence at the end of the section: “As an example, humanlike appearance is measured with the item "Erica has the appearance of a human" on the 7-point Likert scale. Participants receive the 24 items in a random order.”

  5. Authors state “In this study, we asked the participants to report their cultural background, gender and age.” However this information is only provided later and in a descriptive approach. I would expect to have a table with this information.

RESPONSE: To indicate the demographics of the group we were targeting/recruiting from, we added the following to Section 2.2 Recruitment “Based on previous studies we have conducted, the average age of this cohort is typically 21.7 (s.d. 6.747) years, comprised of around 75% females from a range of cultural backgrounds.”

In the Results section in Section 3.1 Participants, we provided the detail of the actual gender breakdown in Table 1 by group and the cultural background percentages across all groups in the text. We did not do any analysis by gender or cultural group, but provide this information as a description only. We did not capture the age of the participants, so we are unable to report that specifically for this study.  

  6. Authors also note “Logfile data from interactions with Erica were collected for the purpose of confirming participants’ engagement and quantifying the duration of their interactions.” How do you measure participants’ engagement?

RESPONSE: We have replaced our mention of engagement with the following in Section 2.5.5 Logfile Data:

“We collected all participant responses with Erica and the duration of their interaction. For the purpose of the data analysis we undertook, we used the logfile data to determine whether participants had actually interacted with Erica or not. Only participants who completed their interaction with Erica were included in the data analysis. It's crucial to ensure participants reviewed the emotion regulation strategies with Erica. This way, any change in their knowledge or confidence can be confidently linked to their interaction with the agent.”
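The eligibility rule described here, combined with the attention check mentioned in the results, can be sketched as a simple two-step filter. This is a minimal illustration, not the authors' analysis code: the record fields `reached_end` and `passed_attention_check` are hypothetical names, and the records are synthetic, built only to reproduce the reported totals (220 participants, 68 incomplete interactions, 2 failed attention checks).

```python
from dataclasses import dataclass


@dataclass
class Participant:
    pid: int
    reached_end: bool             # hypothetical flag derived from the logfile
    passed_attention_check: bool  # hypothetical flag from the questionnaire


def is_eligible(p):
    """A participant is analysed only if both criteria hold."""
    return p.reached_end and p.passed_attention_check


# Synthetic records: the first 68 ids did not reach the end of the
# conversation, and ids 68 and 69 completed it but failed the check.
participants = [
    Participant(
        pid=i,
        reached_end=(i >= 68),
        passed_attention_check=not (68 <= i < 70),
    )
    for i in range(220)
]

completed = [p for p in participants if p.reached_end]
analysed = [p for p in completed if p.passed_attention_check]

print(len(participants), len(completed), len(analysed))
```

Applying the completion filter first and the attention check second mirrors the order in which the response reports the exclusions (220 to 152 to 150).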

  7. Explain better the concept of “full interaction”.

RESPONSE: We modified “fully interacted with their assigned Erica” to “got to the end of the conversation with their assigned Erica.”

  8. I would expect to have a clear vision regarding the theoretical and practical contributions.

The study presented in this paper explores and contributes to our understanding of the perception of VAs and their impact on behaviour change, focusing on two design aspects: eeriness perception and user-agent interaction experience. The absence of significant between-group differences suggests that appearance and lip-syncing did not distinctly affect eeriness perception. The study also uncovered that while lip-syncing did not significantly impact eeriness perception, it did negatively influence the user-agent interaction experience in various dimensions, indicating that the inclusion of lip-syncing may disrupt the overall perception of the VA. This is noteworthy because lip-syncing is a common VA animation perhaps due to an assumption that it is expected by the user and/or that it is beneficial to the interaction experience.

The findings of this study have implications for both theory and practice relating to the design and use of conversational agents. From a theoretical perspective, our work confirms the importance of congruence and realism with respect to virtual agents, i.e. while lip movement is normal and expected in humans, lip movement and realistic appearance are not required in virtual humans. In fact, as reported by others, eeriness and the uncanny valley are associated with a high level of realism. These findings go against a trend to increase realism both in appearance and lip movement in virtual agents, and virtual reality models in general. From a practical perspective, while users have preferences regarding appearance and lip movement, these preferences do not necessarily impact on the intended outcomes and benefits to humans. These findings suggest that developers of virtual agents should focus more on the intended benefits of the virtual agent for the human, rather than on measurements such as believability, naturalness or liking. Specifically, VA developers should reconsider whether lip movement is included and/or whether they should allow the user to choose to switch it on or off.

Participants did not experience a substantial change in knowledge but exhibited increased confidence following interaction with the VA. The study highlights the importance of considering factors beyond knowledge acquisition, such as confidence, in evaluating the effectiveness of VA interactions in the education context.

Relating to the finding reported in this study, in our upcoming study using realistic Erica, we are providing the option for users to turn off lip movement. That study will involve up to 6 interactions over a 3-week period and we will turn lip-syncing on at the start of each session. We intend to use this data to determine how often people prefer to not have lip movement, whether they care enough to turn it off and to keep turning it off, and to identify if there are any profiles or patterns in who or when they choose to turn off lip movement.

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

The revision work was well done. 

Comments on the Quality of English Language

it is ok. 
