Article

Speech Intelligibility in Virtual Avatars: Comparison Between Audio and Audio–Visual-Driven Facial Animation

Federico Cioffi, Massimiliano Masullo, Aniello Pascale and Luigi Maffei
1 Department of Architecture and Industrial Design, Università degli Studi della Campania “Luigi Vanvitelli”, 81031 Aversa, CE, Italy
2 Immensive s.r.l.s., 81030 Parete, CE, Italy
* Authors to whom correspondence should be addressed.
Acoustics 2025, 7(2), 30; https://doi.org/10.3390/acoustics7020030
Submission received: 28 March 2025 / Revised: 6 May 2025 / Accepted: 21 May 2025 / Published: 23 May 2025

Abstract

Speech intelligibility (SI) is critical for effective communication across various settings, although it is often compromised by adverse acoustic conditions. In noisy environments, visual cues such as lip movements and facial expressions, when congruent with auditory information, can significantly enhance speech perception and reduce cognitive effort. With the ever-growing diffusion of virtual environments, communicating through virtual avatars is becoming increasingly prevalent, thus requiring a comprehensive understanding of these dynamics to ensure effective interactions. The present study used Unreal Engine’s MetaHuman technology to compare four methodologies for creating facial animation: MetaHuman Animator (MHA), MetaHuman LiveLink (MHLL), Audio-Driven MetaHuman (ADMH), and Synthesized Audio-Driven MetaHuman (SADMH). Thirty-six words (18 rhyming pairs) from the Diagnostic Rhyme Test (DRT) were used as input stimuli to create the animations and to compare them in terms of intelligibility. Moreover, to simulate a challenging background noise, the animations were mixed with a babble noise at a signal-to-noise ratio of −13 dB (A). Participants assessed a total of 144 facial animations. Results showed the ADMH condition to be the most intelligible among the methodologies used, probably due to the enhanced clarity and consistency of the generated facial animations, which eliminate distractions such as micro-expressions and natural variations in human articulation.

1. Introduction

Communication is an essential aspect of human interaction, occurring both in written and spoken forms. When engaging in verbal communication, listeners do not rely solely on auditory input but also interpret visual cues, such as lip movements, to enhance speech perception and differentiation [1,2]. Research in neuroimaging has confirmed the multimodal nature of speech processing by identifying activation in the superior temporal sulcus, a brain region associated with integrating multiple sensory inputs [3]. Additionally, studies suggest that facial movements influence activity in the auditory cortex in both humans and primates [4]. The significance of the congruency of auditory and visual cues is further supported by the McGurk effect, which demonstrates how mismatched inputs can lead to an altered phoneme perception [5]. Beyond aiding speech recognition, visual cues help to decrease cognitive effort during listening, making speech comprehension more efficient [6]. Given the limitations of cognitive capacity, challenging listening conditions demand greater effort for speech processing, thereby reducing the mental resources available for other tasks, such as memorization and critical thinking [6]. This issue is particularly relevant in educational environments, where excessive listening effort can contribute to cognitive fatigue, impairing concentration and academic performance. In learning settings, for example, communication primarily occurs through lectures, where teachers convey information to students orally. However, education buildings do not always provide optimal listening conditions. Well-known problems relate to background noise [7], chattering among students [8], and inadequate room acoustics [9]. In the presence of such difficulties, visual cues such as lip movements can greatly improve the intelligibility of speech and help listeners to enhance their comprehension capabilities [10].
Speech intelligibility (SI) is a critical factor in ensuring that verbal exchanges are clear, efficient, and comprehensible. However, SI is frequently challenged by adverse acoustic conditions, such as background noise, reverberation, speaker–listener distance, and even individual listener factors, including hearing impairments or language proficiency [11,12]. University classrooms, for instance, often present difficult acoustic conditions, making it challenging for students to follow lectures and process complex information effectively [12]. Given the cognitive demands of learning, poor SI can significantly impact information retention, engagement, and overall academic performance, underscoring the need for innovative assessment methods and enhancement strategies. Given its importance, visual speech cues should always be incorporated in studies dealing with SI using video-based materials. In 2010, a study examined whether lip movement and hand gesture help to improve native English speakers’ ability to comprehend Japanese [13]. Similarly, another study highlighted the importance of lip-sync in supporting foreign language pronunciation practice [14].
Furthermore, recent advancements in human–computer interaction and audio–visual speech synthesis have facilitated the creation of virtual avatars capable of replicating realistic human speech. Facial animations in 3D have shifted from handcrafted, rule-based systems to data-driven machine-learning methods capable of realistically replicating articulatory facial movements. Articulatory models aim to simulate the behavior of anatomical structures such as lips, jaw, and tongue, enabling high-fidelity speech-driven facial animation by extracting parameters like lip height, width, and lip protrusion [15]. This method ensures that facial movements are not just visually plausible but also reflect linguistically significant features, including coarticulation effects and timing relations between audio and articulation.
More recently, image- and video-based techniques have used facial landmarks to animate models [16] and deep-learning methods to generate riggable 3D faces from a single image [17]. In parallel, the field has evolved toward neural speech synthesis systems that drive facial animation in a purely data-driven manner: end-to-end deep-learning models convert text to waveform audio with realistic prosody and timing, and by jointly modeling speech features and facial keypoints they enable the generation of smooth, temporally coherent 3D facial animations from raw audio [18,19].
Virtual tutors, for instance, have been employed to support science education in children [20] and to enhance pronunciation training in language learning and rehabilitation settings [21]. These avatars provide a highly controlled yet ecologically valid framework for studying speech perception, offering dynamic lip-sync animations and realistic facial movements that closely mimic natural human speech [22].
Unlike traditional video recordings, which are limited in flexibility, virtual avatars allow researchers to manipulate various parameters, such as speaker distance, lighting conditions, and background noise levels, enabling comprehensive investigations of how different listening environments affect SI. Studies have demonstrated that the presence of virtual human speakers delivering synchronized audio–video cues can significantly improve SI scores, particularly in challenging listening environments where auditory input alone may be insufficient [23]. The use of virtual avatars as an alternative to traditional audio-only listening experiments has already proven to be a valuable tool for research on speech communication [14].
The present study utilized Unreal Engine’s MetaHuman technology [24] to identify optimal methodologies for the creation of intelligible avatars to be used in simulated real-world contexts, ranging from educational environments to public spaces and workplace interactions. To this aim, the present study integrated speech intelligibility evaluation instruments to determine which of the methodologies used can give the most beneficial outcomes in terms of avatars’ facial animation reliability. The integration of SI into virtual avatar research not only enhances the accuracy and ecological validity of assessments but also offers the basis for a versatile platform for developing innovative interventions to improve communication outcomes in noisy settings.

2. Materials

2.1. Virtual Scene and Avatar Creation

In 2021, Epic Games released MetaHuman Creator [24], a platform that aids the creation of photorealistic human avatars featuring detailed textures, high-quality 3D models, and complex face deformations. The process of animating virtual avatars has changed over the years, using different approaches. In particular, the method involving blendshapes (per-vertex deformation targets blended on top of a base mesh to drive complex facial movements) has evolved rapidly in recent years, with the possibility of controlling such information also by means of deep learning [25,26]. MetaHumans are designed to support the 52 blendshapes standardized by Apple for ARKit, which represent the main facial movements and expressions used in face tracking. These 52 blendshapes cover a wide range of movements, such as eye closure, eyebrow raising, smiling, mouth opening, and lip and cheek movements, ensuring detailed and realistic facial animation. Thanks to this native compatibility, MetaHumans can be used directly with facial capture systems such as LiveLink Face for iPhone, without the need for custom mappings, ensuring a direct match between the captured data and the MetaHuman’s morph targets. In our study, a virtual avatar was created in the MetaHuman Creator platform. Once created on the online platform, the avatar can be exported directly into an existing Unreal Engine project, ready to be animated. Furthermore, an empty white room was modelled in 3D Studio Max 2024. The model was imported directly into Unreal Engine 5 to create the virtual scenario for the experiment.
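To make the blendshape mechanism concrete, the sketch below applies a single frame of ARKit-style blendshape weights to a toy mesh using the standard linear blendshape model. This is only an illustration of the kind of data that face-tracking systems such as LiveLink Face stream, not MetaHuman code; the weights, vertices, and deformation offsets are invented for the example.

```python
import numpy as np

# A single frame of face-tracking output: ARKit-style blendshape names
# mapped to activation weights in [0, 1]. Values here are illustrative only.
frame_weights = {
    "jawOpen": 0.62,
    "mouthFunnel": 0.18,
    "mouthSmileLeft": 0.05,
    "mouthSmileRight": 0.05,
    "eyeBlinkLeft": 0.0,
    "eyeBlinkRight": 0.0,
}

# Toy geometry: a neutral mesh and per-blendshape target meshes (same vertex count).
neutral = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
targets = {
    "jawOpen":     neutral + np.array([[0.0, -0.3, 0.0]] * 3),
    "mouthFunnel": neutral + np.array([[0.0, 0.0, 0.1]] * 3),
}

def apply_blendshapes(neutral, targets, weights):
    """Classic linear blendshape model: v = n + sum_i w_i * (t_i - n)."""
    result = neutral.copy()
    for name, w in weights.items():
        if name in targets and w > 0.0:
            result += w * (targets[name] - neutral)
    return result

deformed = apply_blendshapes(neutral, targets, frame_weights)
print(deformed)
```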

2.2. Audio and Visual Stimuli Acquisition

To create the input materials to use in the test, the Diagnostic Rhyme Test (DRT) was taken into consideration as a well-known method used to assess speech clarity. The test was developed in Italian by the Fondazione Ugo Bordoni [27]; it consists of 210 disyllabic word pairs, distinguished by trait (Nasalità, Continuità, Stridulità, Coronalità, Anteriorità, Sonorità) and presented as two rhyming words, in which the initial consonant is changed. Only the first word of each pair is representative of the specific trait (see Table 1). Similar trait classification categories (voicing, nasality, sustention, sibilation, graveness, and compactness) are used for the English version of the DRT.
This test methodology has already been used in previous research to investigate speech intelligibility in primary schools and the effects of different kinds of noise [28], and to evaluate the listening efficiency of young students and teachers in classrooms [29]. From the DRT word list, 18 pairs were chosen to create both the auditory and visual stimuli for driving the animations in the test. For each pair, two recordings were made separately, one for the first word of the pair and one for the second, for a total of 36 words. All the stimulus words were preceded by a carrier phrase: “Adesso diremo la parola…” (“Now we will say the word…”).
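As a minimal sketch of how the textual stimuli are assembled, the snippet below prepends the carrier phrase to each word of a pair; the pairs shown are the examples given in Table 1, standing in for the 18 pairs actually selected.

```python
# Illustrative subset of DRT word pairs (the examples given in Table 1);
# the actual experiment used 18 pairs, i.e., 36 target words in total.
word_pairs = [
    ("Nido", "Lido"),     # Nasalità
    ("Riso", "Liso"),     # Continuità
    ("Cina", "China"),    # Stridulità
    ("Nesso", "Messo"),   # Anteriorità
    ("Sisma", "Scisma"),  # Coronalità
    ("Vino", "Fino"),     # Sonorità
]

CARRIER = "Adesso diremo la parola"   # "Now we will say the word"

# One recorded sentence per word: carrier phrase + target word.
stimuli = [f"{CARRIER} {word}" for pair in word_pairs for word in pair]

for sentence in stimuli:
    print(sentence)
```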
For the creation of experimental stimuli, an audio–video recording setup was prepared. The setup consisted of the following:
  • Audio apparatus: A Rode NTG2 microphone connected to a Zoom H5 recorder was used for audio capture;
  • Video apparatus: An iPhone 14 Pro was used as the camera for video capture by means of the LiveLink app. The LiveLink app was used for (A) live-streaming face captures onto the MetaHuman and (B) recording and extracting videos from the actor’s face captures to be further processed and attached onto the MetaHuman in Unreal Engine (see Figure 1).
The recording took place in the test room of the Sens-i lab, in the Architecture Department of the Università degli Studi della Campania “Luigi Vanvitelli”.
Recorded audio and video materials were used to create several facial animations of the MetaHuman through four different methodologies (see the flowchart in Figure 2):
  • MetaHuman LiveLink (MHLL): The actor’s performances were captured through the LiveLink app and live streamed directly onto the MetaHuman as the acquisition proceeded. While streaming, the captures were also saved as videos;
  • MetaHuman Animator (MHA): The actor’s performances captured through the LiveLink app were saved as MetaHuman performance files. Unlike the first methodology, these were not live streamed but imported into Unreal Engine, where they were processed using the MetaHuman Animator plugin. The plugin tracks facial landmarks (key points of the actor’s face such as the eyes, eyebrows, nose, mouth, and jaw) and translates them into animation data. These data are then converted into animation curves, which drive the MetaHuman’s control rig. Once attached to the MetaHuman, all the acquisitions were saved as video files;
  • Audio-Driven Animation for MetaHuman (ADMH): From the MetaHuman performance files, only the audio of the human actor’s performance was extracted and used to create facial animations through the Audio-Driven Animation for MetaHuman plugin, which processes audio files alone into realistic facial animations. Once attached to the MetaHuman, all the acquisitions were saved as video files;
  • Synthetic Audio-Driven Animation for MetaHuman (SADMH): For this condition, a 1-minute-long speech audio file from the same human actor was first fed into the ElevenLabs software (Eleven Turbo 2.5) [30] to clone his voice with AI technology. Once the voice cloning was complete, the written form of the stimuli (carrier phrase and target word) was uploaded to ElevenLabs to obtain the synthesized voice version of the stimuli. Finally, the synthesized stimulus audio files were used in the abovementioned Audio-Driven Animation for MetaHuman plugin for the last animation methodology.
Furthermore, a babble noise sound was used as background noise. All the sounds were calibrated using an HSU III.2 artificial head. The DRT stimuli were calibrated at 60 dB (A) to simulate normal speech at 1 m distance [31], whereas the babble noise was calibrated at approximately 73 dB (A), resulting in a signal-to-noise ratio of approximately −13 dB (A), which can be considered a moderate noise disturbance ratio [23].
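The calibration levels above imply a negative signal-to-noise ratio: with speech at about 60 dB (A) and babble at about 73 dB (A), SNR ≈ 60 − 73 = −13 dB (A). The numpy sketch below shows one way a noise track could be rescaled to reach such a target SNR before mixing; the placeholder signals and the RMS-based level estimate are assumptions for illustration and do not reproduce the calibration performed with the artificial head.

```python
import numpy as np

def rms(x: np.ndarray) -> float:
    """Root-mean-square amplitude of a signal."""
    return float(np.sqrt(np.mean(x ** 2)))

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, target_snr_db: float) -> np.ndarray:
    """Scale `noise` so that 20*log10(rms(speech)/rms(scaled_noise)) equals target_snr_db,
    then return the mixture (assumes equal length and sample rate)."""
    current_snr_db = 20.0 * np.log10(rms(speech) / rms(noise))
    gain = 10.0 ** ((current_snr_db - target_snr_db) / 20.0)
    return speech + gain * noise

# Illustrative signals standing in for a DRT sentence and the babble noise.
fs = 48_000
t = np.arange(fs * 5) / fs                      # about 5 s, roughly one stimulus
speech = 0.1 * np.sin(2 * np.pi * 220 * t)      # placeholder "speech"
noise = 0.05 * np.random.default_rng(0).standard_normal(t.shape)

mixture = mix_at_snr(speech, noise, target_snr_db=-13.0)   # 60 dB(A) speech vs. 73 dB(A) babble
print(round(20 * np.log10(rms(speech) / rms(mixture - speech)), 1))  # ≈ -13.0
```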
A total of 144 video stimuli (36 words × 4 conditions) were prepared for the test and loaded into the PsychoPy software (version 2024.2.3) [32] for playback, randomization, and scoring collection. Each stimulus lasted about 5 s (some examples can be found in the Supplementary Materials).
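A minimal sketch of how such a randomized two-alternative forced-choice presentation could look in PsychoPy is given below. The file names, the invented rhyming alternative, the window settings, and the response keys are assumptions for illustration, not the actual experiment script; the MovieStim3 playback loop follows the classic PsychoPy documentation pattern and may need to be swapped for visual.MovieStim depending on the installed version.

```python
import csv
import random

from psychopy import core, event, visual
from psychopy.constants import FINISHED

# Hypothetical trial list: (video file, left option, right option, correct key).
# The real test comprised 144 videos (36 words x 4 animation conditions).
trials = [
    ("ADMH_CIONCA.mp4", "CIONCA", "GIONCA", "left"),
    ("MHLL_CIONCA.mp4", "CIONCA", "GIONCA", "left"),
    # ...
]
random.shuffle(trials)                     # random presentation order

win = visual.Window(fullscr=True, color="black")
results = []

for video, left_word, right_word, correct in trials:
    movie = visual.MovieStim3(win, video)  # animation mixed with babble noise
    while movie.status != FINISHED:        # draw frames until playback ends
        movie.draw()
        win.flip()

    prompt = visual.TextStim(win, text=f"{left_word}        {right_word}")
    prompt.draw()
    win.flip()
    key = event.waitKeys(keyList=["left", "right"])[0]   # two-alternative forced choice
    results.append({"video": video, "response": key, "correct": key == correct})

with open("responses.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["video", "response", "correct"])
    writer.writeheader()
    writer.writerows(results)

win.close()
core.quit()
```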

3. Methodology

3.1. Participants

Thirty-five participants (15 females; age M = 29.1 years, SD = 6.6) were recruited among the students and personnel of the Department of Architecture and Industrial Design at the Università degli Studi della Campania “Luigi Vanvitelli”. Participants were tested for normal hearing capabilities before taking the test. All participants gave their written consent to take part in the study. The study was carried out in conformity with the experimental protocol approved by the local Ethics Committee of the Department of Architecture and Industrial Design.

3.2. Procedure

The experiment was carried out in the test room of the Sens-i lab [33], in the Architecture Department of the Università degli Studi della Campania “Luigi Vanvitelli”. A laptop was positioned in the center of the room. First, participants were asked to test their auditory capabilities using the Sennheiser Hearing Test app. None of the participants were excluded from the test. Participants were asked to sit in front of the laptop screen, wear headphones (Sennheiser HD 200, Sennheiser, Wedemark, Germany), and press the spacebar whenever they felt ready to start the experiment. At this point, by means of the PsychoPy software, participants evaluated all 144 stimuli using the keyboard in front of them. After each stimulus was played, the screen prompted a choice between two possible words (see Figure 3). Participants could choose their response by pressing the left or right key on the keyboard according to the word that they thought the avatar had pronounced. Participants could take a break whenever they wanted. Each session lasted about 45 min.

4. Results

A first overall repeated-measures ANOVA was conducted to test whether there was an effect of the animation method used on the intelligibility of the avatar. The analysis design featured one within-subjects variable: Animation Type (MHA, MHLL, ADMH, and SADMH). The dependent variable was the mean intelligibility score for the 36 words provided for each animation type. The formula provided by Bonaventura [27], adopted in most studies using the DRT [34,35], was applied to calculate the intelligibility scores and to correct for participants’ random guesses:
S = 100 × (T − 2W) / T
  • S = intelligibility score;
  • W = number of wrong answers;
  • T = total number of answers.
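As a worked example of this correction for guessing: with T = 36 answers in one condition and W = 10 wrong answers, S = 100 × (36 − 2 × 10)/36 ≈ 44.4. A minimal Python version of the same formula is sketched below; the counts are illustrative, not taken from the data.

```python
def drt_score(total_answers: int, wrong_answers: int) -> float:
    """DRT intelligibility score corrected for guessing: S = 100 * (T - 2W) / T.
    With two alternatives, a purely random responder (W close to T/2) scores near 0."""
    return 100.0 * (total_answers - 2 * wrong_answers) / total_answers

# Illustrative example: 36 answers in one animation condition, 10 of them wrong.
print(drt_score(total_answers=36, wrong_answers=10))  # ≈ 44.4
```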
Descriptive analysis showed the ADMH condition to have the highest intelligibility score (M = 50.114, SD = 16.210), followed by MHA (M = 38.400, SD = 17.078), SADMH (M = 33.829, SD = 19.943), and MHLL (M = 20.686, SD = 16.963). Within-subjects effects showed a statistically significant main effect of the type of animation used: F (3,102) = 22.550, p ≤ 0.001, η2p = 0.399.
The results of the Bonferroni post hoc test showed that the ADMH condition was the most intelligible animation type when compared to the MHA condition (M = 11.71, SE = 3.372, p = 0.009), SADMH condition (M = 16.286, SE = 3.168, p ≤ 0.001), and the MHLL condition (M = 29.429, SE = 3.138, p ≤ 0.001) (see Figure 4).
A second repeated-measures ANOVA was conducted to consider the trait distinctions in the analysis and their effects on the intelligibility of the animations. The analysis design featured two within-subjects variables: Animation Type (MHA, MHLL, ADMH, and SADMH) and Trait (Nasalità, Continuità, Stridulità, Coronalità, Anteriorità, Sonorità). The dependent variable was the mean score for each trait, meaning that only the first word of each pair was considered for the calculations (see Table 1). Descriptive analysis showed that ADMH had the highest scores (M = 72.540, SD = 28.472), followed by MHA (M = 67.302, SD = 27.468), SADMH (M = 63.175, SD = 31.250), and MHLL (M = 58.413, SD = 31.876).
Within-subjects effects showed the main effect of the Animation Type: F (3,102) = 11.287, p ≤ 0.001, η2p = 0.249; and of the Trait: F (5,170) = 10.665, p ≤ 0.001, η2p = 0.239. Furthermore, an interaction effect was found for the Animation Type × Trait: F (15,510) = 6.828, p ≤ 0.001, η2p = 0.167 (see Table 2).
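A minimal sketch of how a comparable two-way repeated-measures ANOVA could be run with statsmodels is shown below, assuming a long-format table with one mean score per participant, animation type, and trait; the file name and column names are assumptions for illustration, and Bonferroni post hoc comparisons would be computed separately (e.g., paired t-tests with corrected p-values).

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Long-format data: one row per participant x animation type x trait,
# with the mean DRT score as the dependent variable (hypothetical file and columns).
df = pd.read_csv("drt_scores_long.csv")   # columns: participant, animation, trait, score

anova = AnovaRM(
    data=df,
    depvar="score",
    subject="participant",
    within=["animation", "trait"],        # the two within-subjects factors
).fit()

print(anova)   # F, df, and p-values for Animation Type, Trait, and their interaction
```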
For the Animation Type, the Bonferroni post hoc test showed the MHA condition to be more intelligible than the MHLL condition (M = 8.938, SE = 2.523, p = 0.007), and the MHLL condition was shown to be less intelligible than the ADMH condition (M = −14.190, SE = 2.581, p ≤ 0.001). Moreover, the ADMH condition was significantly more intelligible than the SADMH condition (M = 9.371, SE = 2.425, p ≤ 0.003) (see Figure 5).
For the traits, descriptive analysis shows Anteriorità to be the most intelligible among the traits considered (M = 72.619, SD = 27.069), followed by Coronalità (M = 70.238, SD = 28.191), Continuità (M = 67.381, SD = 27.961), Sonorità (M = 67.143, SD = 30.756), Stridulità (M = 65.238, SD = 32.143), and Nasalità (M = 49.524, SD = 29.533). The Bonferroni post hoc test showed that the Nasalità trait was significantly less intelligible than all other traits—more specifically, when compared with Continuità (M = −17.90, SE = 3.870, p ≤ 0.001), Stridulità (M = −15.736, SE = 3.607, p = 0.002), Coronalità (M = −20.757, SE = 3.683, p ≤ 0.001), Anteriorità (M = −23.171, SE = 3.540, p ≤ 0.001), and Sonorità (M = −17.664, SE = 3.599, p ≤ 0.001) (see Figure 6).
Considering the Animation Type × Trait interaction, the Bonferroni post hoc test for the Continuità trait showed a statistically significant difference in the MHLL condition, being less intelligible than the ADMH (M = −16.257, SE = 5.373, p = 0.028) and the SADMH (M = −17.200, SE = 6.029, p = 0.044) conditions.
For the Stridulità Trait, the Bonferroni post hoc test showed the MHLL condition to be significantly less intelligible than all other conditions—specifically, with the MHA condition (M = −34.371, SE = 7.580, p ≤ 0.001), with the ADMH condition (M = −42.943, SE = 6.788, p ≤ 0.001), and with the SADMH condition (M = −43.00, SE = 6.369, p ≤ 0.001).
For the Anteriorità trait, the Bonferroni post hoc test showed the MHLL condition to be more intelligible than the ADMH (M = 17.200, SE = 5.707, p = 0.029).
For the Sonorità trait, the Bonferroni post hoc test showed the MHA to be more intelligible than the SADMH condition (M = 28.657, SE = 6.743, p ≤ 0.001); also, the ADMH was shown to be more intelligible than the SADMH condition (M = 36.257, SE = 6.903, p ≤ 0.001). Furthermore, the MHLL condition was more intelligible than SADMH (M = 20.943, SE = 7.277, p = 0.041).
No statistical significance emerged between the four Animation Types and the traits Nasalità and Coronalità (see Figure 7).

5. Discussion

The present study aimed to identify optimal facial animation methodologies for avatars. To this aim, several Epic Games MetaHuman methodologies were employed for the animations, using audio–visual (MHA and MHLL) and audio-only inputs (ADMH and SADMH). The Diagnostic Rhyme Test was chosen to provide the input stimuli fed into the different methodologies and to serve as the evaluation tool for the produced animations. The results indicate that the ADMH condition may offer advantages in enhancing the speech intelligibility of virtual avatars under challenging listening conditions. The higher intelligibility scores observed for the ADMH method may be attributable to its ability to produce consistent and clearly defined lip movements.
On the other hand, the analysis of trait distinctions (Nasalità, Continuità, Stridulità, Coronalità, Anteriorità, and Sonorità) reveals that not all the stimuli benefit equally from the methodology used. For instance, the words associated with the traits Continuità and Stridulità, which represent, respectively, continuous outgoing airflow and higher-pitched noise, were shown to be significantly less intelligible in the MHLL condition. Moreover, the MHLL condition performed better than the ADMH condition when considering the trait Anteriorità, i.e., when producing a consonant in which the alveolar region is either more or less obstructed. Furthermore, when considering the trait Sonorità, associated with the vibration of vocal cords, the synthetic voice condition performed worse than all other conditions. Finally, the Nasalità and Coronalità traits did not show statistically significant differences between the conditions.
Similarly, a previous Italian study that used the DRT to test intelligibility with audio-only stimuli and different filtering techniques also showed no significant differences in intelligibility for the traits Nasalità and Coronalità [34]. It must be noted that, while the DRT has been extensively used in audio-based studies across different linguistic and phonetic contexts, research integrating the Diagnostic Rhyme Test, its distinctive traits, and articulatory phonetics remains significantly underexplored in the scientific literature. This gap highlights the need for further investigation to better understand their combined impact on intelligibility assessment and speech perception.
The results suggest that, for certain phonetic characteristics, the articulatory movements required are less dependent on visual cues, or that the differences between animation methods are not pronounced enough to affect participants’ perception. The higher intelligibility of the ADMH condition may stem from its more uniform lip movements, which potentially enhance speech intelligibility. At the same time, this methodology does not reproduce the subtle variations and dynamics present in human facial articulatory movements, which were best represented in the MHA condition. Such dynamics are very important for conveying emotional content but may introduce variability that complicates visual decoding. This is in line with the previous literature, which highlighted that the more realistic the avatar’s movement, the more distracted the user, resulting in worse performance in learning tasks [36,37].
Current results could open an interesting line of inquiry regarding how best to balance the need for clarity with the desire for naturalistic animation in virtual avatars.

6. Conclusions

The results of the study indicate that the audio-driven facial animation methodology (ADMH) produced the most intelligible animations among the evaluated methods. One possible interpretation is that the ADMH approach generates more consistent and clearly defined lip movements, which may help listeners to more accurately map the visual cues to corresponding phonetic elements. Although this method might result in animations that are more consistent, it may also lose part of the subtle movements that contribute to the perceived naturalness of a speaker.
Another consideration regards the use of standardized disyllabic word pairs from the Diagnostic Rhyme Test. Real-world communication involves continuous speech with varying intonations, which may interact differently with visual cues. It is conceivable that the results observed in the methodologies considered might vary when tested with more complex linguistic materials or different noise conditions.
Future investigations that could help to overcome some limitations of this study should include a control condition with a real recording of an actor, a qualitative assessment of the avatars to account for a potential “Uncanny Valley” effect [38], and the phonetic competence of participants. Future research should also incorporate continuous speech and diverse acoustic environments to further explore these relationships and to determine whether the observed benefits of ADMH generalize to everyday communication scenarios.
These results provide an initial foundation for future research aimed at developing more effective avatar-based communication systems for both educational and professional settings.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/acoustics7020030/s1, Video S1: ADMH_CIONCA, Video S2: MHA_CIONCA, Video S3: MHLL_CIONCA, Video S4: SADMH_CIONCA.

Author Contributions

Conceptualization, F.C. and M.M.; Data curation, F.C. and M.M.; Investigation, F.C.; Methodology, F.C. and M.M.; Project administration, M.M., A.P., and L.M.; Resources, M.M., A.P., and L.M.; Software, F.C., M.M., and A.P.; Supervision, L.M.; Validation, F.C. and M.M.; Writing—original draft, F.C. and M.M.; Writing—review and editing, M.M., A.P., and L.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministero dell’Università e della Ricerca: I.3.3 Borse PNRR Dottorati innovativi che rispondono ai fabbisogni di innovazione delle imprese. Grant number 39-033-49-DOT22B2TTX-9702.

Informed Consent Statement

Written informed consent was obtained from the participants. The study was approved by the CERS (Comitato Etico per la Ricerca Scientifica) Ethical Committee of the Department of Architecture and Industrial Design of the Università degli Studi della Campania “Luigi Vanvitelli”. Pr. No: CERS-2024-05.

Data Availability Statement

The dataset generated and analyzed during the current study is available from the corresponding author on reasonable request.

Acknowledgments

The authors would like to thank Nicola Prodi and Chiara Visentin for providing access to the material of the Diagnostic Rhyme Test used in this study.

Conflicts of Interest

Aniello Pascale is the COO and founder of Immensive. The paper reflects the views of the scientists and not the company.

References

  1. Bernstein, L.E.; Auer, E.T., Jr.; Takayanagi, S. Auditory speech detection in noise enhanced by lipreading. Speech Commun. 2004, 44, 5–18. [Google Scholar] [CrossRef]
  2. Ma, W.J.; Zhou, X.; Ross, L.A.; Foxe, J.J.; Parra, L.C. Lip-reading aids word recognition most in moderate noise: A Bayesian explanation using high-dimensional feature space. PLoS ONE 2009, 4, e4638. [Google Scholar] [CrossRef]
  3. Okada, K.; Matchin, W.; Hickok, G. Neural evidence for predictive coding in auditory cortex during speech production. Psychon. Bull. Rev. 2018, 25, 423–430. [Google Scholar] [CrossRef] [PubMed]
  4. Chandrasekaran, C.; Lemus, L.; Ghazanfar, A.A. Dynamic faces speed up the onset of auditory cortical spiking responses during vocal detection. Proc. Natl. Acad. Sci. USA 2013, 110, E4668–E4677. [Google Scholar] [CrossRef]
  5. McGurk, H.; MacDonald, J. Hearing lips and seeing voices. Nature 1976, 264, 746–748. [Google Scholar] [CrossRef] [PubMed]
  6. Anderson Gosselin, P.; Gagné, J.P. Older adults expend more listening effort than young adults recognizing speech in noise. J. Speech Lang. Hear. Res. 2011, 54, 944–958. [Google Scholar] [CrossRef]
  7. Tristán-Hernández, E.; Pavón García, I.; López Navarro, J.M.; Campos-Cantón, I.; Kolosovas-Machuca, E.S. Evaluation of psychoacoustic annoyance and perception of noise annoyance inside university facilities. Int. J. Acoust. Vib. 2018, 23, 3–8. [Google Scholar] [CrossRef]
  8. Lamotte, A.-S.; Essadek, A.; Shadili, G.; Perez, J.-M.; Raft, J. The impact of classroom chatter noise on comprehension: A systematic review. Percept. Mot. Skills 2021, 128, 1275–1291. [Google Scholar] [CrossRef]
  9. Hodgson, M.; Rempel, R.; Kennedy, S. Measurement and prediction of typical speech and background-noise levels in university classrooms during lectures. J. Acoust. Soc. Am. 1999, 105, 226–233. [Google Scholar] [CrossRef]
  10. Choudhary, Z.D.; Bruder, G.; Welch, G.F. Visual facial enhancements can significantly improve speech perception in the presence of noise. IEEE Trans. Vis. Comput. Graph. 2023, 29, 4751–4760. [Google Scholar] [CrossRef]
  11. Guastamacchia, A.; Riente, F.; Shtrepi, L.; Puglisi, G.E.; Pellerey, F.; Astolfi, A. Speech intelligibility in reverberation based on audio-visual scenes recordings reproduced in a 3D virtual environment. Build. Environ. 2024, 258, 111554. [Google Scholar] [CrossRef]
  12. Visentin, C.; Prodi, N.; Cappelletti, F.; Torresin, S.; Gasparella, A. Speech intelligibility and listening effort in university classrooms for native and non-native Italian listeners. Build. Acoust. 2019, 26, 275–291. [Google Scholar] [CrossRef]
  13. Hirata, Y.; Kelly, S.D. Effects of lips and hands on auditory learning of second language speech sounds. J. Speech Lang. Hear. Res. 2010, 53, 298–310. [Google Scholar] [CrossRef]
  14. Buechel, L.L. Lip syncs: Speaking… with a twist. In English Teaching Forum; ERIC: Manhattan, NY, USA, 2019; Volume 57, pp. 46–52. [Google Scholar]
  15. Pelachaud, C.; Caldognetto, E.M.; Zmarich, C.; Cosi, P. Modelling an Italian talking head. In Proceedings of the AVSP 2001 International Conference on Auditory-Visual Speech Processing, Scheelsminde, Denmark, 7–9 September 2001; pp. 72–77. [Google Scholar]
  16. Lv, C.; Wu, Z.; Wang, X.; Zhou, M. 3D facial expression modeling based on facial landmarks in single image. Neurocomputing 2019, 355, 155–167. [Google Scholar] [CrossRef]
  17. Ling, J.; Wang, Z.; Lu, M.; Wang, Q.; Qian, C.; Xu, F. Semantically disentangled variational autoencoder for modeling 3D facial details. IEEE Trans. Vis. Comput. Graph. 2022, 29, 3630–3641. [Google Scholar] [CrossRef]
  18. van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499. [Google Scholar]
  19. Fan, Y.; Tao, J.; Yi, J.; Wang, W.; Komura, T. FaceFormer: Speech-driven 3D facial animation with transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  20. Ward, W.; Cole, R.; Bolaños, D.; Buchenroth-Martin, C.; Svirsky, E.; Weston, T. My science tutor: A conversational multimedia virtual tutor. J. Educ. Psychol. 2013, 105, 1115. [Google Scholar] [CrossRef]
  21. Peng, X.; Chen, H.; Wang, L.; Wang, H. Evaluating a 3-D virtual talking head on pronunciation learning. Int. J. Hum.-Comput. Stud. 2018, 109, 26–40. [Google Scholar] [CrossRef]
  22. Devesse, A.; Dudek, A.; van Wieringen, A.; Wouters, J. Speech intelligibility of virtual humans. Int. J. Audiol. 2018, 57, 914–922. [Google Scholar] [CrossRef]
  23. Schiller, I.S.; Breuer, C.; Aspöck, L.; Ehret, J.; Bönsch, A.; Kuhlen, T.W.; Schlittmeier, S.J. A lecturer’s voice quality and its effect on memory, listening effort, and perception in a VR environment. Sci. Rep. 2024, 14, 12407. [Google Scholar] [CrossRef]
  24. International Epic Games. Epic Games Metahuman Creator. Available online: https://metahuman.unrealengine.com (accessed on 21 October 2024).
  25. Alkawaz, M.H.; Mohamad, D.; Basori, A.H.; Saba, T. Blend shape interpolation and FACS for realistic avatar. 3D Res. 2015, 6, 6. [Google Scholar] [CrossRef]
  26. Purushothaman, R. Morph animation and facial rigging. In Character Rigging and Advanced Animation: Bring Your Character to Life Using Autodesk 3ds Max; Apress: New York, NY, USA, 2019; pp. 243–274. [Google Scholar]
  27. Bonaventura, P.; Paoloni, A.; Canavesio, F.; Usai, P. Realizzazione di un Test Diagnostico di Intelligibilità per la Lingua Italiana; Fondazione Ugo Bordoni: Roma, Italy, 1986. [Google Scholar]
  28. Astolfi, A.; Bottalico, P.; Barbato, G. Subjective and objective speech intelligibility investigations in primary school classrooms. J. Acoust. Soc. Am. 2012, 131, 247. [Google Scholar] [CrossRef] [PubMed]
  29. Prodi, N.; Visentin, C.; Farnetani, A. Intelligibility, listening difficulty and listening efficiency in auralized classrooms. J. Acoust. Soc. Am. 2010, 128, 172–181. [Google Scholar] [CrossRef] [PubMed]
  30. ElevenLabs. ElevenLabs: AI Text-to-Speech Platform. Available online: https://elevenlabs.io (accessed on 15 November 2024).
  31. ISO 9921:2003; Ergonomics—Assessment of Speech Communication. International Organization for Standardization: Geneva, Switzerland, 2003.
  32. Peirce, J.W.; Gray, J.R.; Simpson, S.; MacAskill, M.R.; Höchenberger, R.; Sogo, H.; Kastman, E.; Lindeløv, J. PsychoPy2: Experiments in behavior made easy. Behav. Res. Methods 2019, 51, 195–203. [Google Scholar] [CrossRef]
  33. Maffei, L.; Masullo, M. Sens i-Lab: A key facility to expand the traditional approaches in experimental acoustics. In INTER-NOISE and NOISE-CON Congress and Conference; Institute of Noise Control Engineering: Grand Rapids, MI, USA, 2023. [Google Scholar]
  34. Grasso, C.; Quaglia, D.; Farinetti, L.; Fiorio, G.; De Martin, J.C. Wide-band compensation of presbycusis. In Signal Processing, Pattern Recognition and Applications; ACTA Press: Calgary, AB, Canada, 2003. [Google Scholar]
  35. Kondo, K. Estimation of speech intelligibility using objective measures. Appl. Acoust. 2013, 74, 63–70. [Google Scholar] [CrossRef]
  36. Peixoto, B.; Melo, M.; Cabral, L.; Bessa, M. Evaluation of animation and lip-sync of avatars, and user interaction in immersive virtual reality learning environments. In Proceedings of the 2021 International Conference on Graphics and Interaction (ICGI), Porto, Portugal, 4–5 November 2021; IEEE: New York, NY, USA, 2021; pp. 1–7. [Google Scholar]
  37. Makransky, G.; Terkildsen, T.S.; Mayer, R.E. Adding immersive virtual reality to a science lab simulation causes more presence but less learning. Learn. Instr. 2019, 60, 225–236. [Google Scholar] [CrossRef]
  38. Mori, M.; MacDorman, K.F.; Kageki, N. The uncanny valley [From the Field]. IEEE Robot. Autom. Mag. 2012, 19, 98–100. [Google Scholar] [CrossRef]
Figure 1. Recording setup (left); facial capture connected to avatar example (right).
Figure 2. Graph of the flows used to generate facial animations for the avatar.
Figure 3. Example of the experimental setup (left); example of word-pair prompt (right).
Figure 4. Overall intelligibility scores based on animation conditions.
Figure 5. Intelligibility scores among animation conditions compared in terms of the traits.
Figure 6. Intelligibility scores based on the traits.
Figure 7. Intelligibility scores’ distributions for the different trait categories.
Table 1. Distinctive traits considered in the Italian version of the Diagnostic Rhyme Test (in bold).

Trait | Description | Example
Nasalità | Whether the air flows through the nasal cavity. | Nido/Lido
Continuità | Whether the air flows through the oral cavity in a prolonged way over time. | Riso/Liso
Stridulità | Whether airflow passes through a small slit between two very close surfaces. | Cina/China
Anteriorità | Whether the alveolar region is obstructed. | Nesso/Messo
Coronalità | Whether the coronal part of the tongue is raised compared to its resting position. | Sisma/Scisma
Sonorità | Whether vocal cords are close together and therefore vibrate due to the airflow during sound production. | Vino/Fino
Table 2. Within-subjects effects for the repeated-measures ANOVA.

Cases | Sum of Squares | df | Mean Square | F | p | η2p
Animation Type | 2.460 | 3 | 0.820 | 11.287 | <0.001 | 0.249
Residuals | 7.500 | 102 | 0.074 | | |
Trait | 4.952 | 5 | 0.990 | 10.665 | <0.001 | 0.239
Residuals | 15.391 | 170 | 0.091 | | |
Animation Type × Trait | 7.584 | 15 | 0.506 | 6.828 | <0.001 | 0.167
Residuals | 38.248 | 510 | 0.075 | | |