Article
Peer-Review Record

The Role of Facial Action Units in Investigating Facial Movements During Speech

Electronics 2025, 14(10), 2066; https://doi.org/10.3390/electronics14102066
by Aliya A. Newby *, Ambika Bhatta, Charles Kirkland III, Nicole Arnold and Lara A. Thompson
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 25 March 2025 / Revised: 12 May 2025 / Accepted: 12 May 2025 / Published: 20 May 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper aimed to determine facial action unit responses during unimpaired subjects' speech. Video recordings were made of the subjects' faces speaking 17 different phonemes and repeating them 6 times. AU amplitudes were detected using the PyFeat Python package. Mean amplitudes across frames were calculated and presented in box plots and tables. In summary, an interesting idea has been proposed: studying phoneme pronunciation based on visual information. 

On the other hand, because the article has been submitted to the journal Electronics, it should contain more technical details. The introduction, which describes the motivation for the research, could be shortened and include fewer references. But it should be supplemented with background information on how other researchers have studied the variation of facial features while pronouncing different phonemes. Similar studies exist, as there have been efforts to create a talking human face and to identify phonemes using visemes and so on. This would allow us to compare the results achieved with SOTA and to draw more meaningful conclusions.

 

Author Response

Comment 1:

The paper aimed to determine facial action unit responses during unimpaired subjects' speech. Video recordings were made of the subjects' faces speaking 17 different phonemes and repeating them 6 times. AU amplitudes were detected using the PyFeat Python package. Mean amplitudes across frames were calculated and presented in box plots and tables. In summary, an interesting idea has been proposed: studying phoneme pronunciation based on visual information. 

On the other hand, because the article has been submitted to the journal Electronics, it should contain more technical details. The introduction, which describes the motivation for the research, could be shortened and include fewer references. But it should be supplemented with background information on how other researchers have studied the variation of facial features while pronouncing different phonemes. Similar studies exist, as there have been efforts to create a talking human face and to identify phonemes using visemes and so on. This would allow us to compare the results achieved with SOTA and to draw more meaningful conclusions.

Response 1:

Per your recommendation, we have supplemented the background information in the paper on how other researchers have studied facial feature variations during speech (lines 109–131).

The studies below use vastly different approaches, and with different goals, than our study, which utilizes facial action units (AUs) (linked to facial muscle groups) to examine the production of speech (phonemes). An audio-visual speech recognition method previously utilized side-face images of the lips, captured while the audio was recorded with a microphone. Lip contour geometric features (LCGFs), used for discriminating phonemes, and lip movement velocity features (LMVFs), used for detecting voice activity, were extracted as the audio and visual features. These features were used to assess speech using Hidden Markov Models (HMMs) [22]. Another study designed and compared two HMM classifiers for four emotions, which have varying effects on the properties of different phonemes; the emotional states during speech were classified using phoneme-level modeling [23]. Phonetic feature-based prediction models are used to predict and understand variations in the pronunciation of phonemes, which allows for better modeling of spontaneous speech by focusing on individual articulatory features [24]. Another approach focuses on identifying the relationship between speech sounds and facial movements using a fine-grained statistical correlation analysis between various phonemes and facial anthropometric measurements (AMs) [25]. Another alternative to phoneme analysis using facial movements is the viseme model; viseme maps are particularly useful in supporting accurate visual feature extraction for the visual representation of speech [26,27].
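For context on the viseme maps mentioned at the end of this passage, the sketch below shows the kind of phoneme-to-viseme grouping such maps encode. The class names and groupings here are generic illustrations added for this record, not the specific mapping of [26,27].

```python
# Illustrative phoneme-to-viseme grouping: visually similar phonemes collapse
# into one viseme class. Groupings and labels are generic examples only.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "s": "alveolar-fricative", "z": "alveolar-fricative",
    "a": "open-vowel", "o": "rounded-vowel", "u": "rounded-vowel",
}

def to_visemes(phonemes):
    """Map a phoneme sequence to its viseme classes ('other' if unmapped)."""
    return [PHONEME_TO_VISEME.get(p, "other") for p in phonemes]

print(to_visemes(["p", "a", "m"]))  # ['bilabial', 'open-vowel', 'bilabial']
```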

 

References:

  1. Iwano, T. Yoshinaga, S. Tamura, and S. Furui, “Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images,” J AUDIO SPEECH MUSIC PROC., vol. 2007, no. 1, Art. no. 1, Dec. 2007, doi: 10.1155/2007/64506.
  2. M. Lee et al., “Emotion recognition based on phoneme classes,” in Interspeech 2004, ISCA, Oct. 2004, pp. 889–892. doi: 10.21437/Interspeech.2004-322.
  3. A. Bates, M. Ostendorf, and R. A. Wright, “Symbolic phonetic features for modeling of pronunciation variation,” Speech Communication, vol. 49, no. 2, pp. 83–97, Feb. 2007, doi: 10.1016/j.specom.2006.10.007.
  4. Qu, X. Zou, X. Li, Y. Wen, R. Singh, and B. Raj, “The Hidden Dance of Phonemes and Visage: Unveiling the Enigmatic Link between Phonemes and Facial Features,” Jul. 26, 2023, arXiv: arXiv:2307.13953. doi: 10.48550/arXiv.2307.13953.
  5. C. Yau, D. K. Kumar, and S. P. Arjunan, “Visual recognition of speech consonants using facial movement features,” ICA, vol. 14, no. 1, pp. 49–61, Jan. 2007, doi: 10.3233/ICA-2007-14105.
  6. Cappelletta and N. Harte, “Phoneme-to-Viseme Mapping for Visual Speech Recognition,” presented at the International Conference on Pattern Recognition Applications and Methods, SCITEPRESS, Feb. 2012, pp. 322–329, doi: 10.5220/0003731903220329.

Reviewer 2 Report

Comments and Suggestions for Authors

In this manuscript, there is a potentially promising research direction involving the use of facial action units (AUs) to understand impaired speech production. While the current study focuses on a small sample of unimpaired participants, it nonetheless highlights the broader implications that such an approach could have for future research in both clinical and theoretical domains.

However, the results emerging from this preliminary, proof-of-concept study are analyzed only at a descriptive level, primarily through boxplots that display mean amplitude values for selected phonemes, pooled across participants and repetitions. While this approach may serve to illustrate general trends, it does not account for inter-subject variability or provide inferential support for observations.

Pooling data across participants and repetitions, without modeling individual-level effects, may obscure important within- and between-subject differences. Moreover, the absence of statistical models limits the interpretability and generalizability of the results. To strengthen the empirical foundation of the study, I recommend incorporating more refined analyses, such as linear mixed-effects models, which can account for repeated measures and subject-specific variability. Additionally, reporting confidence intervals or standard errors in the graphical displays would enhance the transparency and robustness of the findings.
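As one illustration of this suggestion (an editorial sketch under an assumed data layout, not an analysis from the manuscript), a linear mixed-effects model with phoneme as a fixed effect and a per-participant random intercept can be fit with statsmodels; the file and column names below are hypothetical.

```python
# Sketch of the suggested linear mixed-effects analysis. Assumes a long-format
# table with one row per trial and columns "participant", "phoneme", and
# "au_amplitude" (mean AU amplitude for that trial); these names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("au_trial_means.csv")

# Phoneme as a fixed effect; a random intercept per participant models the
# repeated measures and subject-specific variability.
model = smf.mixedlm("au_amplitude ~ C(phoneme)", data=df, groups=df["participant"])
result = model.fit()

print(result.summary())     # fixed-effect estimates with standard errors
print(result.conf_int())    # confidence intervals, usable as error bars in plots
```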

Such improvements would make the study more compelling and lay a stronger groundwork for future applications in clinical populations, where the methodological rigor becomes even more critical.

As a minor and local comment, at line 39 I would not speak about client, rather patient or subject.

As a general comment, I suggest that the authors revise the paper and resubmit it to another journal of the MDPI Publishing Group, for example "Acoustics".

Author Response

Comment 1:

In this manuscript, there is a potentially promising research direction involving the use of facial action units (AUs) to understand impaired speech production. While the current study focuses on a small sample of unimpaired participants, it nonetheless highlights the broader implications that such an approach could have for future research in both clinical and theoretical domains.

Response 1:

We thank you for your comment.

Comment 2:

However, the results emerging from this preliminary, proof-of-concept study are analyzed only at a descriptive level, primarily through boxplots that display mean amplitude values for selected phonemes, pooled across participants and repetitions. While this approach may serve to illustrate general trends, it does not account for inter-subject variability or provide inferential support for observations.

Pooling data across participants and repetitions, without modeling individual-level effects, may obscure important within- and between-subject differences. Moreover, the absence of statistical models limits the interpretability and generalizability of the results. To strengthen the empirical foundation of the study, I recommend incorporating more refined analyses, such as linear mixed-effects models, which can account for repeated measures and subject-specific variability. Additionally, reporting confidence intervals or standard errors in the graphical displays would enhance the transparency and robustness of the findings.

Such improvements would make the study more compelling and lay a stronger groundwork for future applications in clinical populations, where the methodological rigor becomes even more critical.

Response 2:

We appreciate your feedback. 

Our results, presented within this preliminary proof-of-concept study, are depicted through boxplots of mean amplitudes pooled across participants and trials. We examined inter-subject variability in that the boxplots depict the median and standard deviations across the group of participants. Outliers were then excluded, and the resulting plots are shown in Figures 3-5. Intra-subject variability, the variability across trials within an individual, is also important; however, here we were interested in establishing a baseline of AU responses to phonemes across a group of unimpaired speakers.
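To make the pipeline described above concrete, the sketch below shows one way per-frame AU amplitudes could be obtained with Py-Feat, averaged per trial, pooled across participants and trials, and drawn as boxplots. File names, column labels, and the outlier rule are assumptions, and the Py-Feat method names follow its documentation and may differ by version; this is not the authors' exact code.

```python
# Minimal sketch, not the authors' pipeline: detect AUs per frame with Py-Feat,
# average amplitudes across frames per trial, pool trials across participants,
# drop outliers, and draw one box per phoneme. Paths, column names, and the
# 1.5*IQR outlier rule are assumptions.
import pandas as pd
import matplotlib.pyplot as plt
from feat import Detector

detector = Detector()  # default face, landmark, and AU models

trials = [
    ("P01", "/a/", 1, "videos/P01_a_trial1.mp4"),  # hypothetical file layout
    # ... one entry per participant / phoneme / trial
]

records = []
for participant, phoneme, trial, path in trials:
    fex = detector.detect_video(path)          # per-frame AU predictions
    row = {"participant": participant, "phoneme": phoneme, "trial": trial}
    row.update(fex.aus.mean().to_dict())       # mean amplitude per AU across frames
    records.append(row)

df = pd.DataFrame(records)

au = "AU25"                                    # example AU column name (lips part)
q1, q3 = df[au].quantile([0.25, 0.75])
iqr = q3 - q1
kept = df[df[au].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]   # exclude outliers

kept.boxplot(column=au, by="phoneme")          # pooled across participants and trials
plt.ylabel(f"Mean {au} amplitude")
plt.suptitle("")
plt.show()
```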

Comment 3:

As a minor and local comment, at line 39 I would not speak about client, rather patient or subject.

Response 3: The authors appreciate the reviewer identifying the misused word. Lines 49-52 now state:

“Currently, speech-language pathologists (SLPs) assess speech disorders through a subjective yet comprehensive assessment: taking a case history of the subject and their family, performing an oral mechanism examination, a hearing screening, and a speech sound assessment.”

 

Comment 4:

As a general comment, I suggest that the authors revise the paper and resubmit it to another journal of the MDPI Publishing Group, for example "Acoustics".

Response 4: We felt that our paper was appropriate for the Electronics Special Issue on "Recent Advances in Audio, Speech and Music Processing and Analysis" because its topic encompasses audio and speech analysis via AUs. We appreciate the reviewer’s suggestion to submit the manuscript to the Acoustics journal.

Reviewer 3 Report

Comments and Suggestions for Authors

Reviewer 1

Comments to the author:

Manuscript ID: electronics-3578491

Title: “The Role of Facial Action Units toward Investigating Facial Movements During Speech”

This study aimed to investigate the role of facial muscles during speech. This topic is of great interest to speech-language pathologists/ therapists who work on the clinical level with children and adults with motor speech disorders.

 
General comments

Research about other alternative techniques for visualizing speech movements must be incorporated in the introduction. Also, the authors should explain how this visualization method differs from others and what the benefits of this new technique are in comparison to those already applied in the speech therapy of children with speech disorders.

Introduction

Line 27. Delete the word unimpaired. It is not necessary.

Line 31. A citation is missing.

Line 37. Childhood apraxia of speech and developmental dysarthria are co-occurring conditions in many organic syndromes such as Down syndrome, Williams-Beuren syndrome, Cri-du-Chat syndrome etc. that are also presenting feeding difficulties. The authors must add this information to the introduction section.

Lines 40-41. The protocol of speech assessment is presented in several clinical handbooks. Add a reference.

Line 42. Add a reference for error analysis.

Lines 43-44. The severity of a speech disorder is also evaluated with the use of indices such as the Percentage of Consonants Correct, the Whole-Word Match, etc., which are objective measures and adequately quantify the severity of speech impairment.

Lines 38-57. References are missing

Lines 75. Other alternative visualization methods that have been used for visualizing speech movements are electropalatography, palatography, and 3D palatography. These methods have also been applied in therapy.

The authors must add this information to their manuscript. They also must explain how the technique they propose is superior in comparison to “traditional” electropalatographic/ palatographic techniques.

Lines 127-128. Please, clarify if the participants were monolinguals or if they were dialectophons. Also, clarify if they spoke the standard American English or British English language. This piece of information is important since the participant's pronunciation may influence significantly mainly the production of vowels and diphthongs examined.

Line 198. Please, describe the technical characteristics of the computer, the camera, and the microphones used.

Lines 318-319. The authors did not discuss, in their limitations section, that their sample included mostly African American adults. There are observable differences in the facial-skeletal features of different races, and different races and cultures express emotions through facial expressions differently. These differences may have influenced the generalizability of their findings.

 

 

Author Response

Comment 1:

Research about other alternative techniques for visualizing speech movements must be incorporated in the introduction.

Response 1:

Per your recommendation, we have supplemented the background information in the paper on how other researchers have studied facial feature variations during speech (lines 109–131).

The studies below use vastly different approaches, and with different goals, than our study, which utilizes facial action units (AUs) (linked to facial muscle groups) to examine the production of speech (phonemes). An audio-visual speech recognition method previously utilized side-face images of the lips, captured while the audio was recorded with a microphone. Lip contour geometric features (LCGFs), used for discriminating phonemes, and lip movement velocity features (LMVFs), used for detecting voice activity, were extracted as the audio and visual features. These features were used to assess speech using Hidden Markov Models (HMMs) [22]. Another study designed and compared two HMM classifiers for four emotions, which have varying effects on the properties of different phonemes; the emotional states during speech were classified using phoneme-level modeling [23]. Phonetic feature-based prediction models are used to predict and understand variations in the pronunciation of phonemes, which allows for better modeling of spontaneous speech by focusing on individual articulatory features [24]. Another approach focuses on identifying the relationship between speech sounds and facial movements using a fine-grained statistical correlation analysis between various phonemes and facial anthropometric measurements (AMs) [25]. Another alternative to phoneme analysis using facial movements is the viseme model; viseme maps are particularly useful in supporting accurate visual feature extraction for the visual representation of speech [26,27].

 

References:

  1. Iwano, T. Yoshinaga, S. Tamura, and S. Furui, “Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images,” J AUDIO SPEECH MUSIC PROC., vol. 2007, no. 1, Art. no. 1, Dec. 2007, doi: 10.1155/2007/64506.
  2. M. Lee et al., “Emotion recognition based on phoneme classes,” in Interspeech 2004, ISCA, Oct. 2004, pp. 889–892. doi: 10.21437/Interspeech.2004-322.
  3. A. Bates, M. Ostendorf, and R. A. Wright, “Symbolic phonetic features for modeling of pronunciation variation,” Speech Communication, vol. 49, no. 2, pp. 83–97, Feb. 2007, doi: 10.1016/j.specom.2006.10.007.
  4. Qu, X. Zou, X. Li, Y. Wen, R. Singh, and B. Raj, “The Hidden Dance of Phonemes and Visage: Unveiling the Enigmatic Link between Phonemes and Facial Features,” Jul. 26, 2023, arXiv: arXiv:2307.13953. doi: 10.48550/arXiv.2307.13953.
  5. C. Yau, D. K. Kumar, and S. P. Arjunan, “Visual recognition of speech consonants using facial movement features,” ICA, vol. 14, no. 1, pp. 49–61, Jan. 2007, doi: 10.3233/ICA-2007-14105.
  6. Cappelletta and N. Harte, “Phoneme-to-Viseme Mapping for Visual Speech Recognition,” presented at the International Conference on Pattern Recognition Applications and Methods, SCITEPRESS, Feb. 2012, pp. 322–329, doi: 10.5220/0003731903220329.

Comment 2:

Also, the authors should explain how this visualization method differs from others and what the benefits of this new technique are in comparison to those already applied in the speech therapy of children with speech disorders.

Response 2:

The authors appreciate the reviewer’s comment to explain how the visualization method applied in the study differs from others. The details of this addition are shown below and in lines 109-131.

“The studies below use vastly different approaches, and with different goals, than our study, which utilizes facial action units (AUs) (linked to facial muscle groups) to examine the production of speech (phonemes). An audio-visual speech recognition method previously utilized side-face images of the lips, captured while the audio was recorded with a microphone. Lip contour geometric features (LCGFs), used for discriminating phonemes, and lip movement velocity features (LMVFs), used for detecting voice activity, were extracted as the audio and visual features. These features were used to assess speech using Hidden Markov Models (HMMs) [22]. Another study designed and compared two HMM classifiers for four emotions, which have varying effects on the properties of different phonemes; the emotional states during speech were classified using phoneme-level modeling [23]. Phonetic feature-based prediction models are used to predict and understand variations in the pronunciation of phonemes, which allows for better modeling of spontaneous speech by focusing on individual articulatory features [24]. Another approach focuses on identifying the relationship between speech sounds and facial movements using a fine-grained statistical correlation analysis between various phonemes and facial anthropometric measurements (AMs) [25]. Another alternative to phoneme analysis using facial movements is the viseme model; viseme maps are particularly useful in supporting accurate visual feature extraction for the visual representation of speech [26,27].”

Reviewer Comment 3:

Line 27. Delete the word unimpaired. It is not necessary.

Response 3: The authors agree with the reviewer’s suggestion. The word “unimpaired” has been removed, and the sentence (line 32) now reads:

“In everyday life, the ability to speak and be understood is something many of us take for granted.”

 

Comment 4:

Line 31. A citation is missing.

Response 4: We appreciate the reviewer recognizing that a citation needed to be added. The sentence at line 31 that needed a citation has been revised and now appears at line 36:

“Disorders wherein the muscles used for speaking are dysfunctional due to neurological damage lead to difficulties with articulation, pronunciation, and clarity of speech.”

Comment 5:

Line 37. Childhood apraxia of speech and developmental dysarthria are co-occurring conditions in many organic syndromes such as Down syndrome, Williams-Beuren syndrome, Cri-du-Chat syndrome etc. that are also presenting feeding difficulties. The authors must add this information to the introduction section.

Response 5: The authors appreciate the reviewer’s comment to add other speech disorders that present similarly to apraxia of speech. The details of this addition are shown below and in lines 44-47.

“Childhood apraxia of speech and developmental dysarthria are co-occurring conditions in many organic syndromes, such as Down syndrome, Williams-Beuren syndrome, and Cri-du-Chat syndrome, that also present feeding difficulties [5,6].”

 

Comment 6:

 

Lines 40-41. The protocol of speech assessment is presented in several clinical handbooks. Add a reference.

Response 6: The authors appreciate the reviewer’s comment and have added a reference for the following:

“During the evaluation of speech, the accuracy of speech production, speech sound errors, and error patterns are observed by the SLP [7].”

 

Comment 7:

Line 42. Add a reference for error analysis.

Author Response 7: We appreciate the reviewer asking for a reference for this particular section. We have added a citation for line 42, now labeled as line 54.

“During the evaluation of speech, the accuracy of speech production, speech sound errors, and error patterns are observed by the SLP [7].”

 

Comment 8:

Lines 43-44. The severity of a speech disorder is also evaluated with the use of indices such as the Percentage of Consonants Correct, the Whole-Word Match, etc., which are objective measures and adequately quantify the severity of speech impairment.

Response 8: We appreciate the reviewer’s additional information; it has been added in lines 56-58 and below:

“Speech is also evaluated with the use of objective measures and indices such as the Percentage of Consonants Correct and the Whole-Word Match [9].”

Comment 9:

Lines 38-57. References are missing

Author Response 9: We appreciate the reviewer’s request for additional references, which have been added to lines 49-74 (previously lines 38-57).

Comment 10:

Lines 75. Other alternative visualization methods that have been used for visualizing speech movements are electropalatography, palatography, and 3D palatography. These methods have also been applied in therapy.

The authors must add this information to their manuscript. They also must explain how the technique they propose is superior in comparison to “traditional” electropalatographic/ palatographic techniques.

Response 10: Thank you for providing additional information that will help strengthen the content of this manuscript. We have added additional text to the corresponding lines of the manuscript.

 

 

Comment 11:

Lines 127-128. Please, clarify if the participants were monolinguals or if they were dialectophons. Also, clarify if they spoke the standard American English or British English language. This piece of information is important since the participant's pronunciation may influence significantly mainly the production of vowels and diphthongs examined.

Response 11: The authors appreciate the reviewer’s comment regarding the participants’ pronunciation. Below is the additional text added to the manuscript:

Lines 172-179: “The participants are known to use only American English, and they represented an ethnicity which showed a uniform bilingual response. Hence, the further impact of the languages they speak and of their pronunciation was considered small. It is true that their facial-skeletal features can greatly influence the presented AU-acoustic-based scheme. However, the objective of this work is to identify the unique enablers of the approach across a uniform ethnogroup.”

 

Comment 12:

Line 198. Please, describe the technical characteristics of the computer, the camera, and the microphones used.

Response 12: We appreciate the reviewer asking additional questions regarding the technical characteristics of the data collection setup. The descriptions have been added to the manuscript and are included below:

Lines 237-240 and 255-258:

“The desktop’s built-in camera was used to facilitate positioning and to circumvent the need to install drivers for a portable camera on different systems. The objective is to develop a portable audio and video capturing system.”

“The recorded video format is MP4, with a bit depth of 8, a frame rate of 29.8 frames per second, and a resolution of 1920 × 1080. The audio format is MP3, with 2 channels, 1024 samples per frame, and a sampling rate of 44,100 Hz.”
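For completeness, the parameters quoted above can be read back from a recording for verification. The snippet below is an illustrative check with OpenCV (the file name is hypothetical), not part of the authors' setup.

```python
# Small verification sketch: read the stated recording parameters back from an
# MP4 file with OpenCV and compare them against the values reported above.
import cv2

cap = cv2.VideoCapture("recording.mp4")           # hypothetical file name
fps = cap.get(cv2.CAP_PROP_FPS)                   # expected ~29.8
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))    # expected 1920
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))  # expected 1080
n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.release()

print(f"{width}x{height} @ {fps:.1f} fps, {n_frames} frames")
```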

Comment 13:

Lines 318-319. The authors did not discuss in their limitation section, that their sample included mostly African/ American adults. There are observable differences in the facial-skeletal features of different races, and different races and cultures express emotions using facial expressions, differently. These differences may have influenced the generalization of their findings.

Response 13:

Thank you.  We acknowledge this comment and feedback. 

 

 

Reviewer 4 Report

Comments and Suggestions for Authors

The authors of the article analyzed the potential use of facial action units (AUs), previously applied mainly in emotion recognition, for accurately describing facial movements accompanying speech. In the study, the facial expressions of 14 adults without speech impairments were recorded while articulating various phonemes, and the data was subsequently analyzed using the Py-Feat software. However, there are a few important issues for which I did not find answers and which should be addressed in the text of the article:

  1. A key limitation of the study is its small sample size (n=14), which restricts the generalizability of the findings to a wider population. Such a small number of participants makes it necessary to dismiss the study's findings as unreliable.
  2. Please provide a justification of how the results of the presented study could be helpful for individuals with speech disorders. Focusing solely on participants without speech impairments does not allow for the assessment of the method’s effectiveness in clinical practice. What specifically could the improvement of speech therapy and rehabilitation techniques for individuals with speech disorders involve, based on the described results?
  3. The boundary conditions of the conducted experiments were not provided.
    The article does not specify the video and audio recording parameters used during the experiments — for example, it lacks information on the audio sampling rate (e.g., 44.1 kHz or 48 kHz), video resolution (e.g., 1080p or 720p), frames per second (fps), or the models of the camera and microphone used. Similarly, the text does not mention which codecs were involved in signal conversion, the bit depth of the audio, or the duration of each recorded signal segment.
    Moreover, it is unclear whether the recordings were made in a studio-like environment insulated from ambient noise, light reflections, and other factors that can negatively affect human auditory and visual perception.
  4. Please include a comment in the article regarding how measurement conditions influence the behavior of Action Units (AUs). Can measurement conditions in an acoustically and visually controlled, isolated environment differ significantly from those conducted in an uncontrolled and non-isolated setting?
  5. Please correct the sentence: "Within daily living situations, the unimpaired ability to speak and to be understood are something many of us take for granted".
    It should rather be written as follows: "In everyday life, the ability to speak and be understood is something many of us take for granted".

Author Response

Reviewer Comment 1:
The authors of the article analyzed the potential use of facial action units (AUs), previously applied mainly in emotion recognition, for accurately describing facial movements accompanying speech. In the study, the facial expressions of 14 adults without speech impairments were recorded while articulating various phonemes, and the data was subsequently analyzed using the Py-Feat software. However, there are a few important issues for which I did not find answers and which should be addressed in the text of the article:
 Response 1: 
Thank you.  We acknowledge this comment and feedback.
Reviewer Comment 2:
A key limitation of the study is its small sample size (n=14), which restricts the generalizability of the findings to a wider population. Such a small number of participants makes it necessary to dismiss the study's findings as unreliable.

Author Response 2: 

The presented work is an exposition of developing AU-based modeling of normal speech sounds and words, assuming that sample size, gender, age, and other sources of variability are secondary to the primary facial muscle activation. Hence, the accuracy and confidence of generalizing the model have not been stated; these will be of greater importance in work that follows the presented study.

Reviewer Comment 3:
Please provide a justification of how the results of the presented study could be helpful for individuals with speech disorders. Focusing solely on participants without speech impairments does not allow for the assessment of the method’s effectiveness in clinical practice. What specifically could the improvement of speech therapy and rehabilitation techniques for individuals with speech disorders involve, based on the described results?

Author Response 3: We appreciate the reviewer’s questions regarding the justification of the results presented in the study.
The improvement of speech therapy and rehabilitation techniques for individuals with speech disorders, using the approach described in this study, could potentially involve enhanced articulation of various phonemes. This is because the assessment allows subjects to focus on their unique speech challenges. The results presented in this study were obtained from subjects without speech impairments, which helps establish a benchmark of values. Individuals with impaired speech could use these benchmarks to compare their own results when assessing their speech.
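As a purely hypothetical illustration of this benchmarking idea, a subject's mean AU amplitude for a phoneme could be compared with the unimpaired-group distribution via a z-score; the table layout, column names, and threshold below are assumptions, not values from the study.

```python
# Hypothetical benchmarking sketch: compare an individual's mean AU amplitude for
# a phoneme against the unimpaired-group values via a z-score.
import pandas as pd

benchmarks = pd.read_csv("unimpaired_au_means.csv")   # hypothetical benchmark table

def compare_to_benchmark(benchmarks, phoneme, au, subject_value, threshold=2.0):
    """Z-score of a subject's AU mean relative to the unimpaired group for one phoneme."""
    group = benchmarks.loc[benchmarks["phoneme"] == phoneme, au]
    z = (subject_value - group.mean()) / group.std()
    return z, abs(z) > threshold                       # flag large deviations

z, atypical = compare_to_benchmark(benchmarks, "/a/", "AU25", subject_value=0.42)
print(f"z = {z:.2f}; outside typical range: {atypical}")
```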

Reviewer Comment 4:

The boundary conditions of the conducted experiments were not provided.
The article does not specify the video and audio recording parameters used during the experiments — for example, it lacks information on the audio sampling rate (e.g., 44.1 kHz or 48 kHz), video resolution (e.g., 1080p or 720p), frames per second (fps), or the models of the camera and microphone used. Similarly, the text does not mention which codecs were involved in signal conversion, the bit depth of the audio, or the duration of each recorded signal segment.
Moreover, it is unclear whether the recordings were made in a studio-like environment insulated from ambient noise, light reflections, and other factors that can negatively affect human auditory and visual perception.

Author Response 4: We appreciate the reviewer asking additional questions regarding the boundary conditions of the data collection setup. The descriptions have been added to the manuscript and are included below:
Lines 237-240 and 255-258:
“The desktop’s built-in camera was used to facilitate positioning and to circumvent the need to install drivers for a portable camera on different systems. The objective is to develop a portable audio and video capturing system.”
“The recorded video format is MP4, with a bit depth of 8, a frame rate of 29.8 frames per second, and a resolution of 1920 × 1080. The audio format is MP3, with 2 channels, 1024 samples per frame, and a sampling rate of 44,100 Hz.”
Reviewer Comment 5:
Please include a comment in the article regarding how measurement conditions influence the behavior of Action Units (AUs). Can measurement conditions in an acoustically and visually controlled, isolated environment differ significantly from those conducted in an uncontrolled and non-isolated setting?

Author Response 5: We appreciate the reviewer asking additional questions regarding the measurement conditions. The descriptions have been added to lines 280-288 of the manuscript and are included below:
“In the presented data collection, sufficient natural or white light exposure was strictly considered, and maximum coverage of the head shot in the field of view (FOV) was emphasized. For the acoustic recording, ambient noise was restricted by enclosing the recording room. A Logitech Snowball microphone was used for its better directional characteristics. However, a detailed study of the negative impact of reverberation and background noise has not been presented here. Some apparent noises were edited out using Audacity and other speech processing software.”
Reviewer Comment 6:
Please correct the sentence: "Within daily living situations, the unimpaired ability to speak and to be understood are something many of us take for granted".
It should rather be written as follows: "In everyday life, the ability to speak and be understood is something many of us take for granted".

Author Response 6:
We appreciate the reviewer’s feedback. The suggested change has been made in lines 32-33 of the manuscript.

 

 

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript needs small changes.

Reference 33 is absent.

References 1 and 6 are written with a bigger font size.

Author Response

The manuscript needs small changes.

Comment 1 Reference 33 is absent.

Comment 2 References 1 and 6 are written with a bigger font size.

Response: The authors appreciate the reviewer’s comments. Reference 33 was added, and the font sizes for references 1 and 6 were reduced so all reference sizes are uniform.

Reviewer 2 Report

Comments and Suggestions for Authors

I have appreciated the effort to clarify all open issues. What is still not totally convincing lies in the experimental analysis and its results.

My suggestion is to clearly state that the results are very preliminary. For example, at line 386, instead of "These findings contribute to a better understanding of ...", the authors should underline the potential, as in "While preliminary, these results highlight aspects that could improve our understanding of ...".

 

Author Response

I have appreciated the effort to clarify all open issues. What is still not totally convincing lies in the experimental analysis and its results.

My suggestion is to clearly state that the results are very preliminary. For example, at line 386, instead of "These findings contribute to a better understanding of ...", the authors should underline the potential, as in "While preliminary, these results highlight aspects that could improve our understanding of ...".

Response: The authors appreciate the reviewer’s comment on highlighting the importance of this contribution. For evaluating speech, different characteristics aligned with speech production, facial features, and speech-audio variations are critical. AU-based analysis is an initial step toward characterizing the role of facial features.

The suggested changes were made in lines 368-374, with the following statement added to better explain the contributions of our findings:

“While preliminary, these results highlight aspects that could potentially develop robust measures for improving and training individuals with impaired speech. The observed differences in AU activation patterns across different sound types can inform speech therapy and rehabilitation techniques for individuals with speech impairments. These insights can also be applied to the development of facial recognition and speech synthesis technologies.”

Reviewer 3 Report

Comments and Suggestions for Authors

Reviewer 1

Comments to the author:

Manuscript ID: electronics-3578491

Title: “The Role of Facial Action Units toward Investigating Facial Movements During Speech”

To begin with, I would like to thank the authors for the time and effort they have put in to address my previous comments. They managed to successfully address most of them. However, the manuscript still needs some minor corrections before it is ready for publication.

Abstract

In the abstract, the authors declare that they collected 84 videos from each participant, six trials per phoneme. Although the process is clear, the sum is incorrect. If for each phoneme they videotaped six trials, we measure:

5 vowels X 6 trials = 30 trials

7 consonants X 6 trials = 42 trials

5 diphthongs X 6 trials = 30 trials

All the above equal 102 trials, not 84. What did I miss?

Introduction

Line 54. Decapitalize the first letter in the word assessment.  

Line 72. Delete the word level. It is unnecessary.

Lines 97-107. This paragraph is not connected with the above manuscript. In this paragraph, the authors explain the utility of visemes in speech acquisition. These lines must be removed from here and placed at line 139. An introductory sentence must be added before the authors start describing the findings of the studies using alternative measures for facial movements. For example, “Several studies have tried to implement alternative methods to assess speech. These studies used vastly different approaches, and with different goals, than our study, which utilizes facial action units (AUs) (linked to facial muscle groups) to examine the production of speech (phonemes). … to traditional methods. Thus, the findings from palatography or EPG studies point out the important impact of the presented AU-acoustic-based assessment feedback approach. It is important to note that many facial expressions are innate. In other words, facial expressions do not vary between a person blind from birth and an individual with progressive loss of sight …

Methods

Lines 177-178. Please rephrase these lines. Your group of participants was not a uniform ethno-group. 

The results of the study are well-presented.

Discussion

The authors should include the small sample of participants in the limitations of the study. Also, for future research, they should propose replicating the study with more participants from different races, as this would permit the generalization of the findings.

Author Response

Comment 1 Title: “The Role of Facial Action Units toward Investigating Facial Movements During Speech”

To begin with, I would like to thank the authors for the time and effort they have put in to address my previous comments. They managed to successfully address most of them. However, the manuscript still needs some minor corrections before it is ready for publication.

Abstract: In the abstract, the authors declare that they collected 84 videos from each participant, six trials per phoneme. Although the process is clear, the sum is incorrect. If for each phoneme they videotaped six trials, we measure:

5 vowels X 6 trials = 30 trials

7 consonants X 6 trials = 42 trials

5 diphthongs X 6 trials = 30 trials

All the above equal 102 trials, not 84. What did I miss?

Response: The authors appreciate the reviewer’s comment. The number of trials was changed to 102, which is the correct number.

Comment 2 Introduction

Line 54. Decapitalize the first letter in the word assessment.  

Line 72. Delete the word level. It is unnecessary.

Lines 97-107. This paragraph is not connected with the above manuscript. In this paragraph, the authors explain the utility of visemes in speech acquisition. These lines must be removed from here and placed at line 139. An introductory sentence must be added before the authors start describing the findings of the studies using alternative measures for facial movements. For example, “Several studies have tried to implement alternative methods to assess speech. These studies used vastly different approaches, and with different goals, than our study, which utilizes facial action units (AUs) (linked to facial muscle groups) to examine the production of speech (phonemes). … to traditional methods. Thus, the findings from palatography or EPG studies point out the important impact of the presented AU-acoustic-based assessment feedback approach. It is important to note that many facial expressions are innate. In other words, facial expressions do not vary between a person blind from birth and an individual with progressive loss of sight …

Response: The authors appreciate the reviewer’s comment. The word “assessment” in line 54 was decapitalized, the word “level” in line 72 was deleted, and additions were made to the paragraph previously in lines 97-107, now shown in lines 97-132.

The changes were made in lines 97-132 of the manuscript and are also displayed below:

“The studies below use vastly different approaches, and with different goals, than our study, which utilizes facial action units (AUs) (linked to facial muscle groups) to examine the production of speech (phonemes). An audio-visual speech recognition method previously utilized side-face images of the lips, captured while the audio was recorded with a microphone. Lip contour geometric features (LCGFs), used for discriminating phonemes, and lip movement velocity features (LMVFs), used for detecting voice activity, were extracted as the audio and visual features. These features were used to assess speech using Hidden Markov Models (HMMs) [18]. Another study designed and compared two HMM classifiers for four emotions, which have varying effects on the properties of different phonemes; the emotional states during speech were classified using phoneme-level modeling [19]. Phonetic feature-based prediction models are used to predict and understand variations in the pronunciation of phonemes, which allows for better modeling of spontaneous speech by focusing on individual articulatory features [20]. Another approach focuses on identifying the relationship between speech sounds and facial movements using a fine-grained statistical correlation analysis between various phonemes and facial anthropometric measurements (AMs) [21]. Another alternative to phoneme analysis using facial movements is the viseme model; viseme maps are particularly useful in supporting accurate visual feature extraction for the visual representation of speech [22,23].

 

On the other hand, techniques like electropalatography (EPG) or palatography involve embedding an electrode on the palate to monitor tongue movements. Though noninvasive and shown to be effective in children and adults [24], it has not proven reliable for vowels and glides, where vocal tract constrictions make a greater contribution [25]. Also, EPG therapy may not be as advantageous, especially for younger children who have not shown resistance to traditional methods. It is important to note that many facial expressions are innate; in other words, facial expressions do not vary between a person blind from birth and an individual with progressive loss of sight [26,27]. Intuitively, the perception of speech, especially in a noisy environment, is better with visual details [28]. Furthermore, speech voice features that are independent of vocal tract characteristics will need other reliable estimators to develop speech-face mapping [29,30]; this has been shown through unique facial trait identifiers combined with acoustic models [31]. This establishes a substantial use of facial expressions for improved learning ability through visual and sound cues, without discarding that visemes are a subset of phonemes, and work related to visemes will further enhance the presented investigation [32]. Thus, these findings establish the important impact of the presented AU-acoustic-based assessment feedback approach.”

 

Comment 3 Methods

Lines 177-178. Please rephrase these lines. Your group of participants was not a uniform ethno-group. 

The results of the study are well-presented.

Response: The authors appreciate the reviewer highlighting this statement. The participants belong to different ethnic groups. The changes were made in lines 163-170 (previously lines 177-178) of the manuscript and are also shown below:

“Participants learned about the study through flyers posted around the university and word of mouth. The population targeted was adults (18–45 years old) who were fluent American English speakers without speech impairments; here, we did not examine non-English-speaking participants. Participants provided their informed consent prior to taking part in the research study. Fourteen participants were enrolled in the study, and their demographics are shown in Table 1. After participants gave their informed consent, their data collection session was scheduled.”

 

Comment 4 Discussion The authors should include in the limitations of the study, the small sample of participants. Also in future research, they should propose the replication of the study with more participants from different races as this condition would permit the generalization of the findings.

Response: The authors appreciate the reviewer highlighting this statement. The limited time available for data collection and processing can be extended and improved in future studies with a similar setup.

The following line was added in line 403-404 and below to highlight possible improvements in future studies:

“We aim to increase the sample size of participants, and the time allocated for data collection in future studies.”

Reviewer 4 Report

Comments and Suggestions for Authors

The authors have implemented the required revisions to the manuscript and have addressed the reviewers’ comments in a detailed and satisfactory manner. The primary concern regarding the small sample size was convincingly justified in the context of the study’s methodology and objectives. 

Author Response

The authors have implemented the required revisions to the manuscript and have addressed the reviewers’ comments in a detailed and satisfactory manner. The primary concern regarding the small sample size was convincingly justified in the context of the study’s methodology and objectives. 

Response: The authors appreciate the reviewer acknowledging the revisions.
