Article
Peer-Review Record

Lip2Speech: Lightweight Multi-Speaker Speech Reconstruction with Gabor Features

Appl. Sci. 2024, 14(2), 798; https://doi.org/10.3390/app14020798
by Zhongping Dong 1, Yan Xu 1, Andrew Abel 2,* and Dong Wang 3
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 17 October 2023 / Revised: 4 December 2023 / Accepted: 7 December 2023 / Published: 17 January 2024
(This article belongs to the Special Issue Advanced Technology in Speech and Acoustic Signal Processing)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

In the first section (Introduction), the authors substantiate in detail the relevance of their task, improving the quality of speech perception, and the need to use Gabor filtering for this purpose. The second section (literature review) contains a large number of references relevant to this task. It discusses in great detail the trends in the literature on single-speaker and multi-speaker speech, preparing the reader for the discussion of this material and its connection with neural networks. The difficulties of the task are highlighted here, and the reader is prepared for the main tasks of the work. The third section is devoted to the description and detailed analysis of Gabor filtering. It seems to me that the authors apply Gabor filtering very successfully: it significantly reduces the number of analyzed features by focusing on the movement of the speaker's lips rather than analyzing the speaker's entire face. This technique is central to the article and, by applying it in the GaborCNN2Speech and GaborFea2Speech variants, the authors obtain very significant results in compressing the perceived information (simplifying the input and reducing the number of CNN layers from 7 to 1). The authors' conclusions are illustrated by Tables 2-7 and especially Figures 8 and 10. I am very impressed that the authors characterize their results not by one but by several performance indicators. The discussion and conclusions are very important. They include statements of new applied problems arising from the authors' rather self-critical statements about the possibility of practical application of the results obtained.

As a comment, we can point out the need not only to plot graphs illustrating the advantages or disadvantages of the proposed designs, but also to provide analytical estimates explaining why the proposed Gabor filtering techniques are so effective.
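For context on the filter discussed in this report, the following is a minimal sketch of the standard real-valued 2-D Gabor kernel: a Gaussian envelope multiplied by a cosine carrier along a rotated axis. The function name `gabor_kernel` and all parameter values (kernel size, wavelength, orientation, sigma, aspect ratio) are illustrative assumptions and are not taken from the paper under review.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, psi=0.0):
    """Return a size x size real Gabor kernel as a list of rows.

    All parameter values are illustrative assumptions, not the paper's settings.
    """
    half = size // 2
    kernel = []
    for row in range(size):
        y = row - half
        values = []
        for col in range(size):
            x = col - half
            # Rotate the coordinates by theta so the carrier runs along x_r.
            x_r = x * math.cos(theta) + y * math.sin(theta)
            y_r = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope; gamma sets the aspect ratio of the envelope.
            envelope = math.exp(-(x_r ** 2 + (gamma * y_r) ** 2) / (2 * sigma ** 2))
            # Cosine carrier with the given wavelength and phase offset psi.
            carrier = math.cos(2 * math.pi * x_r / wavelength + psi)
            values.append(envelope * carrier)
        kernel.append(values)
    return kernel


# Hypothetical example: a 9 x 9 kernel tuned to horizontal variation.
kernel = gabor_kernel(size=9, wavelength=4.0, theta=0.0, sigma=2.0)
```

Because the kernel responds most strongly to intensity changes at one orientation and spatial frequency, it can highlight edge-like structures such as lip contours while suppressing much of the remaining image content.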

Author Response

Thank you for your review. Please see the attachment for my response to your comments.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Dear Authors, 

The Introduction is a mixture of well-known and completely wrong things.

 

Example 1: Positioning by paraphrase relative to something that has been said by others, without assuming the minimum level of criticism and truth:

 

"Speech reconstruction differs from speech recognition in that it is a language-independent visual-to-audio task with the goal of being unaffected by background noise."

 

This statement is typical pseudoscientific blah-blah. It combines "everything" to create the impression of a holistic approach and understanding, it assumes improper logical delimitations between ideas and between different technological approaches, and, not least, it lends hyper-optimistic credentials to a branch of technology by citation and paraphrase. Do you really think speech recognition today is language-independent and user-independent!?! Can it be language-independent and speaker-independent just because someone uninspired said so somewhere, and because you cite that idea? If this is just a belief, or rather a hope, to which you want to give credit, you should express your support for it in a totally different manner. If it is a sound scientific truth, absolutely great! But please do not forget to deliver the proof(s) that it is language-independent and user-independent!

Speech Reconstruction (SR) from noisy audio-video recordings is a totally different story than SR on mute video surveillance recordings.

Supposing someone solves the second one, then the first one is implicitly solved also (not necessarily the other way around!), but who proved they solved the second one? And if there is only your claim for proof, how many users were tested? How were they tested? Real-time or just on very limited collections of movies?   

 

The fact is, "Speech Reconstruction" is still an unsolved problem in science and technology today, and just saying hyper-optimistic things about it makes no progress!

 

Example 2: in the category of fake ideas masked as concerns about the degree of difficulty of some tasks:

"tonal information was much harder to recognise due to the lack of associated mouth movements"

Please indicate a paper proving that tonal information can be successfully extracted from mute video recordings.

The fact is, the tonal information of the voice was never successfully recognized from mute video recordings!!! With what we have and what we know today as science and technology in the public domain, tonal information is not "much harder to recognise"; it is simply impossible to recognize in mute video recordings!!!

So "was much harder"?!!! When "was" this done?!!! Where?!!! Who proved this "was" done?!!! 

 

Improper usage of acronyms:

PESQ stands precisely for Perceptual Evaluation of Speech Quality, so it cannot simply be renamed "Speech Quality".

STOI is expanded in two different ways: "Short-time Objective Intelligibility" and "speech intelligibility".

 

Improper assessment of the datasets:

"large datasets without any constraints, such as Lip2Wav"

As you said, Lip2Wav contains recordings for only 5 users!

Is this a "large" dataset? Or does it just illustrate the fact that speech reconstruction is nowadays practiced and approached as a speaker-specific task?

 

Next, all kinds of metrics are proposed and used in the field of speech reconstruction, and different approaches are compared using these performance parameters. Still, speech reconstruction is a species of (speech) recognition. Therefore, any two approaches can and should be analyzed and compared using False Accept and False Reject rates in the first place. Unfortunately, this does not happen in this paper or in the current literature. Why? Because when the performance of a speech reconstruction system is expressed using these two objective and counterbalancing measures, current speech reconstruction technology is scored as completely unreliable. Unless you really have some interesting results to announce in terms of FAR and FRR, there is no need to publish another paper whose evaluation relies on parameters that are crafted by design to suggest some level of performance when recognition performance is actually missing and is not expressed using objectively defined performance parameters.
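As an illustration of the two measures named above, the sketch below computes a False Accept Rate (FAR) and a False Reject Rate (FRR) for a single decision threshold. It assumes a generic verification-style setup in which each trial has a similarity score and a genuine/impostor label; how such trials would be defined for speech reconstruction is not specified in the report, so the function `far_frr`, the scores, and the threshold are all hypothetical.

```python
def far_frr(scores, labels, threshold):
    """Compute (FAR, FRR) at one threshold.

    scores: similarity scores, higher means "more likely a match".
    labels: True for genuine trials, False for impostor trials.
    A trial is accepted when its score is >= threshold.
    """
    genuine = [s for s, is_genuine in zip(scores, labels) if is_genuine]
    impostor = [s for s, is_genuine in zip(scores, labels) if not is_genuine]
    if not genuine or not impostor:
        raise ValueError("need at least one genuine and one impostor trial")
    # A false reject is a genuine trial scored below the threshold.
    false_rejects = sum(1 for s in genuine if s < threshold)
    # A false accept is an impostor trial scored at or above the threshold.
    false_accepts = sum(1 for s in impostor if s >= threshold)
    return false_accepts / len(impostor), false_rejects / len(genuine)


# Hypothetical trial data, for illustration only.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [True, True, True, False, False, False]
far, frr = far_frr(scores, labels, threshold=0.5)
```

Raising the threshold lowers FAR and raises FRR, and lowering it does the opposite, which is why the two rates are described as counterbalancing and are usually reported together.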

 

Assembling another performance parameter, as in formula (3), from a combination of other non-objective performance metrics makes no real sense and no real progress in the field. It only illustrates how far one can go in crafting parameters specifically to overvalue the performance that is actually reachable in speech reconstruction today.

 

Things being asserted but not proved:

"Figure 9 demonstrates that our proposed GaborFea2Speech model exhibits minimal sensitivity to the number of speakers when reconstructing spectrograms and waveforms."

Again, where is speech recognition tested here for 100, 1000, 10'000, or 100'000 different speakers?

 

If face recognition, or any kind of object recognition, were "proved" by others for only 34 individuals/categories of objects (and only by using custom-made special performance parameters, but otherwise with very low objectively measurable performance), would you consider it a topic of sound science to be published in Applied Sciences?

Comments on the Quality of English Language

At least the basic English proofing tools that are available by default in most editors should be applied prior to submission.

Author Response

Thank you for your detailed analysis and review. We have responded to your points in the attached PDF file.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The review is attached.

Comments for author File: Comments.pdf

Comments on the Quality of English Language

moderate editing

Author Response

Thank you for your positive review. We have responded to your comments in the attached PDF file.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

The paper introduces an innovative approach to speech reconstruction from visual cues, particularly lip and facial movements, in scenarios where audio signals are disrupted or absent. The authors propose a Gabor-based speech reconstruction system featuring two novel models, GaborCNN2Speech and GaborFea2Speech, which leverage Gabor feature extraction to reduce model complexity and enhance transparency. Comprehensive experiments on the GRID corpus demonstrate the superior performance of these models in sentence and vocabulary reconstruction compared to traditional end-to-end CNN models. Notably, the GaborFea2Speech model stands out for its ability to achieve multi-speaker speech reconstruction without supplementary information, marking a significant milestone in speech reconstruction.

Here are some points to be reviewed:

 

- The paper focuses solely on visual speech signals and does not consider incorporating other modalities like tongue movements that could improve speech reconstruction.

- Only frontal views of the face are used. Using multiple views could provide more visual information.

- The results are only evaluated on a single dataset (GRID). Testing on additional datasets would make the results more robust.

- Only limited quantitative metrics such as Corr2D and PESQ are used for evaluation. More metrics could give further insights.

- The paper claims lightweight models but does not quantify training times, model size, memory usage, etc. to demonstrate this.

- The complexity and training times of the models are not analyzed in detail or compared numerically.

- The Gabor filtering process is not optimized or tuned; the parameters are manually set based on basic preliminary experiments.

- The methods for generating multi-speaker datasets for testing are not clearly explained or justified.

- The effects of variations in training data size are not explored.

- The choice of GRU vs LSTM vs other RNN types is not justified.

- Augmentations like noise addition seem ad hoc and arbitrary rather than systematically studied.

- The model robustness against variations and noise is not analyzed.

- The type of Gabor features extracted could be expanded beyond the basic geometric features described.

- More details could be provided on the Gabor feature pre-processing methods used.

- The effects of Gabor kernel parameters on results are not studied in detail.

- Analysis of why Gabor features work better than raw pixels or other features is lacking.

- Insufficient analysis is given of cases where Gabor features perform as poorly as CNNs.

- The effects of training dataset size should be analyzed more thoroughly.

 

- The choice of optimizer, learning rate schedule, regularization, etc. could be analyzed more.

- Ablation studies could isolate the effects of individual components.

- More visualization and interpretation of what the models are learning could add insights.

- Failure cases and model limitations should be highlighted.

- The speaker independence seems limited, with performance dropping significantly from 1 to 8 speakers.

- The model still seems to require similar speakers between train and test. Testing on completely unseen speakers could be insightful.

- The paper's conclusion promising lightweight models seems overstated based on the evidence presented.

- Directions for future work are quite limited and could be expanded on.

 

 

Some unclear or illogical sentences in the paper:

- "Speech reconstruction enables the generation of speech signals from silent lip movement videos [1]."

- "To the best of our knowledge, there is no relevant related research addressing this problem in this domain."

- "Compared to raw facial pixels as visual input, using the entire face picture as input results in a large amount of visual input data that can be overwhelming for a model to learn."

- "By using Gabor filtering to filter pixel information that is not related to speech, we can reduce the input size from 155,952 to 784."

- "The original pixel-based image input, which required 155,952 inputs, can be reduced to just seven Gabor lip eigenvalues."

- "To the best of our knowledge, this is the first attempt to use Gabor Features for speech reconstruction and validate the effectiveness of Gabor features in speech reconstruction..."

- "By using fewer and more precise visual input values contributes to the accuracy and robustness of the model..."

- "Using Gabor lip eigenvalues, which directly reflect lip changes, additional models are not required, resulting in faster speech reconstruction and more lightweight models."

 

Comments on the Quality of English Language

Extensive editing of the English language required.

Author Response

Thank you for your detailed review. We have carefully gone through each of your points and responded to them in the attached PDF.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

Dear Authors,

Congrats on taking the review remarks as seriously as you did. 

As a result, I can now recommend your manuscript for publication.

All the best, 

App. Sci. Reviewer 101F438C

Nov 27, 2023

 

Comments on the Quality of English Language

No comments, just a recommendation to have one more English check before publication.

Author Response

Thank you for reviewing again. We were glad to have your comments, as they ultimately made for a better paper. We have given it a very thorough proofread and fixed a couple of style issues, as well as a couple of small typos.

 

 

Reviewer 4 Report

Comments and Suggestions for Authors

The authors have addressed most of the reviewers' concerns.

Comments on the Quality of English Language

Moderate editing of the English language required.

Author Response

Thank you for the review. Your feedback was very useful, both directly for the project and for considering future directions. We have reviewed the paper carefully and corrected a few small style issues, as well as a couple of typos.

 

 
