Article
Peer-Review Record

Automatic Lip-Reading System Based on Deep Convolutional Neural Network and Attention-Based Long Short-Term Memory

Appl. Sci. 2019, 9(8), 1599; https://doi.org/10.3390/app9081599
by Yuanyao Lu * and Hongbo Li
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 27 March 2019 / Revised: 9 April 2019 / Accepted: 11 April 2019 / Published: 17 April 2019
(This article belongs to the Special Issue Augmented Reality: Current Trends, Challenges and Prospects)

Round 1

Reviewer 1 Report

The authors need to check their English grammar very carefully. In the experiments, please use the past tense throughout. I recommend that the authors check the manuscript with a professional proofreading service.


Please change "the method we proposed" to "the proposed method".

Please correct "Speech" to "speech" in the introduction.

In the abstract and conclusion sections, the authors need to emphasize the important experimental data to support the proposed idea.

Please use formal English words. For example, the authors should use "Thus" or "Therefore" instead of "So".

The description for Figure 7 is insufficient to understand the concept.


Author Response

Response to Reviewer 1 Comments

 

Point 1: The authors need to check their English grammar very carefully. In the experiments, please use the past tense throughout. I recommend that the authors check the manuscript with a professional proofreading service.

 

Response 1: We have checked the English grammar very carefully and used the past tense throughout the experiments section.

 

Point 2: Please change "the method we proposed" to "the proposed method".

 

Response 2: We have corrected the listed problems and errors, as well as similar issues. In the abstract, “The results show that the method we proposed can not only…” has been replaced by “The results show that the proposed method can not only…” In section 2.3, paragraph 2, “the model we proposed is trained with…” has been replaced by “the proposed model is trained with…”

 

Point 3: Please correct "Speech" to "speech" in the introduction.

 

Response 3: We have changed "Speech" to "speech" in the introduction and have checked for other similar issues.

 

Point 4: In the abstract and conclusion sections, the authors need to emphasize the important experimental data to support the proposed idea.

 

Response 4: In the abstract, we have emphasized the form, content, and pronunciation language of the dataset. Furthermore, we have shown that our proposed model achieves excellent performance on the experimental data.

We have added “in our own independent database (English pronunciation 0 to 9, 3 males and 3 females) …”

We have added experimental results to illustrate this: “The results show that the accuracy of…”

We have clarified the two forms of network presentation: “we compared two lip-reading models: (1) a fusion model…; (2) …”

In the conclusion section, we have elaborated on the form, content, pronunciation, and language of the dataset, and have explained the pronunciation differences between independent speakers and standard speakers in detail, for example:

“The experimental dataset was built independently and consisted of 3 males and 3 females speaking American English digits from 0 to 9; each digit pronunciation was divided into independent video clips, and no speaker was trained in professional pronunciation.”

We have also added experimental results: “…, and the accuracy of the proposed model is 88.2% on the test dataset, which is 3.3% higher than the CNN-LSTM.”

 

Point 5: Please use formal English words. For example, the authors should use "Thus" or "Therefore" instead of "So".

 

Response 5: In section 2.3, paragraph 1, “help” has been replaced by “assist”.

In the abstract, “needs” has been replaced by “requires”.

In section 2.1, paragraphs 2 and 3, “so that” has been replaced by “thus”.

In section 2.1, paragraph 2, step 3, “we get” has been replaced by “is obtained”.

In section 2.2, paragraphs 6 and 7, “so that” has been replaced by “thus”.

In section 2.3, paragraph 2, “so” has been replaced by “thus”.

In the introduction, paragraph 6; in section 2, paragraph 1; in section 2.1, paragraph 2, steps 2 and 3; and in section 2.2, paragraphs 3 and 4, “next” has been replaced by “then”.

 

Point 6: The description for Figure 7 is insufficient to understand the concept.


Response 6: In section 3.2, paragraph 4, we have analyzed the figure (Figure 7 in the original submission; Figure 9 after the figures were renumbered in the revision) in detail and have drawn conclusions from it. For example: “Figure 9 showed the visualization of the attention mechanism. Each row showed the attention weight at each moment for the digits zero to nine; the deeper the color, the greater the attention weight. It could be inferred that the attention for each pronunciation was concentrated near the 3rd and 7th frames, because the 3rd, 4th, 7th, and 8th frames contained the main information of the lip motion. These frames were related to the video theme and followed a certain chronological order, so they were considered important video frames, and the model assigned them a large amount of attention weight. This attention distribution showed that the attention mechanism optimized the proportion of key-frame input and achieved the requirement of allocating more information-processing resources to the important parts.”
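
To make the quoted description concrete, the following minimal Python sketch (our illustration only, not the authors' code; NumPy and Matplotlib are assumed, and the attention scores are synthetic) normalizes per-frame scores with a softmax and renders them as a heat map of the kind described for Figure 9:

import numpy as np
import matplotlib.pyplot as plt

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Synthetic attention scores: 10 digit classes x 16 video frames.
rng = np.random.default_rng(0)
scores = rng.normal(size=(10, 16))
scores[:, [3, 4, 7, 8]] += 2.0  # key frames carrying the main lip motion

weights = softmax(scores, axis=1)  # each row of weights sums to 1

plt.imshow(weights, cmap="Blues", aspect="auto")
plt.xlabel("frame index")
plt.ylabel("digit (zero to nine)")
plt.colorbar(label="attention weight")
plt.show()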

Reviewer 2 Report

The authors present an interesting paper about an automatic lip-reading recognition method, using a hybrid neural network: a Convolutional Neural Network for image feature extraction and an attention-based Recurrent Neural Network for automatic lip-reading recognition. With regard to their paper, I would like to make the following observations:

The abstract section should follow the structure suggested by this Journal; i.e. it should total about 200 words maximum.

Several typos should be revised throughout the paper. E.g. “Automatic lip-reading recognition method our proposed...” should be changed to “Our proposed method for automatic lip-reading recognition...”.

Abbreviations should be defined just the first time they are used; e.g. ROI is defined twice.

The authors should keep the introduction comprehensible to scientists working outside the topic of the paper; e.g. VGG19 and SoftMax should be further explained.

I wonder if Section 2 has the proper title, because it is more a “Materials & Methods” section than a “Related Work” section. Furthermore, the last paragraph of the Introduction section (where the content of Section 2 is advanced) is mistyped.

The authors should explain the meaning of the box “Linea, SoftMax” in Figure 1.

Two figures have been referenced as “Figure 3”; last one should be renumbered to “Figure 4”.

In (the first) Figure 3, the authors should label the left entry to the LSTM unit structure (I assume it is x_t, as for the other three inputs to the LSTM unit, although in all cases [h_{t-1}, x_t] should be entered). Moreover, the authors should explain the extra arrow from the forget gate to c_t.

I infer, from Equation 5, that the authors are using a logistic sigmoid as the input activation function for the LSTM unit; while a “tanh” function is being used as the output activation function. The authors should explain this decision.

The authors should always use a homogeneous notation; e.g. they use both α_{t,i} and α_{ti}, although I assume both notations represent the same “attention weight” term.

I also encourage the authors to further clarify their model for the attention-based LSTM. In this regard, they should explain the role of the W_k terms in (the second) Figure 3. It would also be worthwhile to include a general algorithm describing the whole process.

With regard to section 3, the authors should clarify the first language and the English regional variation of the speakers. Moreover, the influence of these factors on the lip-reading recognition should be discussed.

Finally, the authors should also revise the references section in order to keep consistency and follow the format suggested by the Journal.


Author Response

Response to Reviewer 2 Comments

 

Point 1: The authors present an interesting paper about an automatic lip-reading recognition method, using a hybrid neural network: a Convolutional Neural Network for image feature extraction and an attention-based Recurrent Neural Network for automatic lip-reading recognition. With regard to their paper, I would like to make the following observations:

The abstract section should follow the structure suggested by this Journal; i.e. it should total about 200 words maximum.

 

Response 1: The wording and highlights of the abstract have been carefully revised, and the descriptions have been improved. The abstract now totals 235 words.

 

Point 2: Several typos should be revised throughout the paper. E.g. “Automatic lip-reading recognition method our proposed...” should be changed to “Our proposed method for automatic lip-reading recognition...”

 

Response 2: We have corrected the listed problems and errors, as well as similar issues. For example:

In the abstract, “Automatic lip-reading recognition method our proposed...” has been replaced by “Our proposed method for automatic lip-reading recognition…”

In the introduction, paragraph 5, “The method of automatic lip-reading recognition we proposed can be…” has been replaced by “Our proposed method for automatic lip-reading recognition can be…”

 

Point 3: Abbreviations should be defined just the first time they are used; e.g. ROI is defined twice.

 

Response 3: We have addressed similar issues, for example:

In the conclusion section, we have removed the redundant definition of ROI.

In section 2, paragraph 1, we have removed the redundant definition of CNN.

In section 2.2, paragraph 1, we have removed the redundant definition of RNN.

In section 2, paragraph 1, we have removed the redundant definition of LSTM.

 

Point 4: The authors should keep the introduction comprehensible to scientists working outside the topic of the paper; e.g. VGG19 and SoftMax should be further explained.

 

Response 4: In section 2.3, paragraph 2, we have explained the principles and functions of VGG19 and SoftMax in detail, adding diagrams and formulas to clarify the related concepts: the network architecture diagram of VGG19, and the SoftMax mathematical formula, its calculation process, and its effect.
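
For readers unfamiliar with VGG19, the following PyTorch/torchvision sketch (our illustration under common assumptions, not the authors' implementation) shows how a pretrained VGG19 is typically turned into a per-frame feature extractor by dropping its final classification layer:

import torch
import torchvision.models as models
import torchvision.transforms as T

# Load a VGG19 pretrained on ImageNet and switch to evaluation mode.
vgg19 = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
vgg19.eval()

# Keep the convolutional stack and the fully connected layers up to fc7,
# so each input frame is mapped to a 4096-dimensional feature vector.
feature_extractor = torch.nn.Sequential(
    vgg19.features,
    torch.nn.AdaptiveAvgPool2d((7, 7)),
    torch.nn.Flatten(),
    *list(vgg19.classifier.children())[:-1],
)

# Standard ImageNet preprocessing for a PIL image of the mouth region.
preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# A random tensor stands in for a preprocessed frame here.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    features = feature_extractor(x)  # shape: (1, 4096)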

 

Point 5: I wonder if Section 2 has the proper title, because it is more a “Materials & Methods” section than a “Related Work” section. Furthermore, the last paragraph of the Introduction section (where the content of Section 2 is advanced) is mistyped.

 

Response 5: We have changed the title of section 2 and fixed a title error in section 3.

Section 2: “Related Work” has been changed to “Proposed Lip-Reading Model”.

Section 3: “Experimental Dataset and results” has been changed to “Experimental Dataset and Results”.

We have also summarized the contents of section 2 in the last paragraph of the introduction and deleted the specific details there.

 

Point 6:  The authors should explain the meaning of the box “Linea, SoftMax” in Figure 1.

 

Response 6: We have modified the recognition-result part of Figure 1: the "Linea, SoftMax" box has been split into a "fully connected layers" box and a "SoftMax" box. Furthermore, we have explained the meaning of the relevant content.

We have described Figure 1 in detail in section 2. For example: “Finally, the ten-dimensional features are mapped through two fully connected layers, and the result of automatic lip-reading recognition is predicted by the SoftMax layer. SoftMax normalizes the output of the fully connected layers and classifies it according to probability; the probabilities sum to 1.”
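
For reference, the standard SoftMax formula consistent with this description is (our notation: z denotes the ten-dimensional output of the fully connected layers):

\mathrm{SoftMax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{10} e^{z_k}}, \qquad j = 1, \dots, 10, \qquad \sum_{j=1}^{10} \mathrm{SoftMax}(z)_j = 1.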

 

Point 7:  Two figures have been referenced as “Figure 3”; last one should be renumbered to “Figure 4”.

 

Response 7: We have fixed the issue where two figures were referenced as "Figure 3"; the figure numbers in the subsequent text and in the references have been renumbered accordingly.

 

Point 8: In (the first) Figure 3, the authors should label the left entry to the LSTM unit structure (I assume it is x_t, as for the other three inputs to the LSTM unit, although in all cases [h_{t-1}, x_t] should be entered). Moreover, the authors should explain the extra arrow from the forget gate to c_t.

 

Response 8: We have corrected Figure 3. The left entry to the LSTM unit structure is x_t, and we have labeled it in Figure 3. We have also elaborated on the path from the forget gate to c_t and explained the working principle of the LSTM cell. For example:

“Second, the cell updates useful information into the cell state: the old cell state c_{t-1} is multiplied by the output f_t of the “forget gate”, and the result is summed with the product of the “input gate” output i_t and the candidate information c̃_t. The result of this calculation is the updated c_t.”
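
In the standard LSTM notation that this passage follows (with \odot denoting element-wise multiplication), the quoted update reads:

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t.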

 

Point 9:  I infer, from Equation 5, that the authors are using a logistic sigmoid as the input activation function for the LSTM unit; while a “tanh” function is being used as the output activation function. The authors should explain this decision.

 

Response 9: We have corrected the error in equation (5). In section 2, paragraphs 3 and 4, we have explained why the cell input and output use tanh and why the LSTM gates use the sigmoid function.
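
For context, the conventional LSTM equations (our notation, matching the common formulation rather than quoting the revised manuscript) make this division of labor explicit: the gates use the logistic sigmoid \sigma because they act as soft switches in [0, 1], while the candidate and output activations use \tanh because they carry signed content in [-1, 1]:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \quad i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad o_t = \sigma(W_o [h_{t-1}, x_t] + b_o),
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c), \qquad h_t = o_t \odot \tanh(c_t).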

 

Point 10: The authors should always use a homogeneous notation; e.g. they use both α_{t,i} and α_{ti}, although I assume both notations represent the same “attention weight” term.

 

Response 10: We have corrected the variable name α_{ti} and now use a homogeneous notation throughout.

 

Point 11:  I also encourage the authors to further clarify their model for Attention-based LSTM. In this regard, they should explain the role of Wk terms in (the second) Figure 3. It would also be worthwhile including a general algorithm describing all the process.

 

Response 11: We have replaced W_k with h_n in Figure 4; h_n represents the hidden-state output of the LSTM cell at time n. In the text, we have explained its role and revised Figure 4 in detail to facilitate understanding of the general algorithm of the attention process.

Referring to Figure 4, we have added the input of the attention-based LSTM network (Equation (9)) and renamed the variable, with a description in the body text.
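
As a rough illustration of such a soft-attention layer over LSTM hidden states h_1, …, h_n, here is a minimal NumPy sketch (our own construction under generic assumptions; the parameter names w, b, and u are hypothetical, not the paper's):

import numpy as np

def soft_attention(h, w, b, u):
    # h: (n_frames, hidden_dim) LSTM hidden states.
    # Score each hidden state, then normalize the scores with a softmax.
    e = np.tanh(h @ w + b) @ u      # raw scores, shape (n_frames,)
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()     # attention weights, sum to 1
    context = alpha @ h             # weighted sum of hidden states
    return context, alpha

# Hypothetical dimensions: 16 frames, 256-dimensional hidden states.
rng = np.random.default_rng(0)
h = rng.normal(size=(16, 256))
w = rng.normal(size=(256, 128)) * 0.1
b = np.zeros(128)
u = rng.normal(size=128) * 0.1
context, alpha = soft_attention(h, w, b, u)  # context: (256,), alpha: (16,)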

 

Point 12:  With regard to section 3, the authors should clarify the first language and the English regional variation of the speakers. Moreover, the influence of these factors on the lip-reading recognition should be discussed.

 

Response 12: In section 3.1, we have clarified the speakers' first language and the regional variation of English they speak.

In section 3.2, paragraph 5, we have analyzed the recognition results with respect to the form of the data and explained the impact of these factors on the experimental results. We have also explained the problems this dataset faces in practical application research, as well as its value.

 

Point 13:  Finally, the authors should also revise the references section in order to keep consistency and follow the format suggested by the Journal.

 

Response 13: We have revised the references using the EndNote software with the MDPI reference style from the journal's official website, to keep them consistent.


Round 2

Reviewer 1 Report

The authors replied to all of the reviewer's questions very clearly, so the paper can be accepted.
