Article
Peer-Review Record

Facial Expression Recognition Based on Auxiliary Models

Algorithms 2019, 12(11), 227; https://doi.org/10.3390/a12110227
by Yingying Wang 1, Yibin Li 1,*, Yong Song 2 and Xuewen Rong 1
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 20 September 2019 / Revised: 24 October 2019 / Accepted: 28 October 2019 / Published: 31 October 2019
(This article belongs to the Special Issue Algorithms for Human-Computer Interaction)

Round 1

Reviewer 1 Report

This paper proposes a combination of several CNNs for facial expression recognition, each focusing on a specific facial part: the eyes, mouth, or nose. These adapted systems are named auxiliary models. I think the paper is interesting, but several aspects must be improved significantly.

In order:

Figure 1: increase font, difficult to see

Figure 2: Bigger photos, remove deformation and improve caption

Figure 3: increase font and include a descriptive caption

Line 228: “four pairs of images…” I do not understand this expression

Figure 8: Increase font

Figure 9: How were these systems tuned?

Line 246: why use this normalization (dividing by max(p_i)) instead of dividing by the sum of all p_i?

Lines 247-253: improve the description. Is the arg max taken over i?

Lines 256 and 257: I do not understand “strong abstract feature” or “strong expression ability”.

Figure 10: please do not deform the figures; it is hard to see the differences

Figures 11-12: can you include the same person with different emotions?

Line 289: 4840±2? What does this mean?

Figure 14: increase font

Figure 15: include confidence intervals to analyze the significance of the results. Improve the caption (“mouse”?).

 

In general, I am not a native English speaker, but I think the English style of the paper must be improved.

 

Author Response

Thank you very much for your suggestions. I have made changes in response to each comment.

1: The font in Figure 1 has been enlarged.

2: Figure 2 has been resized and its caption has been updated accordingly.

3: The size and font of Figure 3 have been adjusted, and the caption now describes the figure in detail: “The structure of a common convolutional neural network. A convolutional neural network mainly consists of convolution layers, pooling layers, and fully connected layers, and the different layers serve different functions: the convolution layers are responsible for feature extraction, the pooling layers perform feature selection, and the fully connected layers perform the classification.” (See the illustrative sketch below.)
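For illustration only, a minimal sketch of such a common CNN is given below in PyTorch, assuming a single-channel 96×96 input and seven expression classes; the layer sizes and names are assumptions for this sketch, not the exact network used in the paper.

```python
import torch.nn as nn

# Minimal sketch of a common CNN as described in the caption above.
# Assumes a 96x96 single-channel input and seven expression classes;
# the layer sizes are illustrative, not the network used in the paper.
common_cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),  # convolution layer: feature extraction
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling layer: feature selection
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 24 * 24, 7),                  # fully connected layer: 7-class classification
)
```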

4: “Four images” refers to the image of the whole face, the image containing only the nose, the image containing only the eyes, and the image containing only the mouth.

5: The font in Figure 8 has been enlarged accordingly.

6: The four models run in parallel. Each model outputs a probability vector over the expressions, and the final expression probabilities are obtained through a weighted fusion algorithm (see the sketch below).

7: Each value is divided by the maximum value so that all values lie in [0, 1], which is convenient for the subsequent calculation.

8: The above equation yields seven values, one per expression class, and the final recognition label is the index of the largest of these seven values.
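For clarity, a minimal illustrative sketch of this fusion step is given below in Python/NumPy; the fusion weights, function name, and example inputs are assumptions for this sketch, not the exact implementation in the paper.

```python
import numpy as np

# Illustrative sketch of the weighted-fusion step described in points 6-8.
def fuse_predictions(prob_vectors, weights):
    """Weighted fusion of per-model probability vectors (one vector per model,
    each with seven entries for the seven expression classes)."""
    fused = np.zeros(7)
    for p, w in zip(prob_vectors, weights):
        fused += w * np.asarray(p, dtype=float)
    fused /= fused.max()          # point 7: divide by the maximum so the values lie in [0, 1]
    return int(np.argmax(fused))  # point 8: the label is the index of the largest of the seven values

# Example: four models (whole face, eyes, nose, mouth) with assumed fusion weights.
face, eyes, nose, mouth = (np.random.dirichlet(np.ones(7)) for _ in range(4))
label = fuse_predictions([face, eyes, nose, mouth], weights=[0.4, 0.2, 0.2, 0.2])
```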

9: Figure 10 has been adjusted.

10: Figures 11 and 12 have been changed. Images of one person showing several different emotions are available, but images of a single person showing all seven expressions could not be obtained.

11: Each expression label contains either 4840 or 4842 samples.

12: Figure 14 has been enlarged.

13: The caption of Figure 15 has been corrected.

 

Reviewer 2 Report

The study is well reported. I would present the results in a broader form and discuss more deeply the reasons for the achieved performance compared to other approaches.

The results are very encouraging. Are there any factors or threats that could reduce the achieved performance? How scalable and reliable is the approach?

 

Author Response

Thank you very much for your comments.

If the facial expression images are partially occluded, the final recognition result will be affected. This is the next problem that I will research.

Reviewer 3 Report

The paper presents a method that combines multiple facial sub-regions with the entire face image, which can capture more of the important feature information and thus improve the final recognition accuracy. The authors state that the eyes, nose, and mouth are the most sensitive facial parts and play a significant role in the final expression recognition result.

The paper is well organized, and the conducted research and its results are presented and explained at a good level. Nevertheless, there are some style issues to be considered and fixed by the authors: the text in Figure 1 is too small for visual perception (it may be better to visualize the proposed algorithm in UML notation); the second-level headings (3.3, 3.4) begin with inconsistent capitalization; tables and figures need to be placed after they are first mentioned in the text (e.g., Table 2); some figure captions are not well written (e.g., “Figure 3. commoncnn”); the figure text is also extremely small; and Figure 4 should be explained in detail.

Also, the authors claim that their proposed method finds the location of the eyes, which is crucial for the method to work. It is not clear from the paper which algorithm stands behind this operation. How will the method behave when the eyes are difficult to detect, for example because of worn sunglasses? Can the system be adapted to work with video clips?

In addition, it would be interesting to evaluate the proposed method under different lighting conditions (e.g., with the Oulu-CASIA NIR-VIS database) and with different gaze directions (e.g., with the Radboud Faces Database (RaFD)).

It is also not clear why the authors propose to resize the images to 96×96 pixels.

Author Response

Thank you very much for your comments.

1: The font in Figure 1 has been enlarged.

2: The title of Section 3.4 has been changed.

3: The caption of Figure 3 has been changed to: “The structure of a common convolutional neural network. A convolutional neural network mainly consists of convolution layers, pooling layers, and fully connected layers, and the different layers serve different functions: the convolution layers are responsible for feature extraction, the pooling layers perform feature selection, and the fully connected layers perform the classification.”

4: The new caption of Figure 4 reads: “The basic operation of the convolution layers. A new feature representation is obtained through the convolution operation, which can then be used to obtain a deeper feature expression. The more convolution kernels there are, the more features can be learned; different convolution kernels produce different feature maps.” (See the illustrative sketch below.)
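As an illustration of this operation, a short sketch is given below in PyTorch; the image, kernel shapes, and random values are assumptions for this sketch only.

```python
import torch
import torch.nn.functional as F

# Sketch of the convolution operation described in the Figure 4 caption:
# each kernel produces its own feature map, so more kernels yield more learned features.
image = torch.randn(1, 1, 96, 96)   # one 96x96 single-channel face image (illustrative)
kernels = torch.randn(8, 1, 3, 3)   # eight 3x3 convolution kernels (illustrative)
feature_maps = F.conv2d(image, kernels, padding=1)
print(feature_maps.shape)           # torch.Size([1, 8, 96, 96]) -- one feature map per kernel
```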

5: Locating the eyes improves the accuracy of locating the other important facial components. If the eyes are occluded, the locations of these components can be estimated from the proportions of the face (see the sketch below).

6: The 96×96 pixel size is an empirical value determined during the experiments.
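For illustration, a minimal sketch of this proportion-based fallback and of the 96×96 resizing is given below using OpenCV; the proportion values are rough assumptions for this sketch, not the calibrated values used in the experiments.

```python
import cv2

# Sketch of the fallback described in point 5: when the eyes cannot be located,
# the eye, nose, and mouth regions are approximated from fixed proportions of the
# detected face box. The proportion values below are rough assumptions.
def crop_regions(face_img):
    h, w = face_img.shape[:2]
    regions = {
        "eyes":  face_img[int(0.20 * h):int(0.50 * h), :],
        "nose":  face_img[int(0.35 * h):int(0.70 * h), int(0.25 * w):int(0.75 * w)],
        "mouth": face_img[int(0.60 * h):int(0.90 * h), int(0.20 * w):int(0.80 * w)],
    }
    # Point 6: every input image is resized to the empirically chosen 96x96 pixels.
    return {name: cv2.resize(region, (96, 96)) for name, region in regions.items()}
```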

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

I think the authors have addressed the main aspects of my previous comments, and the paper can be accepted.
