Article
Peer-Review Record

FPGA Chip Design of Sensors for Emotion Detection Based on Consecutive Facial Images by Combining CNN and LSTM

Electronics 2025, 14(16), 3250; https://doi.org/10.3390/electronics14163250
by Shing-Tai Pan * and Han-Jui Wu
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 18 July 2025 / Revised: 10 August 2025 / Accepted: 14 August 2025 / Published: 15 August 2025
(This article belongs to the Special Issue Lab-on-Chip Biosensors)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors
  • In Figure 5, the facial image size is indicated as 418×418, whereas in Figure 4, it is listed as 428×428. Please verify which value is correct and revise the figures accordingly to ensure consistency.

  • A reference for Terasic’s VEEK-MT2S board appears to be missing. It is recommended to include an appropriate citation or official source for the hardware.

  • This paper employs a CLDNN model to recognize facial emotions and demonstrates its effectiveness in terms of accuracy, as shown in Tables 10, 13, and 16. However, if the improvement in accuracy is solely attributed to the application of the CLDNN model, it would be beneficial to include a survey and analysis of similar applications that utilize CLDNN in related domains.

  • While FPGA is commonly used as a prototype platform prior to ASIC implementation, it is also gaining attention as a hardware accelerator that can serve as an alternative to GPUs. Therefore, the paper should clearly articulate the purpose of implementing facial emotion recognition on FPGA hardware.

  • If the primary objective is to develop hardware for inference at the edge in an edge computing context—replacing GPUs—then it is essential to provide a comparison with inference performance on commonly used GPU or CPU-based edge devices. At a minimum, a performance comparison with the PC environment described in Table 2 should be included.
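For the requested PC-versus-FPGA comparison, even a simple latency harness on the PC side would suffice. The following is a minimal Python sketch of such a harness; `run_inference` here is a hypothetical stand-in for the model's forward pass, not the authors' actual code:

```python
import time
import statistics

def run_inference(frame):
    # Hypothetical stand-in for the CLDNN forward pass:
    # maps a frame to one of 7 emotion classes.
    return sum(frame) % 7

def benchmark(fn, inputs, warmup=10, runs=100):
    """Return the median per-inference latency in milliseconds."""
    for x in inputs[:warmup]:          # warm-up calls, excluded from timing
        fn(x)
    times = []
    for i in range(runs):
        x = inputs[i % len(inputs)]
        t0 = time.perf_counter()
        fn(x)
        times.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(times)

frames = [list(range(100))] * 16       # dummy input frames
print(f"median latency: {benchmark(run_inference, frames):.3f} ms")
```

Reporting the median (rather than the mean) makes the figure robust to occasional scheduling spikes on a desktop OS.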

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The paper presents a comprehensive and well-structured approach to real-time emotion recognition using a hybrid CNN-LSTM model (CLDNN) implemented on both PC and FPGA. It effectively demonstrates high recognition accuracy across multiple datasets (RAVDESS, BAUM-1s, eNTERFACE’05) and confirms the feasibility of hardware deployment on an FPGA platform using HLS techniques.

The paper is technically sound, and the experimental results support the claims. However, there are some minor improvements, grammatical errors, awkward phrases, and technical errors that should be addressed before publication.

The Introduction, Related Works, and References sections should be improved, and some recent articles on emotion detection should be referenced. Only 7 of the 25 references were published in the last 5 years. Please update the reference list with the most recent research and review it in the Introduction or Related Works sections.

An important technical issue is that the quality and size of all figures in the manuscript should be improved. In addition, the text in the figures is barely visible; it should be much larger and readable.

There are some grammatical and style issues that must be improved before publication, such as:

  • In the Introduction, “image and speech emotion recognition technologies have become significant applications.” sounds awkward. It would read better as “image and speech emotion recognition have become key applications of AI.” The phrase “such as driver monitor” should be “such as driver monitoring”.
  • In Section 3, “will be also introduced” should be “will also be introduced”.
  • Throughout the manuscript there is inconsistent usage of “model inference” vs. “model’s inference”. Also, “CNN” is sometimes written correctly and sometimes expanded inconsistently: “convolution neural networks” should be “convolutional neural networks”.

The paper presents a solid contribution and only requires minor revisions. Once these issues are addressed, it will be ready for publication.

 

Comments on the Quality of English Language

There are some grammatical and style issues that must be improved before publication, such as:

  • In the Introduction, “image and speech emotion recognition technologies have become significant applications.” sounds awkward. It would read better as “image and speech emotion recognition have become key applications of AI.” The phrase “such as driver monitor” should be “such as driver monitoring”.
  • In Section 3, “will be also introduced” should be “will also be introduced”.
  • Throughout the manuscript there is inconsistent usage of “model inference” vs. “model’s inference”. Also, “CNN” is sometimes written correctly and sometimes expanded inconsistently: “convolution neural networks” should be “convolutional neural networks”.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This article introduces a deep learning model designed for sequential facial emotion recognition. The proposed model was implemented on an embedded system using an FPGA chip.

The literature review covers six relevant studies; however, several of the summaries, particularly those corresponding to references [10], [11], and [13], are too brief and lack sufficient detail. Moreover, the review does not include a comparative analysis of the surveyed papers or a synthesis of their key findings.

Figure 2 is overly simplistic, as it merely illustrates the dataset split into training, validation, and test sets without providing further insight.

In Table 7, the authors present various parameter settings, but they fail to justify the rationale behind the selection of these values.

Regarding Figure 7, which depicts the architecture of the proposed model, the authors do not explain the motivation for using six cascaded LFLBs. Clarification is needed on whether the purpose relates to feature abstraction, spatial reduction, increased non-linearity, or other factors.
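For context on the spatial-reduction question, an LFLB-style stack typically halves the feature-map size at each pooling step. A minimal pure-Python trace illustrates this; the kernel, stride, and padding values below are illustrative assumptions, not the paper's actual settings:

```python
def conv_out(size, kernel, stride=1, pad=0):
    # Standard conv/pool output-size formula: floor((n + 2p - k) / s) + 1
    return (size + 2 * pad - kernel) // stride + 1

def trace_lflb_stack(input_size, n_blocks=6, conv_k=3, conv_pad=1,
                     pool_k=2, pool_s=2):
    """Trace the spatial size through n_blocks of (conv -> 2x2 max-pool).

    Assumes 'same'-style 3x3 convolutions (size-preserving) followed by
    2x2 pooling with stride 2 in each block.
    """
    sizes = [input_size]
    s = input_size
    for _ in range(n_blocks):
        s = conv_out(s, conv_k, 1, conv_pad)   # conv keeps spatial size
        s = conv_out(s, pool_k, pool_s, 0)     # pooling roughly halves it
        sizes.append(s)
    return sizes

print(trace_lflb_stack(428))  # -> [428, 214, 107, 53, 26, 13, 6]
```

Under these assumptions, six blocks reduce a 428×428 input to a 6×6 map, which is one plausible reason for cascading exactly six LFLBs; the authors should state their own rationale explicitly.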

While the results in Figure 11 and Table 12 report high classification accuracy (exceeding 93% for most classes), there is a notable inconsistency with the overall test performance metrics in Table 11, where both accuracy and loss hover around 88%.
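That discrepancy is worth a sanity check before calling it an inconsistency: overall accuracy is a support-weighted average of the per-class recalls, so one or two weak classes can pull it well below the headline per-class numbers. A minimal illustration with made-up counts (these are hypothetical, not the paper's data):

```python
def overall_accuracy(correct, total):
    """Overall accuracy = total correct / total samples (support-weighted)."""
    return sum(correct) / sum(total)

# Hypothetical counts for 7 emotion classes: most above 93% recall,
# one weak class dragging the overall figure down.
totals  = [100, 100, 100, 100, 100, 100, 100]
correct = [95, 94, 96, 94, 95, 93, 50]

recalls = [c / t for c, t in zip(correct, totals)]
print([round(r, 2) for r in recalls])               # per-class recalls
print(round(overall_accuracy(correct, totals), 3))  # 0.881
```

If the per-class and overall figures still cannot be reconciled this way, the tables likely contain an error that the authors should correct.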

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

1. Explain the specific sampling method used on the BAUM-1s dataset and analyze its impact on the model's generalization ability.

2. Given the low accuracy on the BAUM-1s dataset, supplement the analysis of the causes of misclassification for each emotion category, and add robustness tests of the model on different datasets.

3. The graphical logic is clear, but some diagrams are incomplete, and the FPGA implementation flow chart does not show hardware resource utilization.

4. Unify the dataset names and fix the table numbering errors.

5. Supplement formal citation sources for the databases to ensure document integrity.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

All the comments raised in the review have been well addressed.

Reviewer 4 Report

Comments and Suggestions for Authors

All of my concerns have been addressed.
