TheraSense: Deep Learning for Facial Emotion Analysis in Mental Health Teleconsultation
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper describes the implementation and performance evaluation of “TheraSense”, a system that aims to provide real-time emotion recognition from video streams during teleconsultations in mental healthcare.
While the core technologies of face and emotion recognition are not new, the proposed system remains innovative by including these technologies in the telemedicine context.
The paper is well structured and the overall quality of the presentation is good. There are, of course, some details that can be improved. Below is a list of suggestions on how to improve the paper:
· A structured abstract with subheadings (e.g. Background, Method, Results, Conclusion) would quickly convey the main ideas and improve readability.
· The introduction highlights the general benefits of teleconsultation, such as accessibility, reduced travel, and improved patient flow. However, the motivation to include automatic emotion recognition (AER) in a system where a human therapist (theoretically trained to recognize emotions) is present is not discussed in sufficient detail. While AER makes perfect sense in applications like remote monitoring of elderly patients, or companion robots, the benefits of using AER in parallel with human therapists are less obvious.
· The description of the implementation (Section 4, lines 148-200) presents the technologies used (Node.js, TensorFlow.js, Face-api.js) and the overall architecture, but does not provide specific details on how the data was preprocessed before being fed into the emotion recognition models. For the sake of reproducibility, more details on the preprocessing phase would be welcome (perhaps a data flow diagram or pseudocode for the preprocessing steps).
· Though the keyword “deep learning” is present in the title of the paper, the presentation does not provide sufficient details about the actual learning involved in the solution.
· It is always useful to discuss the limitations of the proposed solution.
Author Response
Thank you for your feedback.
Comments 1 : While the core technologies of face and emotion recognition are not new, the proposed system remains innovative by including these technologies in the telemedicine context.
The paper is well structured and the overall quality of the presentation is good.
Response 1 : Thank you
Comments 2 : A structured abstract with subheadings (e.g. Background, Method, Results, Conclusion) would quickly convey the main ideas and improve readability.
Response 2 : The abstract now includes subheadings (Background, Method, Results, Conclusion) for better readability and clarity.
Comments 3 : The introduction highlights the general benefits of teleconsultation, such as accessibility, reduced travel, and improved patient flow. However, the motivation to include automatic emotion recognition (AER) in a system where a human therapist (theoretically trained to recognize emotions) is present is not discussed in sufficient detail. While AER makes perfect sense in applications like remote monitoring of elderly patients, or companion robots, the benefits of using AER in parallel with human therapists are less obvious.
Response 3 : A paragraph has been added to the introduction discussing how Automatic Emotion Recognition (AER) supports therapists by enhancing subtle emotion detection, providing real-time feedback, and ensuring consistent emotion tracking. This paragraph includes a reference to Wolf (2015), confirming that measuring facial expressions is a valuable tool for assessing emotions and mental health.
Comments 4 : The description of the implementation (Section 4, lines 148-200) presents the technologies used (Node.js, TensorFlow.js, Face-api.js) and the overall architecture, but does not provide specific details on how the data was preprocessed before being fed into the emotion recognition models. For the sake of reproducibility, more details on the preprocessing phase would be welcome (perhaps a data flow diagram or pseudocode for the preprocessing steps).
Response 4 : More details about the data preprocessing phase have been added. Figure 7 illustrates the data flow and preprocessing steps in the browser, clarifying the process before inputting data into emotion recognition models. Figure 1 in the introduction illustrates the process of facial emotion recognition using CNNs. Additionally, Figure 5 has been updated with more detailed information on the architecture of the implementation.
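To make the browser-side preprocessing concrete, a minimal sketch is given below. It is illustrative only and not taken from the manuscript: the function names and the exact steps (grayscale conversion, nearest-neighbour resize, normalization) are assumptions about the kind of pipeline typically applied to a video frame before a CNN-based emotion model, working on a flat RGBA buffer as returned by `CanvasRenderingContext2D.getImageData()`.

```javascript
// Illustrative sketch (not the authors' exact pipeline): typical
// browser-side preprocessing of a video frame before it reaches a
// CNN-based emotion model. The input is a flat RGBA pixel buffer.

// Convert an RGBA pixel buffer to a single-channel grayscale array.
function toGrayscale(rgba) {
  const gray = new Float32Array(rgba.length / 4);
  for (let i = 0; i < gray.length; i++) {
    const r = rgba[4 * i], g = rgba[4 * i + 1], b = rgba[4 * i + 2];
    gray[i] = 0.299 * r + 0.587 * g + 0.114 * b; // ITU-R BT.601 luma weights
  }
  return gray;
}

// Nearest-neighbour resize of a grayscale image to the model input size.
function resize(gray, srcW, srcH, dstW, dstH) {
  const out = new Float32Array(dstW * dstH);
  for (let y = 0; y < dstH; y++) {
    for (let x = 0; x < dstW; x++) {
      const sx = Math.min(srcW - 1, Math.floor((x * srcW) / dstW));
      const sy = Math.min(srcH - 1, Math.floor((y * srcH) / dstH));
      out[y * dstW + x] = gray[sy * srcW + sx];
    }
  }
  return out;
}

// Scale pixel intensities from [0, 255] to [0, 1] for the network input.
function normalize(gray) {
  return gray.map((v) => v / 255);
}
```

In a real deployment these steps would run on each captured frame before detection and classification; the sketch only fixes the order of operations the reviewer asked to see documented.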
Comments 5 : Though the keyword “deep learning” is present in the title of the paper, the presentation does not provide sufficient details about the actual learning involved in the solution.
Response 5 : The manuscript was updated to clarify the use of CNN-based deep learning for emotion recognition, with expanded information on the model architecture, training process, and implementation challenges.
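For readers unfamiliar with what the "learning" in a CNN amounts to, the core layer operation can be sketched in a few lines. This is a generic illustration, not the manuscript's model: a "valid" 2D convolution (strictly, cross-correlation, as in most deep learning frameworks) of a grayscale image with a learned kernel, followed by a ReLU activation.

```javascript
// Illustrative only: the core operation of one CNN layer — a "valid"
// 2D cross-correlation of a grayscale image (flat array, row-major)
// with a k×k kernel, followed by a ReLU activation.
function conv2dValid(img, w, h, kernel, k) {
  const outW = w - k + 1, outH = h - k + 1;
  const out = new Float32Array(outW * outH);
  for (let y = 0; y < outH; y++) {
    for (let x = 0; x < outW; x++) {
      let sum = 0;
      for (let ky = 0; ky < k; ky++) {
        for (let kx = 0; kx < k; kx++) {
          sum += img[(y + ky) * w + (x + kx)] * kernel[ky * k + kx];
        }
      }
      out[y * outW + x] = Math.max(0, sum); // ReLU activation
    }
  }
  return out;
}
```

Training then consists of adjusting the kernel weights by backpropagation so that stacked layers of this operation map face images to emotion class scores.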
Comments 6 : It is always useful to discuss the limitations of the proposed solution.
Response 6 : A paragraph on the limitations, focusing on hardware and network challenges (e.g., accuracy, scalability, network stability, lighting conditions, and device resource requirements), has been added to the conclusion.
Thank you again for your valuable feedback.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The presentation is very well structured and analyzed, and sufficient details are given for understanding the concepts of the intended application. The experimental and analysis methodology is sound, and sufficient details are given to replicate the proposed experimental procedures and analysis. The paper also presents sufficient outcome tests and metrics for the various models in existing approaches and for the presented TheraSense component.
A few minor language corrections should be made regarding missing prepositions and vague sentences. Please check lines 7, 12-13, 216, 309-310, and 406 accordingly.
Author Response
Thank you for your helpful input and suggestions.
Comments 1 : The presentation is very well structured and analyzed, and sufficient details are given for understanding the concepts of the intended application. The experimental and analysis methodology is sound, and sufficient details are given to replicate the proposed experimental procedures and analysis. The paper also presents sufficient outcome tests and metrics for the various models in existing approaches and for the presented TheraSense component.
Response 1 : Thank you very much for your thoughtful feedback and kind words about the structure, analysis, and methodology of our work.
Comments 2 : A few minor language corrections should be made regarding missing prepositions and vague sentences. Please check lines 7, 12-13, 216, 309-310, and 406 accordingly.
Response 2 : We have carefully reviewed the lines you mentioned (7, 12-13, 216, 309-310, and 406) and made the necessary corrections to address the minor language issues, including missing prepositions and vague sentences.
Thank you again!
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
Summary: This article presents TheraSense, a component of the CareSync framework developed for the Smile and SenseCare projects. The authors state that TheraSense enhances teleconsultation by leveraging deep learning to recognize facial emotions and integrates with the SenseCare KM-EP platform, providing mental health practitioners with real-time insights during remote sessions. The authors outline the system's design, including its use cases, architecture, user interface, and interoperability methods within CareSync. Assessment methods included performance evaluations of the video-based emotion recognition system and qualitative analyses through heuristic evaluation and surveys. The results highlight TheraSense's effectiveness in improving user experience and emotion recognition in teleconsultation services.
My overall assessment is a major revision. This article's scope is comprehensive. It outlines a promising and innovative approach using deep learning for facial emotion recognition in a teleconsultation context, which is potentially impactful in mental health services. However, in its current form, the manuscript has serious weaknesses that need further improvement. It falls short in conveying the rigor and validity of its assessments, and the low-quality figures (some being very unclear) make it hard to understand. I suggest the authors carefully address these comments and suggestions. My specific suggestions and comments are listed below:
1. Please define the jargon and acronyms properly when they first appear in the text. For example, in Line 74, “…..states from speech, and highlights the capacity of SER as an initial step..” What is SER?
2. In the Abstract, ”…..TheraSense, as a component of the CareSync framework….” It seems that CareSync is an important framework, but it is hard for the audience to understand what this system does exactly. In Line 102, the authors introduced CareSync succinctly; I would suggest the authors either explain in full detail what CareSync is or delete this term throughout the text.
3. Please revise Figures 3 and 4. In Figure 3, the TheraSense Architecture Model contains a spelling error (Fontend should be corrected to Frontend). High-quality figures are essential for effectively conveying technical details such as system interoperability and architecture. However, both Figures 3 and 4 suffer from inadequate visual representation, which undermines the clarity and professionalism of the work. I also suggest remaking one Figure containing both the Architecture model and Implementation Architecture.
4. Please clarify the assessment methods and results. Section 5.1 is clear; however, sections 5.2 and 5.3 lack clarity. Is there a table or figure that presents the assessment results more quantitatively?
5. There is no mention of external benchmarking or comparisons with existing emotion recognition systems. Is it possible to conduct additional analyses using commercial software such as i-Motion Affectiva to verify or benchmark the results of TheraSense’s emotion recognition system? Including this information would greatly help demonstrate the system's effectiveness and provide evidence of its competitive advantages.
Author Response
Thank you for your helpful input and the time you took to provide these suggestions. We truly appreciate it!
Comments 1 : Please define the jargon and acronyms properly when they first appear in the text. For example, in Line 74, “…..states from speech, and highlights the capacity of SER as an initial step..” What is SER?
Response 1 : Thank you for your comment. The manuscript has been updated to define all terms and acronyms, including "SER" (Speech Emotion Recognition).
Comments 2 : In the Abstract, ”…..TheraSense, as a component of the CareSync framework….” It seems that CareSync is an important framework, but it is hard for the audience to understand what this system does exactly. In Line 102, the authors introduced CareSync succinctly; I would suggest the authors either explain in full detail what CareSync is or delete this term throughout the text.
Response 2 : Thank you for your feedback. We have removed the term "CareSync" from the manuscript to ensure clarity and avoid confusion.
Comments 3 : Please revise Figures 3 and 4. In Figure 3, the TheraSense Architecture Model contains a spelling error (Fontend should be corrected to Frontend). High-quality figures are essential for effectively conveying technical details such as system interoperability and architecture. However, both Figures 3 and 4 suffer from inadequate visual representation, which undermines the clarity and professionalism of the work. I also suggest remaking one Figure containing both the Architecture model and Implementation Architecture.
Response 3 : The manuscript has been updated to address these concerns. Figure 5 now reflects the revised edge/cloud-based architecture. While vital signs monitoring using wearable sensors is planned for future work, the current focus remains on camera data processing with ML/DL methods. Additionally, the previous architecture design figure has been replaced with a user interface diagram in Figure 4 to better align with the updated implementation architecture.
Comments 4 : Please clarify the assessment methods and results. Section 5.1 is clear; however, sections 5.2 and 5.3 lack clarity. Is there a table or figure that presents the assessment results more quantitatively?
Response 4 : The manuscript has been updated to clarify the assessment methods and results in Sections 5.2 and 5.3. Additional tables have been included to present the assessment results more quantitatively, ensuring greater clarity and a more comprehensive understanding of the findings.
Comments 5 : There is no mention of external benchmarking or comparisons with existing emotion recognition systems. Is it possible to conduct additional analyses using commercial software such as i-Motion Affectiva to verify or benchmark the results of TheraSense’s emotion recognition system? Including this information would greatly help demonstrate the system's effectiveness and provide evidence of its competitive advantages.
Response 5 : We understand the importance of benchmarking TheraSense against existing frameworks for emotion detection, such as i-Motion Affectiva. TheraSense was proposed to fill the gap in teleconsultation for mental health, providing real-time emotion detection through a cost-effective, web-based solution that integrates deep learning techniques like CNNs to enhance accuracy and scalability. The manuscript has been updated to include plans for future work that will compare TheraSense's performance in emotion detection with commercial systems.
Your input has been incredibly helpful in refining the paper. Thank you again!
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
The presented article analyzes facial emotion recognition in the field of mental health teleconsultation. This paper presents a novel approach, TheraSense, an interface created as an element of the CareSync architecture. Facial emotion recognition plays an important role in assistive technology, providing appropriate assistance to people in need. Thus, this paper can be an advanced addition to the modern facial emotion recognition framework, because the presented work uses deep learning techniques to analyze emotions. It is recommended to accept this article for publication after careful consideration of the following suggestions:
1. Add a separate paragraph in the introduction section that describes the fundamentals of facial emotion recognition technique and try to make a pictorial framework for the facial emotion recognition system.
2. Add a paragraph in the introduction section that defines your research motivation and research contribution. That is, cite some related articles, describe how your work provides better research outcomes compared with previous works, and explain what motivated this research.
3. A comparative discussion is required inside the experimental section in the form of a graph. Consider converting Table 2 into a graph for the readers' better understanding.
4. Add a separate section for future research trends before the conclusion section or add a separate paragraph for the future research trends inside the conclusion section.
Author Response
Thank you for your helpful input and the time you took to provide these suggestions. We truly appreciate it!
Comments 1 : Add a separate paragraph in the introduction section that describes the fundamentals of facial emotion recognition technique and try to make a pictorial framework for the facial emotion recognition system.
Response 1 : We have added a separate paragraph in the introduction section that explains the fundamentals of the facial emotion recognition technique used in our work. We also included Figure 1 titled "Process of Facial Emotion Recognition Using CNNs" to provide a pictorial framework illustrating how the facial emotion recognition system operates.
Comments 2 : Add a paragraph in the introduction section that defines your research motivation and research contribution. That is, cite some related articles, describe how your work provides better research outcomes compared with previous works, and explain what motivated this research.
Response 2 : We added a paragraph in the introduction outlining our research motivation, contributions, and how our approach improves on previous studies.
Comments 3 : A comparative discussion is required inside the experimental section in the form of a graph. Consider converting Table 2 into a graph for the readers' better understanding.
Response 3 : Following your suggestion, we replaced Table 2 with a graph in the experimental section for a clearer, more visual comparison.
Comments 4 : Add a separate section for future research trends before the conclusion section or add a separate paragraph for the future research trends inside the conclusion section.
Response 4 : We have included a section on future research trends in the conclusion. This section discusses potential improvements to the TheraSense platform.
Your input has been incredibly helpful in refining the paper. Thank you again!
Author Response File: Author Response.pdf
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
I am glad to hear that my suggestions were helpful in refining the manuscript. I think the authors did a tremendous job revising it in such a short time frame. The paper's readability has been greatly enhanced. The text flows much more smoothly, making the findings more accessible to readers.
I also want to say that I appreciate the newly added Figure 1 and Figure 5 very much; they are not only visually appealing but also provide valuable context that strengthens the logic of the paper. Additionally, the newly included tables are a significant improvement that greatly enhances the overall impact of the findings. I hereby recommend acceptance of the paper for publication in its current form.