Peer-Review Record

Facial Biosignals Time–Series Dataset (FBioT): A Visual–Temporal Facial Expression Recognition (VT-FER) Approach

Electronics 2024, 13(24), 4867; https://doi.org/10.3390/electronics13244867
by João Marcelo Silva Souza 1,2,*, Caroline da Silva Morais Alves 1,2,*, Jés de Jesus Fiais Cerqueira 2, Wagner Luiz Alves de Oliveira 2, Orlando Mota Pires 1, Naiara Silva Bonfim dos Santos 1, Andre Brasil Vieira Wyzykowski 1, Oberdan Rocha Pinheiro 1, Daniel Gomes de Almeida Filho 1, Marcelo Oliveira da Silva 1 and Josiane Dantas Viana Barbosa 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 28 October 2024 / Revised: 26 November 2024 / Accepted: 30 November 2024 / Published: 10 December 2024
(This article belongs to the Section Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper presents a novel approach to facial expression recognition through the development of the Facial Biosignals Time Series Dataset (FBioT) and a Visual-Temporal Facial Expression Recognition (VT-FER) methodology. While the research offers valuable insights, several issues and challenges can be highlighted throughout the study.

- The introduction is unfocused, doesn't get straight to the point, and includes too many details about the methodology without properly introducing the context.

- The related work section is not focused on similar studies, resulting in a heavy and difficult-to-read first part of the article. Additionally, some important mentions are missing, such as the availability of ecologically valid datasets, just as an example, Marcolin et al., "CalD3r and MenD3s: Spontaneous 3D facial expression databases"

- The authors state that the detailed development of the modules within methodology 1 will be explained in future works. Which step(s) is(are) truly novel? Aren't the described steps (face detection for example) well-known in literature?

- Results do not really prove the effectiveness of the proposed methodologies. (1) Happiness and Surprise are the easiest facial expressions to classify; classification results for the other facial expressions should also be reported. (2) Using only two classes and classifying facial expressions that do not belong to these classes is meaningless. Other emotions should at least be classified into a third class named "Other". (3) The confusion matrix in Figure 33 is very far from state-of-the-art recognition rates. I wonder whether Prototype B is a contribution, considering such poor results (and, once again, the presence of only two classes).

Author Response

 

Comments 1) The introduction is unfocused, doesn't get straight to the point, and includes too many details about the methodology without properly introducing the context.

Responses 1) We appreciate your feedback regarding the introduction. Based on your comments, we have restructured the introduction entirely, categorizing the problem, the contexts associated with Facial Expression Recognition (FER), the main challenges, how these challenges are currently being addressed, the existing gaps, and the contributions proposed by this work.

The revised introduction is now organized into several key parts covering the context and the proposal: an introduction to biosignals and their categories; the characteristics of Facial Expression Recognition (FER) and its spatial and temporal challenges; a brief discussion of how datasets and other approaches currently address these challenges; the central aspect of our proposal, with the respective contributions and hypotheses; and a brief introduction to the results and the new challenges identified.

 

Comments 2) The related work section is not focused on similar studies, resulting in a heavy and difficult-to-read first part of the article. Additionally, some important mentions are missing, such as the availability of ecologically valid datasets, just as an example, Marcolin et al., "CalD3r and MenD3s: Spontaneous 3D facial expression databases"

Response 2) Thank you for your feedback regarding the Related Work section. Based on your suggestions, we have restructured this section to include the main categories of related studies, encompassing static and dynamic datasets. We have highlighted several dynamic datasets that align closely with the focus of this work, as well as the key features of our proposal and foundational concepts derived from related research, such as normalization and frontalization.

Additionally, in the “5.8. Results Discussion” section, we have added a comparative table with relevant considerations about similar datasets available within the scientific community.

We also appreciate your suggestion to include the work “Marcolin et al., CalD3r and MenD3s: Spontaneous 3D facial expression databases.” It has been cited among the studies that consider static aspects in image frames for FER applications.

 

Comments 3) The authors state that the detailed development of the modules within methodology 1 will be explained in future works. Which step(s) is(are) truly novel? Aren't the described steps (face detection for example) well-known in literature?

Response 3) Thank you for your feedback. We acknowledge that some aspects regarding the contributions and objectives of this work were not clearly presented.

In summary, as revised primarily in the Introduction, Related Work, and Results sections, this study focuses on temporal descriptors, which capture not only static or frame-by-frame aspects in a video but also the window corresponding to an expression. This proposal can potentially pave the way for new research in the underexplored area of temporal effects, addressing the various challenges highlighted in the article. What may truly represent a novel contribution are new applications, such as training neural networks (which were only prototyped/tested in this study), that can leverage this proposal in future research.

Some existing applications in the literature that require temporal component analysis could also benefit from this proposal, such as lip reading, tracking attention shifts, analyzing human facial expressions in Emotion Recognition in Conversations (ERC), Automated Lip Reading (ALR), and others. Additionally, new applications may emerge from temporal studies like this one, focusing, for example, on criteria from the fields of psychology, medicine, and human behavior.
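To make the notion of a temporal descriptor more concrete, the minimal sketch below shows one plausible way to convert per-frame facial landmarks into a time series (here, the distance between the two mouth corners), which is the kind of signal the temporal descriptors operate on. It is an illustration under assumptions, not the implementation used in this work: the landmark indices, the 68-point model file, and the use of OpenCV and DLIB for frame handling are assumed only for the example.

# Illustrative sketch: turning DLIB facial landmarks into a per-frame
# measurement series (a simple temporal descriptor). Landmark indices,
# the model file path, and the chosen measurement are assumptions.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def mouth_width_series(video_path):
    """Return the mouth-corner distance (68-point landmarks 48 and 54) per frame."""
    cap = cv2.VideoCapture(video_path)
    series = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if not faces:
            series.append(np.nan)  # no face found: mark a discontinuity
            continue
        shape = predictor(gray, faces[0])
        left = np.array([shape.part(48).x, shape.part(48).y])
        right = np.array([shape.part(54).x, shape.part(54).y])
        series.append(float(np.linalg.norm(right - left)))
    cap.release()
    return np.array(series)

In such a sketch, each video yields one curve per facial measurement, and a window of that curve (for example, the frames spanning a smile) is what the temporal analysis would consume.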

 

Comments 4) Results do not really prove the effectiveness of the proposed methodologies. (1) Happiness and Surprise are the easiest facial expressions to classify; classification results for the other facial expressions should also be reported. (2) Using only two classes and classifying facial expressions that do not belong to these classes is meaningless. Other emotions should at least be classified into a third class named "Other". (3) The confusion matrix in Figure 33 is very far from state-of-the-art recognition rates. I wonder whether Prototype B is a contribution, considering such poor results (and, once again, the presence of only two classes).

 

Response 4) Thank you for your thoughtful feedback regarding the methodology. We would like to emphasize that, following the revisions based on your contributions, we have clarified the objectives of the proposal, which focuses on the dataset (FBioT) and the methodology for generating temporal aspects of expressions (VT-FER). The method's effectiveness is demonstrated through the results of individual modules and prototype tests of neural networks using reference datasets from the scientific community. We reiterate that the neural networks employed in this study aim to showcase experimental applications of the methodology's modules and the data generated by FBioT. Given the broad scope of this proposal, specializing and fine-tuning neural networks lies outside the scope of this article and is reserved for future work.

Additionally, based on your feedback, we have enhanced the neural network testing prototypes with the following adjustments:

1. We introduced a new prototype using the AFEW dataset (Prototype B, with FBioT now being Prototype C). We also revised Prototype A, which uses the CK+ dataset, to address all seven emotions, updating the results and confusion matrices accordingly.

2. Although the prototypes mainly aim to demonstrate applicability, we achieved satisfactory results even with a simple network architecture.

3. The new Prototype B, based on the AFEW dataset, was developed, and despite using a simple network, it achieved performance levels comparable to the state-of-the-art.

These points suggest that in future studies, whether with Prototype B or C (using FBioT), it will be possible to specialize new neural networks to improve the state-of-the-art based on the methodology and dataset proposed in this work.
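For illustration, a "simple network architecture" over such time-series inputs could be as compact as the sketch below, a sequence classifier for seven emotion classes. This is a hedged example only: the layer sizes, sequence length, feature count, and training settings are assumptions and do not describe the actual Prototypes A, B, or C.

# Minimal sketch of a small sequence classifier over biosignal time series.
# All sizes and settings are illustrative assumptions, not the prototypes
# reported in the paper.
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 7     # e.g., the seven emotions addressed by Prototype A
SEQ_LEN = 90        # assumed fixed window length, in frames
NUM_FEATURES = 10   # assumed number of facial measurements per frame

model = tf.keras.Sequential([
    layers.Input(shape=(SEQ_LEN, NUM_FEATURES)),
    layers.Masking(mask_value=0.0),   # ignore zero-padded frames
    layers.LSTM(64),                  # temporal aggregation of the series
    layers.Dense(32, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=30)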

Reviewer 2 Report

Comments and Suggestions for Authors

This paper is about temporal biosignals obtained visually to enable dynamic expression recognition in embedded systems. Here are my comments on the given text:

Please add a table listing all hyperparameters of your proposed model.
Please add a table in the conclusion comparing your proposed method with other related works.
The English language is acceptable in general, but there are some errors that should be corrected.

Author Response

Comments 1) Please add a table listing all hyperparameters of your proposed model.

Responses 1) Thank you for your feedback regarding the description of hyperparameters. We would like to emphasize that, following your suggestions, we have clarified the objectives of the proposal, which focuses on the dataset (FBioT) and the method for generating the temporal aspects of expressions (VT-FER).

The neural networks presented in the article primarily aim to demonstrate applicability using well-known datasets and the results derived from the proposed FBioT. Stress-testing and fine-tuning neural networks fall outside the scope of this work and are planned for future studies.

Based on these considerations and your feedback, we have included detailed information about the neural network results and provided a dedicated page in the Git repository (link: https://github.com/jomas007/biosignals-time-series-dataset/wiki/Neural-Network-Description#afew-arousal-neural-network) where the hyperparameter details are available for further review and consultation by the scientific community.

 

 

Comments 2) Please add a table in the conclusion comparing your proposed method with other related works.

Responses 2) We appreciate your consideration regarding the comparison of our method with other works and datasets. In this regard, we would like to highlight that we have revised the introduction and related work sections to provide a comparative overview of the principles behind our proposal.

Additionally, we have added a new section, “5.8. Results Discussion,” which includes a comparison table of our proposal with other related works. This table is presented as Table 12.

This table compares datasets in terms of video/image modalities, dynamic/static characteristics, controlled versus uncontrolled environments, and various annotation features such as front-facing positions, bias, movement, FACS, and time-series compatibility. Furthermore, a discussion of the criteria and groupings is included throughout this section.

 

Comments 3) English language is acceptable in general, but there are some errors that should be corrected.

Responses 3) We appreciate your feedback regarding the English writing. We have fully revised sections such as the Introduction and Related Work, and made corrections and adjustments throughout the other sections.

 

 

Reviewer 3 Report

Comments and Suggestions for Authors

1. Have additional validation tests been conducted on other public datasets, such as GFT or AFEW, which capture real-world expressions?

2. Could alternative or supplementary landmark extraction methods, such as using deep learning-based models (e.g., MediaPipe or OpenFace), be considered?

3. How does the model handle complex environmental variables, such as dynamic lighting or occlusions?

4. Is it possible to integrate "virtual" landmarks to approximate missing FACS-relevant points like AU9 (nose wrinkles)?

5. How does the time-series approach compare with image-based approaches in terms of latency and accuracy on embedded systems?

Author Response

 

Comments 1) Have additional validation tests been conducted on other public datasets, such as GFT or AFEW, which capture real-world expressions?

Responses 1) We appreciate your consideration regarding the additional validation tests with public datasets. We would like to emphasize that throughout this research, we requested access to various public datasets for comparative reference. We were granted access to several static datasets, which were not suitable for comparison with our proposal. From a dynamic dataset perspective, we initially had access only to the CK+ dataset, as presented in Prototype A.

We requested access to the GFT, MMI, and other datasets; however, we did not receive any responses. More recently, we were granted access to the AFEW repository. Based on your feedback, we have strengthened our neural network test prototypes with the following updates: 1) a new prototype was created using the AFEW dataset (this prototype is named B, while the one using the FBioT dataset is labeled C); 2) Prototype A, which uses CK+, was restructured to account for all seven emotions, with updated results and confusion matrices. We emphasize that the goal is to demonstrate applicability, and despite using a simple network, we achieved satisfactory results. The new Prototype B, based on the AFEW dataset, reached performance levels comparable to the state-of-the-art even with a simple network.

 

Comments 2) Could alternative or supplementary landmark extraction methods, such as using deep learning-based models (e.g., MediaPipe or OpenFace), be considered?

Responses 2) We appreciate your consideration regarding the landmark extraction methods. As discussed in the arguments presented in "Section 1. Introduction," "Section 2. Related Work," "Section 4.2. Feature Extractor Module," "Section 4.4. Measure Maker Module," "Section 5.1. Main Premises and Project Decisions," "Section 5.2. Measurements Proposal and Correlation with FACS," and "Section 6. Future Works and Perspectives," DLIB was used as a preliminary tool for validating the concept of temporal descriptors.

The choice of DLIB was based on its versatility across various environments, as well as the extensive body of literature that utilizes it for model validation.

In this work, "Section 5.3.1. Rotation Estimation" references OpenFace for comparing rotation results. Overall, it is important to note that once the hypothesis is validated in this work, future research will explore more accurate landmark methods, including those with additional points and features, as discussed in "Section 6. Future Works and Perspectives."

 

Comments 3) How does the model handle complex environmental variables, such as dynamic lighting or occlusions?

Responses 3) We appreciate your consideration regarding how we handle the complexity of various environmental variables. Indeed, this is a very sensitive aspect when dealing with uncontrolled environments, as is the case in this work.

As presented in the sections dealing with spatio-temporal normalization and stabilization, such as in "Section 4.3. Video Adjuster Module" and the corresponding results in "Section 5.3. Video Adjuster," as well as in the feature extraction section "4.2. Feature Extractor Module," the handling of environmental complexities is based on several key principles:

  1. The first principle addresses spatial aspects. Face detection is handled by the extractor, and the quality of DLIB’s extraction determines the identification of facial landmarks, which guides the entire pipeline in this proposal. Therefore, the preliminary filtering step is performed on the landmarks by the extractor.
  2. The Video Adjuster Module then evaluates the accuracy of the extracted landmarks, identifying distortions based on the provided coordinates and performing stabilization estimation.
  3. If DLIB fails to detect the face or if spatial normalization cannot recover the face data, a discontinuity is flagged for that frame.
  4. If the number of discontinuities is below a specified threshold (as described in "Section 5.4.3. Measured Data - Dynamic Features"), interpolation is applied. Otherwise, a discontinuity is flagged, indicating the need for context adjustment in subsequent analysis modules (a minimal sketch of this logic follows this list).
  5. In future work, new extractors will be tested to improve these processes, with the goal of providing a more robust set of facial landmarks that are unbiased with respect to color, race, ethnicity, gender, and other factors.
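The sketch below illustrates the gap-handling logic of items 3 and 4, assuming that frames where the face could not be detected or recovered are marked as NaN in the measurement series. The threshold value and function names are hypothetical and do not correspond to the module's actual code.

# Illustrative sketch of discontinuity handling: short gaps are interpolated,
# long gaps remain flagged for context adjustment downstream.
# The maximum gap length is an assumed placeholder value.
import numpy as np

MAX_GAP = 5  # hypothetical threshold, in frames

def fill_short_gaps(series, max_gap=MAX_GAP):
    """Interpolate NaN runs up to max_gap frames; report whether long gaps remain."""
    series = np.asarray(series, dtype=float).copy()
    isnan = np.isnan(series)
    has_long_gap = False
    i = 0
    while i < len(series):
        if not isnan[i]:
            i += 1
            continue
        j = i
        while j < len(series) and isnan[j]:
            j += 1
        gap = j - i
        if gap <= max_gap and i > 0 and j < len(series):
            # linear interpolation between the valid neighbouring frames
            series[i:j] = np.linspace(series[i - 1], series[j], gap + 2)[1:-1]
        else:
            has_long_gap = True  # discontinuity stays flagged for later modules
        i = j
    return series, has_long_gap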

 

 

Comments 4) Is it possible to integrate "virtual" landmarks to approximate missing FACS-relevant points like AU9 (nose wrinkles)?

Responses 4) We appreciate your question regarding Action Unit 9 (#AU9). We would like to emphasize that, currently, due to limitations in DLIB, we are unable to meet the criteria for AU9. In other words, the current practical approach does not allow for this. As a result, future work will focus on using landmark extractors with more points that cover the region where the "nose wrinkles" effect can be observed.

 

Comments 5) How does the time-series approach compare with image-based approaches in terms of latency and accuracy on embedded systems?

Responses 5) We appreciate your consideration regarding the latency and accuracy criteria of our proposal in comparison with others, such as image-based approaches. We would like to emphasize that these aspects will be addressed in future work, as indicated in "Section 6. Future Works and Perspectives."

To clarify the objectives of the project, we have restructured Sections "1. Introduction," "2. Related Work," and "6. Future Works and Perspectives." As a result, such a comparison in the current article is presented as a hypothesis to be validated in future research.

This hypothesis is based on theoretical and practical findings from related works in this area, as cited in "Section 2. Related Work."

 

 

Reviewer 4 Report

Comments and Suggestions for Authors

The paper titled "Facial Biosignals Time Series Dataset (FBioT): A Visual-Temporal Facial Expression Recognition (VT-FER) Approach" is interesting, but there are some major and minor flaws that need to be addressed in the revised version.

1. The authors primarily focus on a binary classification (happy vs. neutral), which may limit the generalizability of the findings across a broader range of emotional expressions. I strongly recommend testing the system with multiple emotions. See “Multilevel feature representation for hybrid transformers-based emotion recognition” and cite it in the literature.

2. The current challenges are not made crystal clear in the introduction section of this paper. I suggest adding a dedicated paragraph about the current challenges in this area, followed by the authors’ contribution to overcoming those challenges.

3. What is the main difficulty when applying the proposed method? The authors should clearly state the limitations of the proposed method in practical applications, and these should be mentioned in the article's conclusion.

4. Please discuss the hyperparameter settings of the proposed model and other comparison models.

5. The results indicate high accuracy, but there are signs of potential overfitting, especially with the limited data used for training, which could affect performance on unseen data; the authors need to address and explain this in the paper as well.

6. The authors address complexities in real-world data, but further elaboration on specific challenges encountered during data collection and analysis is needed. Add some use cases in this regard.

7. The paper needs a more thorough comparison with recent existing facial expression recognition systems to contextualize the contributions of the FBioT dataset and the VT-FER approach. Compare with “STT-Net: Simplified Temporal Transformer for Emotion Recognition”.

8. How can this algorithm be used for action-related tasks, such as those in “Drone-HAT: Hybrid Attention Transformer for Complex Action Recognition in Drone Surveillance Videos” and “Artrivit: Automatic face recognition system using vit-based siamese neural networks with a triplet loss”? These works should be discussed and cited accordingly.

9. The complexity of the proposed model and the model parameter uncertainty are not mentioned.

Comments on the Quality of English Language

moderate

Author Response

 

Comments 1) The authors primarily focus on a binary classification (happy vs. neutral), which may limit the generalizability of the findings across a broader range of emotional expressions. I strongly recommend testing the system with multiple emotions. See “Multilevel feature representation for hybrid transformers-based emotion recognition” and cite it in the literature.

Responses 1) Thank you for your thoughtful feedback regarding the methodology. We would like to emphasize that, following the revisions based on your contributions, we have clarified the objectives of the proposal, which focuses on the dataset (FBioT) and the methodology for generating temporal aspects of expressions (VT-FER). The method's effectiveness is demonstrated through the results of individual modules and prototype tests of neural networks using reference datasets from the scientific community. We reiterate that the neural networks employed in this study aim to showcase experimental applications of the methodology's modules and the data generated by FBioT. Given the broad scope of this proposal, specializing and fine-tuning neural networks lies outside the scope of this article and is reserved for future work.

Additionally, based on your feedback, we have enhanced the neural network testing prototypes with the following adjustments:

  1. We introduced a new prototype using the AFEW dataset (Prototype B, with FBioT now being Prototype C).
  2. For Prototype A, in which we apply the CK+ dataset, we revised it to address all seven emotions, updating the results and confusion matrices accordingly. Although the prototypes mainly aim to demonstrate applicability, we achieved satisfactory results even with a simple network architecture.
  3. The new Prototype B, based on the AFEW dataset, was developed, and despite using a simple network, it achieved performance levels comparable to the state-of-the-art.

These points suggest that in future studies, whether with Prototype B or C (using FBioT), it will be possible to specialize new neural networks to improve the state-of-the-art based on the methodology and dataset proposed in this work.

Additionally, we appreciate the suggestion of the article "Multilevel feature representation for hybrid transformers-based emotion recognition." This is a highly relevant foundational work for future developments and is on our radar for the following phases, as mentioned in "Section 6. Future Works and Perspectives."

 

Comments 2) The current challenges are not made crystal clear in the introduction section of this paper. I suggest adding a dedicated paragraph about the current challenges in this area, followed by the authors’ contribution to overcoming those challenges.

Responses 2) We appreciate your feedback regarding the introduction. Based on your comments, we have restructured the introduction entirely, categorizing the problem, the contexts associated with Facial Expression Recognition (FER), the main challenges, how these challenges are currently being addressed, the existing gaps, and the contributions proposed by this work.

The revised introduction is now organized into several key parts covering the context and the proposal: an introduction to biosignals and their categories; the characteristics of Facial Expression Recognition (FER) and its spatial and temporal challenges; a brief discussion of how datasets and other approaches currently address these challenges; the central aspect of our proposal, with the respective contributions and hypotheses; and a brief introduction to the results and the new challenges identified.

 

 

Comments 3) What is the main difficulty when applying the proposed method? The authors should clearly state the limitations of the proposed method in practical applications, and these should be mentioned in the article's conclusion.

Responses 3) Thank you for your feedback regarding the challenges and limitations of our proposed method. Based on your suggestions, we have made adjustments not only to "Section 7. Conclusions" but also to "Section 6. Future Works and Perspectives."

In summary, the main challenge of this proposal lies in the process of generating the "gold samples," which must be done manually. For instance, the current labeling of FBioT includes only two classes due to the need to manually identify other movement signatures. Additionally, labeling public videos is difficult because they are typically conversation videos; as a result, identifying and extracting emotions, particularly when they overlap with speech, becomes a complex task. This is an area that will be further explored in future research. On the other hand, once the seed samples are available, the proposal offers a scalable automatic process.

Other limitations discussed include issues with the feature extractor and the fact that some action units, such as AU9 in FACS, are not currently covered.

 

Comments 4) Please discuss the hyperparameter settings of the proposed model and other comparison models.

Responses 4) Thank you for your feedback regarding the description of hyperparameters. We would like to emphasize that, following your suggestions, we have clarified the objectives of the proposal, which focuses on the dataset (FBioT) and the method for generating the temporal aspects of expressions (VT-FER).

The neural networks presented in the article primarily aim to demonstrate applicability using well-known datasets and the results derived from the proposed FBioT. Stress-testing and fine-tuning neural networks fall outside the scope of this work and are planned for future studies.

Based on these considerations and your feedback, we have included detailed information about the neural network results and provided a dedicated page in the Git repository (link: https://github.com/jomas007/biosignals-time-series-dataset/wiki/Neural-Network-Description#afew-arousal-neural-network) where the hyperparameter details are available for further review and consultation by the scientific community.

 

Comments 5) The results indicate high accuracy, but there are signs of potential overfitting, especially with the limited data used for training, which could affect performance on unseen data; the authors need to address and explain this in the paper as well.

Responses 5) Thank you for raising this important concern. The primary focus of this work is the development and presentation of the FBioT dataset, with the neural network experiments included as preliminary validation of its potential. These results demonstrate that the dataset supports time-series analysis of biosignals but are not intended as a contribution to advancing neural network methodologies. We acknowledge the risk of overfitting given the limited data used in these preliminary tests. Ongoing work is addressing this issue with more robust architectures and advanced evaluations, including ablation studies, cross-validation, data augmentation, and robustness testing against noise and out-of-distribution scenarios. Additionally, sensitivity analysis, hyperparameter optimization, and ensemble methods are being explored to improve generalizability and validate consistency. The dataset has been made available to facilitate further exploration and enable the development of more advanced models in future research.
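As one concrete example of the evaluation protocols mentioned above, stratified k-fold cross-validation over the time-series samples could be set up as in the sketch below. The baseline classifier, data shapes, and names are assumptions for illustration and are not the authors' pipeline.

# Illustrative sketch of stratified k-fold cross-validation over time-series
# samples, one of the generalization checks mentioned above. The flattened
# logistic-regression baseline is an assumption, not the paper's model.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def cross_validate(X, y, n_splits=5):
    """X: (samples, seq_len, features) time series; y: integer class labels."""
    X_flat = X.reshape(len(X), -1)  # flatten each series for the simple baseline
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(X_flat, y):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_flat[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], clf.predict(X_flat[test_idx])))
    return float(np.mean(scores))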

 

Comments 6) The authors address complexities in real-world data, but further elaboration on specific challenges encountered during data collection and analysis is needed. Add some use cases in this regard.

Responses 6) Thank you for your feedback regarding the use cases. We would like to emphasize that the complexities, from the labeling process to landmark extraction, which relate to real-world issues, are addressed in "Section 1. Introduction," "Section 2. Related Work," "Section 5. Results," and "Section 6. Future Works and Perspectives." In general, these complexities stem from the difficulty of expressing and quantifying temporal features, as well as the challenge of manually creating temporal seeds.

In these sections, we have mentioned several well-known use cases from the literature that require temporal component analysis, which can also benefit from our approach, such as lip reading, tracking attention shifts, analyzing human facial expressions, Emotion Recognition in Conversations (ERC), Automated Lip Reading (ALR), and others. Moreover, new applications and use cases may arise from further temporal studies like the one proposed here, focusing on areas such as psychology, medicine, and human behavior.

 

 

Comments 7) The paper needs a more thorough comparison with recent existing facial expression recognition systems to contextualize the contributions of the FBioT dataset and the VT-FER approach. Compare with “STT-Net: Simplified Temporal Transformer for Emotion Recognition”.

 Responses 7) Thank you for your thoughtful feedback regarding comparing with other datasets and methodologies. We would like to emphasize that the focus of this work is initially on the dataset. To clarify this focus and the objectives of the paper, we have revised the entire "Section 2. Related Work." This section introduces how other datasets work and how our dataset differs from them.

Based on your feedback, we have added a new section, "5.8. Results Discussion," which includes a comparison table of our proposal with related works. This is presented in Table 12.

This table compares datasets in terms of video/image modalities, dynamic/static characteristics, controlled versus uncontrolled environments, and various annotation features such as front-facing positions, bias, movement, FACS, and time-series compatibility. Additionally, the criteria and groupings are discussed throughout this section.

Looking ahead, this work will explore new directions, and we anticipate that future research will focus on the specialization and fine-tuning of neural networks, as well as related approaches like transformers. Therefore, we appreciate the suggestion of the paper "STT-Net: Simplified Temporal Transformer for Emotion Recognition," which has been cited in "Section 6. Future Works and Perspectives" and will be considered in the following phases of our research.

 

Comments 8) How can this algorithm be used for action-related tasks, such as those in “Drone-HAT: Hybrid Attention Transformer for Complex Action Recognition in Drone Surveillance Videos” and “Artrivit: Automatic face recognition system using vit-based siamese neural networks with a triplet loss”? These works should be discussed and cited accordingly.

Responses 8) Thank you for the question. The FBioT dataset is complementary to the methodologies in Drone-HAT and ARTriViT, providing a foundation for incorporating temporal and contextual biosignal information into their respective pipelines. For Drone-HAT, the integration of the FBioT dataset could enhance action recognition by introducing biometric-informed temporal signatures derived from facial movements. However, it is crucial to emphasize that the VT-FER process, used for generating the FBioT dataset, requires clear facial features to be present in the pixel space. If the drone footage does not capture faces with sufficient detail, our method cannot extract the necessary information to generate biosignal-based time series. This requirement is intrinsic to the methodology and must be met for successful implementation.   

Similarly, ARTriViT’s ViT-based Siamese framework could utilize the FBioT dataset for identity verification using biosignal-derived time series. This approach would enable privacy-preserving recognition by removing the need for raw image storage, addressing concerns related to data protection and ethical considerations. Although the integration of FBioT into Drone-HAT and ARTriViT is beyond the scope of this study, the dataset is designed to support such applications, providing researchers with the resources needed to advance these fields in meaningful ways.

 

Comments 9) The complexity of the proposed model and the model parameter uncertainty are not mentioned.

Responses 9) The VT-FER method reflects a deliberate tradeoff in complexity. While some stages demand significant computational effort during dataset generation, this reduces the complexity of subsequent tasks. The design ensures an efficient and practical process, enabling simpler models to analyze the generated time-series data effectively. This balance provides a robust representation of biosignals while maintaining usability across a range of applications.

Regarding parameter uncertainty, to the best of our knowledge, no established analytical or statistical methods are currently available to quantify uncertainty in time-series data derived from VT-FER. This represents an open area for future research, as such methods could enhance reliability and provide further insights into the generated data. We acknowledge this as an important consideration and are exploring it in our ongoing efforts.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have properly addressed all my comments and suggestions.

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have addressed all comments satisfactorily. I recommend the manuscript for acceptance in its current form.

Reviewer 4 Report

Comments and Suggestions for Authors

The authors successfully addressed my comments and suggestions. 

 

 

Comments on the Quality of English Language

minor
