Multi-HM: A Chinese Multimodal Dataset and Fusion Framework for Emotion Recognition in Human–Machine Dialogue Systems
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The research introduces Multi-HM, a multimodal Chinese dataset designed for emotion identification in human-machine interactions, hence improving AI's capacity to identify task-oriented emotions.
The paper needs more improvements to enhance its quality and presentation. Here are some suggested amendments.
- The motivation of the work should be clearly stated.
- What are the advancements and new results presented in this paper?
- Please comment on the novelty of the paper.
- Please update the introduction section with some updated relevant references.
- Please the novelty of the work should be spotted.
- The graphs need to be commented on and more clarifications are needed.
- Reference 1, Not all the names of the authors are mentioned.
- Proofreading is required.
Author Response
Comments 1: 「The motivation of the work should be clearly stated.」
Response 1: Thank you for your valuable comments. We agree with this comment. Therefore, we have significantly strengthened and clarified the motivation of this work at the beginning of the introduction. Specifically, we emphasized the limitations of existing emotion datasets, which primarily focus on human-to-human dialogue and fail to capture the unique emotional dynamics and contextual associations of human-computer dialogue, especially human-computer consultation scenarios; we highlighted the necessity of a dedicated emotion analysis dataset for human-computer consultation systems and its potential value for improving the emotional intelligence and user experience of human-computer interaction systems; and we clearly stated our objective of filling this gap in existing datasets with a dataset that better fits the emotion analysis needs of real-world human-computer interaction. We hope the revised introduction demonstrates the necessity and significance of this research more clearly.
Comments 2: 「What are the advancements and new results presented in this paper?」
Response 2: Thank you for your question. We present the advancements and new results as follows:
(1) We constructed Multi-HM, a high-quality multimodal dataset for human-machine dialogue emotion analysis. The dataset is large in scale, containing 2,000 professionally annotated human-computer consultation dialogue samples; rich in modalities, covering text, speech, and vision; and high in annotation quality, adopting a five-dimensional annotation framework that systematically integrates the three modalities and simulates real human-computer interaction workflows to encode pragmatic behavioral cues and task-critical emotional trajectories. It is explicitly designed for human-computer consultation systems and covers 10 major HCI domains, providing a valuable data resource for subsequent research.
(2) We proposed an effective emotion analysis framework for human-machine dialogue scenarios. The framework integrates multimodal information through cross-modal representation learning and fusion methods (MISA, MulT, Self-MM, DMD, and FDR-MSA), which we compare in experiments, and it effectively exploits structured constraints such as dialogue context and speaker information in human-computer interaction. Experimental results show that on the Multi-HM dataset, our framework combined with the Self-MM and FDR-MSA models achieves state-of-the-art performance, with better weighted F1 scores and accuracy than existing methods on both the binary and the seven-class classification task. It offers a new approach to emotion analysis in human-machine dialogue scenarios.
Comments 3: 「Please comment on the novelty of the paper.」
Response 3: Thank you for focusing on the novelty. We believe the novelty of this paper is mainly reflected in the following aspects:
We constructed Multi-HM, the first large-scale dataset dedicated to multimodal human-machine dialogue emotion analysis, filling the gap in Chinese human-machine dialogue emotion datasets, particularly for human-computer consultation systems, and providing an important benchmark for subsequent research. Multi-HM is explicitly designed for human-computer consultation systems and contains 2,000 high-quality, multimodal, five-dimensionally annotated dialogue samples covering 10 major HCI domains, making it superior to existing general-purpose emotion datasets in scale, quality, and domain specificity.
We proposed a systematic human-machine dialogue emotion analysis framework and conducted benchmark experiments and in-depth analysis on various cutting-edge multimodal fusion methods, providing valuable experimental results and methodological references for research in this field. We not only proposed the framework, but also systematically evaluated the performance of various representative fusion methods such as MISA, MulT, Self-MM, DMD, and FDR-MSA on the Multi-HM dataset, and through ablation experiments and visualization analysis, we deeply explored the impact of different modal information and fusion strategies on emotion recognition, providing a basis for subsequent model selection and optimization.
Experimental results show that our proposed framework combined with Self-MM and FDR-MSA models achieved significant performance improvements on the Multi-HM dataset, demonstrating the effectiveness and superiority of our method in human-machine dialogue emotion analysis tasks. In both binary and seven-class classification tasks, our framework achieved better weighted F1 scores and accuracy than existing baseline methods, which verifies the value of the Multi-HM dataset and also proves that our proposed framework can effectively cope with the challenges of emotion analysis in human-machine dialogue scenarios.
Comments 4: 「Please update the introduction section with some updated relevant references.」
Response 4: We agree with this comment. Therefore, we have added 5 recently published references closely related to the research topic of this paper to the introduction section. [The updated references can be found in the introduction section, pages 1-2]
Comments 5: 「Please the novelty of the work should be spotted.」
Response 5: Thank you for pointing this out. We agree with this comment. Therefore, after surveying other research in this field, we have further highlighted the novelty of our work in the introduction section.
Comments 6: 「The graphs need to be commented on and more clarifications are needed.」
Response 6: Thank you for your careful review. We agree with this comment. Therefore, we have comprehensively checked and revised all the graphs in the paper, adding more detailed comments and clarifications to the figure captions and to the main text where the figures are discussed.
Comments 7: 「Reference 1, Not all the names of the authors are mentioned.」
Response 7: Thank you for carefully checking the references. We agree with this comment. Therefore, we have completed the author names for reference [1] (Scarselli et al., 2008) and re-examined all references in the full text to ensure the format is standardized and the information is complete.
Comments 8: 「Proofreading is required.」
Response 8: We agree with this comment. Therefore, we have carefully proofread the entire text, focusing on grammar, spelling, punctuation, and sentence fluency. We paid special attention to the language in key sections such as the abstract, introduction, and conclusion, as well as to the accuracy of details such as figure captions and the reference list, to improve the overall language quality and precision of the paper.
Reviewer 2 Report
Comments and Suggestions for Authors
TECHNICAL ASPECT
~~~~~~~~~~~~~~~~
1. Figure captions should not be written as sentences and too descriptive, but precise and concise...For example, authors should not use "This figure presents..." (appears in Figure 1, Figure 2...) but only the essence of the figure content explanation.
2. Why table has caption as figure (example: Figure 5)?
3. Figure 7. consists of labels and numbers that are provided with too small font and therefore, hardly readable.
4. It is common practice to have a brief introduction into a visual representation of a result before the figure and after the figure there is discussion about the results presented in that figure. This approach should be applied in the whole manuscript, but particularly related to Figure 7 (no text after the figure 7 is provided, but only a compound text completely before the Figure 7).
CONTENT
~~~~~~~~
1. Title and contribution - it is quite unusual to have contribution emphasis on Dataset, prepared for some research. More often, in research papers there are some novel methods or case studies regarding application of some approaches and methods. Authors should reconsider the title and contribution - to put emphasis on proposing a method, instead of preparation of a dataset. In introduction (lines 83-87) authors propose a framework for human-computer dialogue scenarios, which integrates different methods, so this should be the focus in title, abstract and keywords.
2. Abstract -
2.1. the contribution is stated in lines 12-13 that "multi-HM-trained models achieve performance..." - this is not precise enough to say "they achieve performance" as if they achieve ANY performance...but performance should be compared to others so there could be stated it is BETTER performance THAN ...
2.2. The focus of this manuscript is not clear - in line 4 there is "domain specific", "context sensitive", "emotional dynamics", "behavioral patterns", in line 9 there are "10 major HCI domains", in line 14 there is "evolving emotional needs".
3. Keywords - not aligned with abstract. The focus of keywords is set too generally as emotion/sentiment recognition and analysis, while abstract have other key concepts emphasized. Abstract, Keywords and the whole manuscript should be aligned more precisely.
4. Citing the sources - in Introduction in lines 20-29 there is background knowledge about sentiment analysis briefly explained, but without any citation to support these statements. Generally, it is a common approach to have key statements supported by sources and to avoid having texts with non-supported statements.
5. Introduction -
5.1. In the first paragraph containing the intro to basic terms, there should be explanation to the term "multimodal". Currently it is briefly introduced in lines 28-29 putting it in the context of "verbal, paraverbal and nonverbal cues". Does multimodal mean different integration of different types of multimedia - text, video, images, audio?
5.2. Currently, there are background, short literature review, research gap, contribution, but the short intro to the rest of the paper is missing.
5.3. Introduction should have usual structure organized in paragraphs. Currently:
5.3.1. The short literature review starts in line 29 after the background knowledge text. These should be separated in two different paragraphs.
5.3.2. contributions starts in line 76 after the text related to research gap. These should be separated in different paragraphs.
5.4. All key terms should be briefly explained in the introduction or all background could be shifted into separate section after introduction section.
6. Related work
It is not a common practice to have existing datasets presented in related work. However, since the title of this manuscript is related to creating a dataset, then it could be suitable. Still, I think that the title is not precise enough, since the authors propose the framework, not just dataset. According to my opinion, having only dataset as a contribution could not be considered as a proper scientific contribution.
7. The use of terms
7.1. Some terms are confusing, such as "ecological validity" - why ecology here? is it sentiments/emotions and communication in focus!?
7.2. Some words are similar, but not explained why are they used as different, such as sentiment and emotion (particularly they are presented as almost the same at figure 5).
7.3. Sometimes words are used imprecisely and in non-relatable context. For example, this sentence brings non-relatable words together (line 228-229): "an annotated instance might be: { Human, Task-Oriented, Question, Negative, Confusion}." - this combination of words seem to be non-relatable, but it is an instance of a possible situation where human is task oriented, sets question...Authors should explain:
7.3.1. What annotated instance means should be explained as a term
7.3.2. Authors should pragmatically explain utilization of an instance with some examples in the human-machine interaction
7.3.3. If we talk about human-computer interaction, as announced by the manuscript title, then an instance should not consist only of a task oriented human, but also should include machine being somehow oriented or initiated by human etc.
8. Example
Figure 6. is said to be a concrete example of the proposed framework utilization, but in fact it is the framework representation with integration of different methods and approaches.
9. The proposed approach
The proposed framework is based on integration of different multimedia and methods in sentiment analysis within human-machine interaction. The integration itself should have been explained in 4.2. Multimodal Fusion, but it was not. There is confusing text that mentions "cross-modal" representation learning and fusion methods. This text within 4.2. is not clear and precise enough - how all these methods and multimedia are merged?
10. Experimental setup
Details about making dataset splits and selection of parameters are provided in 5.2. Evaluation Metrics. However, it is a common practice to:
10.1. Have dataset defined within the experimental setup section, not as separate section (currently provided under 3. Multi-HM Dataset - this should be shifted under Experimental Setup, subsection - DataSet preparation).
10.2. Dataset used in this manuscript is entitled Multi-HM and should consist of multiple multimedia from different sources (Large Language model GPT 4, real human actors etc.). It is common practice to have exact number and sources of instances in the dataset - how many video recordings, audio, texts...and from which sources. In the abstract it is said that 2000 professionally annotated dialogues were obtained, but later in the text the exact obtaining amount and sources for the Multi_HM dataset was not specified.
10.3. Usually, the proposed method (here it is the framework) should be put before the experiment and experimental setup section.
10.4. Evaluation Metrics should be provided in this manuscript, but not including dataset splits (they should be provided together with dataset preparation section). Evaluation metrics are related to the evaluation of the proposed framework performance. Currently, hyperparameter selection is provided, but it is not clear what these parameters are related with? Later, some "testset weighted F1 score" is mentioned in table 4 caption, but was not explained before the table with results, in the evaluation metrics subsection.
11. Conclusion
11.1. This manuscript announced in the abstract that will deal with 10 major HCI domains. In the manuscript text there was not mentions and elaborations of what are these "10 major HCI domains".
11.2. Authors announced in the abstract "mission-critical emotional trajectories", but the proposed framework and data set was not provided in this context.
1. Typing errors and words selection - Data Bulid should be "Data Build", but even better - "Data Preparation"
2. Capitalization - Titles and subtitles should be written with Title Case. For example, in line 126, there is "Multimodal Sentiment analysis methods", should be replaced with "Multimodal Sentiment Analysis Methods".
Author Response
Comments 1: [Figure captions should not be written as sentences and too descriptive, but precise and concise…For example, authors should not use “This figure presents…” (appears in Figure 1, Figure 2…) but only the essence of the figure content explanation.]
Response 1: [Thank you to the reviewer for the guidance on figure captions. We agree with this comment. Therefore, we have comprehensively revised all figure captions throughout the manuscript based on the reviewer’s feedback, replacing the original sentence-style, overly descriptive captions with concise and accurate titles. For example, we have revised “Figure 1. This figure presents an example of human-to-human dialogue from the MELD dataset” to “Figure 1. Example of human-to-human dialogue in the MELD dataset”. These changes can be found in the captions of all figures in the revised manuscript.] “[Figure 1. Example of human-to-human dialogue in the MELD dataset]”
Comments 2: [Why table has caption as figure (example: Figure 5)?]
Response 2: [Thank you to the reviewer for your insightful observation. We agree with this comment. The table caption was labeled “Figure 5” because the original intention was to design this table as a more visual and information-dense “infographic” rather than a traditional table. To present the complex information of the five-dimensional annotation framework more effectively, we used various visual elements such as shapes, fonts, and colors for differentiation and emphasis, making it resemble a colorful visualization. This was intended to allow readers to understand the annotation system more intuitively.]
Comments 3: [Figure 7. consists of labels and numbers that are provided with too small font and therefore, hardly readable.]
Response 3: [Thank you to the reviewer for your meticulous review and for pointing out the readability issue in Figure 7. We agree with this comment. Therefore, based on your suggestion, we have set all labels and numbers in Figure 7 in a larger font size to improve readability. This change can be found in Figure 7 of the revised manuscript.] “[Please see revised Figure 7 in the manuscript]”
Comments 4: [It is common practice to have a brief introduction into a visual representation of a result before the figure and after the figure there is discussion about the results presented in that figure. This approach should be applied in the whole manuscript, but particularly related to Figure 7 (no text after the figure 7 is provided, but only a compound text completely before the Figure 7).]
Response 4: [Thank you to the reviewer for your guidance on the presentation standards of figures and tables. We agree with this comment and fully understand the academic writing convention of “introduction before figures/tables, discussion after figures/tables” that you have pointed out. Therefore, based on your suggestion, we have systematically adjusted the presentation of figures and tables throughout the entire manuscript to ensure that each figure and table follows the structure of “briefly introducing the figure/table content and purpose before the figure/table, and providing a detailed discussion of the results and analysis presented in the figure/table after the figure/table.” This revision can be found throughout the manuscript, particularly around all figures and tables, including Figure 7.] “[Please see the revised manuscript for the adjusted text around all figures and tables, including Figure 7]”
----
Comments 1: [Title and contribution - it is quite unusual to have contribution emphasis on Dataset, prepared for some research. More often, in research papers there are some novel methods or case studies regarding application of some approaches and methods. Authors should reconsider the title and contribution - to put emphasis on proposing a method, instead of preparation of a dataset. In introduction (lines 83-87) authors propose a framework for human-computer dialogue scenarios, which integrates different methods, so this should be the focus in title, abstract and keywords.]
Response 1: [Thank you to the reviewer for raising this valuable point again. We agree with this comment and fully understand the reviewer’s suggestion regarding the need for a more concise and clear explanation of the dataset’s rationale and necessity in the introduction. Therefore, based on your advice, we have made significant structural adjustments to the introduction. We have moved the discussion of the dataset’s necessity, which was originally dispersed in the latter part of the introduction, to the earlier paragraphs of the introduction. This reorganization and clarification can be found in the revised introduction section of the manuscript.] “[Please see the revised Introduction section in the manuscript]”
Comments 2.1: [Abstract - 2.1. the contribution is stated in lines 12-13 that “multi-HM-trained models achieve performance…” - this is not precise enough to say “they achieve performance” as if they achieve ANY performance…but performance should be compared to others so there could be stated it is BETTER performance THAN …]
Response 2.1: [Thank you to the reviewer for your meticulous language editing. We agree with this comment. Based on your feedback, we have revised the sentence to be semantically more complete and grammatically more standard. This revision is reflected in the Abstract, lines 12-13 of the revised manuscript.] “[Please see the revised Abstract, lines 12-13 in the manuscript]”
Comments 2.2: [Abstract - 2.2. The focus of this manuscript is not clear - in line 4 there is “domain specific”, “context sensitive”, “emotional dynamics”, “behavioral patterns”, in line 9 there are “10 major HCI domains”, in line 14 there is “evolving emotional needs”.]
Response 2.2: [We have re-examined and restructured these keywords within the manuscript to provide greater clarity and focus. These revisions are present throughout the Abstract and Introduction sections of the revised manuscript.] “[Please see the revised Abstract and Introduction sections in the manuscript for the clarified use of keywords]”
Comments 3: [Keywords - not aligned with abstract. The focus of keywords is set too generally as emotion/sentiment recognition and analysis, while abstract have other key concepts emphasized. Abstract, Keywords and the whole manuscript should be aligned more precisely.]
Response 3: [Thank you for pointing this out. We agree with this comment. Therefore, we have re-examined and revised the keywords of the manuscript to ensure they are more closely aligned with the abstract and the core content of the paper. The revised keywords are now more specific and focused. The updated keywords can be found in the Keywords section of the revised manuscript.] “[Please see the revised Keywords section in the manuscript]”
Comments 4: [Citing the sources - in Introduction in lines 20-29 there is background knowledge about sentiment analysis briefly explained, but without any citation to support these statements. Generally, it is a common approach to have key statements supported by sources and to avoid having texts with non-supported statements.]
Response 4: [Thank you to the reviewer for your rigorous academic standards and requirements. We agree with this comment. Based on the reviewer’s feedback, we have added relevant references to the descriptions of background knowledge about sentiment analysis in the introduction section. These additions can be found in the Introduction section, lines 20-29 of the revised manuscript.] “[Please see the revised Introduction section, lines 20-29 in the manuscript for the added citations.]”
Comments 5: [Introduction - 5.1. In the first paragraph containing the intro to basic terms, there should be explanation to the term “multimodal”. Currently it is briefly introduced in lines 28-29 putting it in the context of “verbal, paraverbal and nonverbal cues”. Does multimodal mean different integration of different types of multimedia - text, video, images, audio?
5.2. Currently, there are background, short literature review, research gap, contribution, but the short intro to the rest of the paper is missing.
5.3. Introduction should have usual structure organized in paragraphs. Currently:
5.3.1. The short literature review starts in line 29 after the background knowledge text. These should be separated in two different paragraphs.
5.3.2. contributions starts in line 76 after the text related to research gap. These should be separated in different paragraphs.
5.4. All key terms should be briefly explained in the introduction or all background could be shifted into separate section after introduction section.]
Response 5: [Thank you to the reviewer for your detailed and constructive feedback on the Introduction section. We agree with all of these comments and have revised the introduction accordingly to address each point:
Regarding comment 5.1 (Multimodal Definition): We agree that the term “multimodal” needed further clarification. Therefore, we have expanded the explanation of “multimodal” in the introduction to explicitly state that it refers to the integration of different types of multimedia such as text, video, images, and audio. This expanded clarification can be found in the Introduction section, lines 28-29 of the revised manuscript.
Regarding comment 5.2 (Missing Paper Structure Overview): We acknowledge the absence of a roadmap for the paper’s structure in the original introduction. To address this, we have added a brief introductory overview outlining the structure of the paper at the end of the introduction section. This new overview provides a clear guide to the subsequent sections of the manuscript and can be found at the end of the revised Introduction section.
Regarding comment 5.3.1 (Paragraph Separation - Literature Review): We agree that separating the background knowledge and literature review into distinct paragraphs improves readability. We have separated the background knowledge text and the short literature review into distinct paragraphs within the Introduction section. This paragraph separation is implemented in the revised Introduction section, starting from line 29.
Regarding comment 5.3.2 (Paragraph Separation - Contributions): We concur with the reviewer’s suggestion to separate research gap and contributions. We have separated the text discussing the research gap and the contributions into different paragraphs within the Introduction section to enhance structural organization. This paragraph separation can be found in the revised Introduction section, starting from line 76.
Regarding comment 5.4 (Explanation of Key Terms): We agree that key terms should be clearly defined. We have opted to briefly explain key terms directly within the introduction as they are first introduced, providing immediate context for readers. These key term explanations are now integrated throughout the revised Introduction section.
These revisions to the Introduction section can be found throughout the revised manuscript as described above.] “[Please see the revised Introduction section of the manuscript for all the described changes.]”
Comments 6: [Related work - It is not a common practice to have existing datasets presented in related work. However, since the title of this manuscript is related to creating a dataset, then it could be suitable. Still, I think that the title is not precise enough, since the authors propose the framework, not just dataset. According to my opinion, having only dataset as a contribution could not be considered as a proper scientific contribution.]
Response 6: [Thank you to the reviewer for this comment. We understand your point regarding the presentation of existing datasets within the related work section. We agree that while it is not conventional practice, it is justifiable in this instance given the dataset-centric focus of our manuscript as currently framed. We acknowledge the reviewer’s broader concern that focusing solely on the dataset as a contribution may not fully represent the scientific contribution of our work, and we appreciate this valuable feedback which aligns with your previous comments regarding the title and contribution emphasis. We are considering these points carefully and will address them in our overall revisions to better reflect the framework and methodology proposed in our manuscript.] “[We are currently revising the manuscript to better emphasize the framework and methodology. Please see the revised manuscript for these changes.]”
Comments 7:
[The use of terms - 7.1. Some terms are confusing, such as “ecological validity” - why ecology here? is it sentiments/emotions and communication in focus!?
7.2. Some words are similar, but not explained why are they used as different, such as sentiment and emotion (particularly they are presented as almost the same at figure 5).
7.3. Sometimes words are used imprecisely and in non-relatable context. For example, this sentence brings non-relatable words together (line 228-229): “an annotated instance might be: { Human, Task-Oriented, Question, Negative, Confusion}.” - this combination of words seem to be non-relatable, but it is an instance of a possible situation where human is task oriented, sets question…Authors should explain:
7.3.1. What annotated instance means should be explained as a term
7.3.2. Authors should pragmatically explain utilization of an instance with some examples in the human-machine interaction
7.3.3. If we talk about human-computer interaction, as announced by the manuscript title, then an instance should not consist only of a task oriented human, but also should include machine being somehow oriented or initiated by human etc.]
Response 7: [Thank you to the reviewer for your thorough and insightful feedback regarding the use of terminology in our manuscript.
The primary focus of our paper is the problem of dialogue sentiment analysis in human-computer interaction (HCI) scenarios; the machine dialogue system in our work specifically refers to a machine that is directed or initiated by a human. Furthermore, we have systematically reorganized and redefined the terminology and key concepts throughout the manuscript to ensure precision and clarity.] “[Please see the revised manuscript, particularly in the sections mentioned above, for all the terminology clarifications.]”
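To make the five-dimensional annotation scheme concrete, the sketch below shows one possible way an annotated instance could be represented in code. The field names are illustrative shorthand, not the dataset's exact schema; the values are taken from the manuscript's example {Human, Task-Oriented, Question, Negative, Confusion} (lines 228-229):

```python
from dataclasses import dataclass

@dataclass
class AnnotatedInstance:
    """One utterance-level label under the five-dimensional framework
    (field names are hypothetical, for illustration only)."""
    speaker: str       # who produced the utterance: "Human" or "Machine"
    orientation: str   # interaction mode, e.g. "Task-Oriented"
    dialogue_act: str  # pragmatic act, e.g. "Question", "Inform"
    sentiment: str     # coarse polarity: "Positive" / "Neutral" / "Negative"
    emotion: str       # fine-grained emotion, e.g. "Confusion"

# The example instance from the manuscript:
example = AnnotatedInstance("Human", "Task-Oriented", "Question", "Negative", "Confusion")
print(example)
```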
Comments 8: [Example - Figure 6. is said to be a concrete example of the proposed framework utilization, but in fact it is the framework representation with integration of different methods and approaches.]
Response 8: [Yes, indeed it is a framework integrating different multimodal fusion methods.]
Comments 9: [The proposed approach - The proposed framework is based on integration of different multimedia and methods in sentiment analysis within human-machine interaction. The integration itself should have been explained in 4.2. Multimodal Fusion, but it was not. There is confusing text that mentions “cross-modal” representation learning and fusion methods. This text within 4.2. is not clear and precise enough - how all these methods and multimedia are merged?]
Response 9: [In the framework, we adopted mainstream multimodal fusion methods from multimodal sentiment analysis (MISA, MulT, Self-MM, DMD, and FDR-MSA) to perform the multimodal fusion step; each method merges the unimodal text, audio, and visual representations according to its own cross-modal representation learning and fusion mechanism.]
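As a rough illustration of what merging the modalities means at the simplest level (the adopted methods such as MISA, MulT, Self-MM, DMD, and FDR-MSA use far more elaborate cross-modal mechanisms), a minimal concatenation-based fusion head in PyTorch could look like the following; all feature dimensions are assumptions made for the sketch:

```python
import torch
import torch.nn as nn

class SimpleFusionHead(nn.Module):
    """Project each unimodal feature vector to a shared size, concatenate,
    and classify. Feature dimensions below are illustrative assumptions."""
    def __init__(self, d_text=768, d_audio=74, d_visual=35, d_hidden=128, n_classes=7):
        super().__init__()
        self.proj_t = nn.Linear(d_text, d_hidden)
        self.proj_a = nn.Linear(d_audio, d_hidden)
        self.proj_v = nn.Linear(d_visual, d_hidden)
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(3 * d_hidden, n_classes))

    def forward(self, x_t, x_a, x_v):
        fused = torch.cat([self.proj_t(x_t), self.proj_a(x_a), self.proj_v(x_v)], dim=-1)
        return self.classifier(fused)  # logits for the 7 emotion classes

# Dummy batch of 4 utterances:
head = SimpleFusionHead()
logits = head(torch.randn(4, 768), torch.randn(4, 74), torch.randn(4, 35))
print(logits.shape)  # torch.Size([4, 7])
```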
Comments 10: [Experimental setup - Details about making dataset splits and selection of parameters are provided in 5.2. Evaluation Metrics. However, it is a common practice to:
10.1. Have dataset defined within the experimental setup section, not as separate section (currently provided under 3. Multi-HM Dataset - this should be shifted under Experimental Setup, subsection - DataSet preparation).
10.2. Dataset used in this manuscript is entitled Multi-HM and should consist of multiple multimedia from different sources (Large Language model GPT 4, real human actors etc.). It is common practice to have exact number and sources of instances in the dataset - how many video recordings, audio, texts…and from which sources. In the abstract it is said that 2000 professionally annotated dialogues were obtained, but later in the text the exact obtaining amount and sources for the Multi_HM dataset was not specified.
10.3. Usually, the proposed method (here it is the framework) should be put before the experiment and experimental setup section.
10.4. Evaluation Metrics should be provided in this manuscript, but not including dataset splits (they should be provided together with dataset preparation section). Evaluation metrics are related to the evaluation of the proposed framework performance. Currently, hyperparameter selection is provided, but it is not clear what these parameters are related with? Later, some “testset weighted F1 score” is mentioned in table 4 caption, but was not explained before the table with results, in the evaluation metrics subsection.]
Response 10: [The exact number of entries in the Multi-HM dataset is reported in the tables later in the paper, and we have introduced the details and criteria of the data splits as well as the evaluation metrics used for the proposed framework.]
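For clarity on the metric itself: the test-set weighted F1 score mentioned in the Table 4 caption averages per-class F1 weighted by class support. A minimal sketch with scikit-learn, using dummy labels purely for illustration:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 2, 1, 0, 2]  # gold emotion labels on the test set
y_pred = [0, 1, 2, 1, 1, 0, 2]  # model predictions

acc = accuracy_score(y_true, y_pred)
# Weighted F1: per-class F1 scores averaged with weights proportional
# to each class's support, so frequent classes count more.
wf1 = f1_score(y_true, y_pred, average="weighted")
print(f"Acc = {acc:.3f}, Weighted F1 = {wf1:.3f}")
```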
Comments 11: [Conclusion - 11.1. This manuscript announced in the abstract that will deal with 10 major HCI domains. In the manuscript text there was not mentions and elaborations of what are these “10 major HCI domains”.
11.2. Authors announced in the abstract “mission-critical emotional trajectories”, but the proposed framework and data set was not provided in this context.]
Response 11: [The HCI domains include inquiry, information modification, instruction execution, etc.; we have supplemented this point in the revised manuscript. We also agree that the concept of “mission-critical emotional trajectories” would require a more complete theoretical framework and data support; on reflection, that term was not appropriate for what we proposed, which is better described as task-driven dialogue sentiment analysis, and we have corrected this statement in the revised manuscript.] “[Clarifications and corrections regarding HCI domains and terminology have been added to the revised manuscript]”
Reviewer 3 Report
Comments and Suggestions for Authors
The authors present Multi-HM, a multimodal emotion recognition dataset explicitly designed for human-machine (HM) dialogue systems, containing 2,000 professionally annotated dialogues from 10 major HCI domains. It integrates textual, vocal, and visual modalities. The dataset is interesting and relevant. I have some comments to improve the quality of the manuscript.
Comments:
- The rationale and need for the dataset should be more succinctly explained in the first few paragraphs of the Introduction section. Explain up front what is specifically lacking in current state-of-the-art datasets. The following paragraphs could be used to expand on the arguments. As written, the introduction introduces many concepts before explaining why this dataset is needed.
- Incomplete sentence: “multi-HM-trained models achieve performance in recognizing…”. It should be “multi-HM-trained models achieve higher performance in recognizing”, or a similar phrase.
- Remove redundant abbreviation definitions. Some abbreviations are defined multiple times (e.g. HHC).
- Clarify the number of participants and the number of dialogs they participated in. It is not clear whether each participant had only one dialog or how many dialogues each participant had.
- Clarify how the data was split, specifically whether it was by dialog or by subject, and address the potential for data leakage. Currently, it is not clear how the data were split into training/validation/test subsets, specifically whether such splitting was done by dialog or by subject. When dealing with models that aim to provide generalizable predictions (as opposed to individualized predictions), the splitting process must be done on a subject basis for validation to be meaningful. This is potentially the case in this study if you have multiple samples for each subject. See: doi: 10.1093/gigascience/gix019. Splitting the data without taking subjects into account leads to data leakage (data from a given subject is used to train the model then later used to estimate the subject's emotions) and limits the interpretation of the results.
- Reference missing in “EMT-DLFR[? ]”.
Author Response
Comments 1: [The rationale and need for the dataset should be more succinctly explained in the first few paragraphs of the Introduction section. Explain up front what is specifically lacking in current state-of-the-art datasets. The following paragraphs could be used to expand on the arguments. As written, the introduction introduces many concepts before explaining why this dataset is needed.]
Response 1: [We have explained the rationale and necessity of the dataset more succinctly in the first few paragraphs of the introduction. The key elements lacking in current state-of-the-art datasets are now clearly pointed out at the beginning, and the subsequent paragraphs expand on the relevant arguments, enabling readers to understand the innovation of this dataset more clearly. Thank you for your suggestion; we have comprehensively optimized the introduction section.] “[Please see the revised Introduction section of the manuscript for the updated rationale and necessity of the dataset]”
Comments 2: [Incomplete sentence: “multi-HM-trained models achieve performance in recognizing…”. It should be “multi-HM-trained models achieve higher performance in recognizing”, or a similar phrase.]
Response 2: [We have modified this sentence and similar expressions in the text to ensure that all sentences are semantically complete, clear, and accurately convey what we want to express.] “[These modifications can be found throughout the revised manuscript]”
Comments [3]: [Remove redundant abbreviation definitions. Some abbreviations are defined multiple times (e.g. HHC).]
Response [3]: [Thank you for pointing this out. We agree with this comment. Therefore, we have carefully checked the full text and removed all redundant abbreviation definitions, ensuring that each abbreviation is defined only once, avoiding redundancy and improving the readability of the text.] “[These changes can be found throughout the revised manuscript, wherever abbreviations were previously redundantly defined.]”
Comments [4]: [Clarify how the data was split, specifically whether it was by dialog or by subject, and address the potential for data leakage. Currently, it is not clear how the data were split into training/validation/test subsets, specifically whether such splitting was done by dialog or by subject. When dealing with models that aim to provide generalizable predictions (as opposed to individualized predictions), the splitting process must be done on a subject basis for validation to be meaningful. This is potentially the case in this study if you have multiple samples for each subject. See: doi: 10.1093/gigascience/gix019. Splitting the data without taking subjects into account leads to data leakage (data from a given subject is used to train the model then later used to estimate the subject’s emotions) and limits the interpretation of the results.]
Response [4]: [We are deeply grateful to you for pointing out this crucial issue. In constructing emotion analysis models aimed at generalizable predictions (rather than individualized predictions), a subject-based splitting strategy is indeed essential to ensure the effectiveness of model validation. As highlighted by the reference you provided (doi:10.1093/gigascience/gix019), when each subject contains multiple samples, failing to consider subject information during data partitioning will lead to a significant data leakage problem. Specifically, data from the same subject may inadvertently appear in both the training and validation (or test) sets.
This data leakage can cause the following critical problems:
- Overestimated model performance: The model learns feature patterns specific to certain subjects during training. Consequently, when validating or testing data from the same subjects, it can exhibit high accuracy even without truly understanding the essence of emotion. This prevents the evaluation results from genuinely reflecting the model’s generalization ability on unseen subjects.
- Limited interpretability of results: As the model may merely memorize the emotional tendencies of specific subjects rather than learning universal emotion expression patterns, it becomes difficult to understand the true basis upon which the model makes predictions. This hinders in-depth analysis and improvement of model behavior.
We fully agree with your point: although the model might learn features associated with specific subjects during training, in real-world application scenarios the model typically cannot know the subject information when predicting the emotions of new samples. Therefore, to more scientifically evaluate the model’s generalization ability in real-world settings, it is imperative to simulate this unknown-subject prediction environment by constructing independent training and validation (or test) sets through subject-based splitting. This approach can more accurately measure whether the model has truly mastered cross-subject emotion recognition ability, thereby providing a more reliable basis for the model’s actual deployment.]
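As an illustrative sketch of the subject-based splitting strategy discussed above (array names and sizes are assumptions, not our actual pipeline), scikit-learn's GroupShuffleSplit keeps every sample of a given subject on one side of the split, which rules out the subject-level leakage described by the reviewer:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 16))        # utterance-level features
y = rng.integers(0, 7, size=100)          # 7-class emotion labels
subjects = rng.integers(0, 20, size=100)  # participant ID per utterance

# Group-aware split: all utterances of a subject go entirely to train or test.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=subjects))

# No subject appears on both sides, so no subject-level leakage.
assert not set(subjects[train_idx]) & set(subjects[test_idx])
```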
Comments [5]: [Reference missing in “EMT-DLFR[? ]”.]
Response [5]: [We have corrected the missing reference for “EMT-DLFR[? ]” and carefully checked the full text to ensure that the references elsewhere are also accurate.] “[The corrected reference can be found in the revised manuscript where “EMT-DLFR” is mentioned, and all references have been checked throughout the manuscript]”
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
I would like to thank the authors for addressing my comments and suggestions. Before publication, please correct minor issues in the manuscript, such as "Work investigates the recognition", should be "This work...".
Author Response
Comments 1: I would like to thank the authors for addressing my comments and suggestions. Before publication, please correct minor issues in the manuscript, such as “Work investigates the recognition”, should be “This work…”.
Response 1: Thank you for pointing this out. We agree with this comment. Therefore, we have corrected the phrase “Work investigates the recognition” to “This work investigates the recognition”. This change can be found on [Page 5, Line 163].