Subjective Evaluation of Generative AI-Driven Dialogues in Paired Dyadic and Topic-Sharing Triadic Interaction Structures
Round 1
Reviewer 1 Report (Previous Reviewer 1)
Comments and Suggestions for Authors
Thank you for the updates to the manuscript. The revisions improve clarity, and I appreciate the effort.
Regarding Comment 3, the five levels of emotional expression in Figure 2 are valuable for feedback analysis. However, they are not fully integrated into the final conclusions. Strengthening this integration with mathematical metrics or visual charts would enhance the study’s impact and provide clearer quantitative insights.
Author Response
Dear reviewer,
Thank you very much once again for taking the time to review our manuscript. We appreciate this valuable comment. Please find the detailed responses below and the corresponding revisions highlighted in the re-submitted files.
Comment: Regarding Comment 3, the five levels of emotional expression in Figure 2 are valuable for feedback analysis. However, they are not fully integrated into the final conclusions. Strengthening this integration with mathematical metrics or visual charts would enhance the study’s impact and provide clearer quantitative insights.
Response: Thank you for pointing this out. We have added the numerical results for the five levels of emotional expression at the end of section 3.3 and revised the relevant text in the discussion section as follows:
(page 10, line 310) “The facial expression changed 18.0 ± 4.4 times during the three dialogues for each subject in the experiment, and the number of displays for each expression level was as follows, from the most negative to the most positive: 0, 2.5 ± 1.7, 4.2 ± 1.9, 7.1 ± 2.1, and 4.2 ± 1.4 times.”
(page 12, line 395) “In addition, the emotional expression function was active with mainly positive expressions during the dialogues. Although the effect in non-emotional dialogues was merely that the system was perceived as natural in this study, handling emotions is important in terms of addressing individuals. Emotional expression functions should be explored with affective computing technologies, including human emotion recognition and sentiment analysis, which aim to identify and express emotions and respond intelligently to human emotions [29].”
Sincerely,
Reviewer 2 Report (Previous Reviewer 2)
Comments and Suggestions for Authors
Using the word "subjective" connotes "not systematic," which does not suggest a good research project. The abstract is confusing as it contradicts failure and success; perhaps not giving so much detail would help.
The introduction should be brief, stating the main issue and context, and what your objectives are. Most of the intro should be in a separate literature review following the intro.
Your intro should not tell the results: the ending. The general aim would be in the intro, and the finalized research questions would be at the end of the lit review.
You should also explain dyadic and triadic interaction structures in the lit review, not in the methodology.
Your methodology should start with stating how you are approaching the issue (having participants interact with a GAI conversation), who your participants are and how you chose them, then how you will measure the interaction, and THEN explaining the tool and how you chose it. Then how it works.
All your subjects are adults, which does not align with your aim to see how interactions are related to child development. You should have focused your study on ADULT interaction behaviors, not children.
Author Response
Dear reviewer,
Thank you very much once again for taking the time to review our manuscript. We appreciate these valuable comments. Please find the detailed responses below and the corresponding revisions highlighted in the re-submitted files.
Comments: Using the word "subjective" connotes "not systematic," which does not suggest a good research project. The abstract is confusing as it contradicts failure and success; perhaps not giving so much detail would help.
Response: Thank you for pointing this out. We have added the following text in the conclusion section:
(page 13, line 416) “In the examination, it plans to consider emotions as an important factor and include quantitative data from temporal and biometric measurements to enhance validity.”
Additionally, we have corrected the text in the abstract as follows:
(page 1, line 19) “The system’s inappropriate behavior under failed structures reduced the quality of the dialogues and worsened the evaluation of the system.”
(page 1, line 24) “By switching interaction structures to adapt to users’ demands, system behavior becomes more appropriate for users.”
Comments: The introduction should be brief, stating the main issue and context, and what your objectives are. Most of the intro should be in a separate literature review following the intro. Your intro should not tell the results: the ending. The general aim would be in the intro, and the finalized research questions would be at the end of the lit review.
Response: Thank you for pointing this out. We have added the following text to indicate the general aim and separated the next paragraph as a literature review:
(page 1, line 35) “Dialogue systems are expected to build relationships with humans as partners.”
We have added the text to enhance the coherence between paragraphs regarding the literature review as follows:
(page 2, line 64) “Language is a way to influence human behavior in social interaction.”
We have corrected the text to indicate our aim at the end of the literature review as follows:
(page 2, line 86) “This study aims to clarify how systems interact socially with humans in order to improve relationships.”
We have reorganized the text to briefly mention the conclusions as follows:
(page 3, line 94) “The experiment was conducted under the hypothesis that the progressive development of interaction structures in the system settings would improve interaction and provide favorable dialogues. The hypothesis was partially confirmed; the system received positive evaluations on well-constructed interaction structures. The structure in the system setting needed to match the structure that the subject was oriented toward.”
Comments: You should also explain dyadic and triadic interaction structures in the lit review, not in the methodology.
Response: Thank you for pointing this out. We have added the following text to explain:
(page 2, line 89) “The settings follow the change of interaction structures from dyadic to triadic in developmental stages of children regarding social communication, as shown in the field of developmental science [20].”
Comments: Your methodology should start with stating how you are approaching the issue (having participants interact with a GAI conversation), who your participants are and how you chose them, then how you will measure the interaction, and THEN explaining the tool and how you chose it. Then how it works.
Response: Thank you for pointing this out. We would like to explain that our issue was mainly about the system itself, so the methodology starts with an explanation of the system. The experiment was designed to investigate how the system interacts with humans. The subjects were healthy adults with normal communication ability. We have added the following text to provide more information regarding the selection of the subjects:
(page 7, line 216) “It was confirmed in advance that they do not feel specific difficulty with communication.”
Comments: All your subjects are adults, which does not align with your aim to see how interactions are related to child development. You should have focused your study on ADULT interaction behaviors, not children.
Response: Thank you for pointing this out. As you commented, our study examined interactions between the adult subjects and the dialogue system in developing interaction settings. We have added the text to explain the system role as follows:
(page 4, line 146) “The dialogue system interacted in the role of the Child shown in Figure 3.”
Sincerely,
Reviewer 3 Report (Previous Reviewer 3)
Comments and Suggestions for Authors
The authors have improved the most important parts of the paper to increase its quality. I summarize the improvements as follows.
Comment 1: The authors have improved the state-of-the-art section by integrating recent works.
Comment 2: The authors have not added the suggested images. However, other images have been improved.
Comment 3: The authors have not improved the study by increasing the sample size of the sample data. However, the discussion explains this paper's limitations.
Comment 5: The authors have improved participant details in Section 2.6.
Comment 6: The authors have not improved figure quality by redesigning all images. This is suggested, but not mandatory.
Author Response
Dear reviewer,
Thank you very much once again for taking the time to review our manuscript. We appreciate your comments on the improvements made in the revised version of our paper.
Sincerely,
Reviewer 4 Report (New Reviewer)
Comments and Suggestions for Authors
Please see the attachment for details.
Comments for author File: Comments.pdf
Author Response
Dear reviewer,
Thank you very much once again for taking the time to review our manuscript. We appreciate these valuable comments. Please find the detailed responses below and the corresponding revisions highlighted in the re-submitted files.
Comments 1: The concept of “affective computing” in the original text is not explained in the translation. It would be helpful to include a brief definition (e.g., “Affective computing refers to methods of identifying and responding to users’ emotional states through computational techniques”) to aid reader comprehension.
Response 1: Thank you for pointing this out. We have added the explanation and revised the relevant text as follows:
(page 12, line 398) “Emotional expression functions should be explored with affective computing technologies, including human emotion recognition and sentiment analysis, which aim to identify and express emotions and respond intelligently to human emotions [29].”
Comments 2: When referencing “Helsinki Declaration” or “Kobe University’s Graduate School of System Informatics research ethics,” it would be helpful to add a brief contextual explanation. This would assist readers unfamiliar with these references to understand their relevance.
Response 2: Thank you for pointing this out. We have added the following explanation:
(page 7, line 219) “The experiment involving human participants was conducted with careful consideration in accordance with the Research Ethics at Kobe University.”
Comments 3: Sample Size Insufficiency: The manuscript states that the experiment involved only 13 subjects. This small sample size limits the statistical power and generalizability of the findings. It is recommended that future studies involve a larger, more diverse participant pool in terms of age, gender, and background to improve reliability and robustness.
Response 3: Thank you for pointing this out. The study is at the beginning stage of research on relationships between humans and emerging human-like systems. As you commented, we would like to expand the survey and accumulate more results. We have added the following text in the conclusion section:
(page 13, line 418) “On the other hand, interactive systems targeting specific users, such as children and the elderly, are needed and should be researched [33,34]. Results different from those in this study could emerge for specific groups, and research concerning these aspects is also required.”
Comments 4: Refinement of Dialogue Quality Metrics: The current dialogue evaluation metrics— “fun,” “it listens,” and “smooth conversation”—are rather broad. Consider introducing more specific evaluation criteria, such as clarity of the system’s verbal output, response time, and contextual accuracy. This would provide a more comprehensive assessment of dialogue quality.
Response 4: Thank you for pointing this out. We have added the following text about future work in the conclusion section:
(page 13, line 416) “In the examination, it plans to consider emotions as an important factor and include quantitative data from temporal and biometric measurements to enhance validity.”
Comments 5: Control of Experimental Order: While the manuscript mentions that the order of experimental conditions was randomized, it does not specify the exact randomization method. Clearly explaining how the sequence was balanced would ensure that order effects do not significantly skew the results. Additionally, including a control group (e.g., a fixed interaction setting sequence) could help further validate the impact of order on outcomes.
Response 5: Thank you for pointing this out. We have added the following explanation:
(page 7, line 233) “The six possible orders for the three settings were assigned to the subjects in the order of their participation in the experiment.”
Comments 6: Standardization of Emotion Assessment: Happiness and sadness scores were generated by a generative AI model. However, no validation was provided to confirm their accuracy. Introducing an independent verification method—such as human evaluators rating the emotional content of some dialogues and comparing these ratings to the AI-generated scores—could enhance the reliability of the emotional assessment.
Response 6: Thank you for pointing this out. We have added the numerical results for the emotional expression function to explain the operation at the end of Section 3.3, revised the relevant text in the discussion section, and added the following text (the same as in Response 4) in the conclusion section as follows:
(page 10, line 310) “The facial expression changed 18.0 ± 4.4 times during the three dialogues for each subject in the experiment, and the number of displays for each expression level was as follows, from the most negative to the most positive: 0, 2.5 ± 1.7, 4.2 ± 1.9, 7.1 ± 2.1, and 4.2 ± 1.4 times.”
(page 12, line 395) “In addition, the emotional expression function was active with mainly positive expressions during the dialogues. Although the effect in non-emotional dialogues was merely that the system was perceived as natural in this study, handling emotions is important in terms of addressing individuals.”
(page 13, line 416) “In the examination, it plans to consider emotions as an important factor and include quantitative data from temporal and biometric measurements to enhance validity.”
Sincerely,
Round 2
Reviewer 2 Report (Previous Reviewer 2)
Comments and Suggestions for Authors
l. 64: The added sentence doesn't add to the meaning.
l. 86: Are the relationships between humans or between humans and systems?
l. 95: you are not supposed to tell the conclusion in the introduction.
2. Methods: Start this section with your overall strategy and the basis for it.
You have no lit review on child development and your participants are all adults.
l. 200: What was the research basis for your questionnaire? Was it pilot-tested and verified?
I see few substantive changes.
Author Response
Dear reviewer,
Thank you very much once again for taking the time to review our manuscript. We appreciate these valuable comments. Please find the detailed responses below and the corresponding revisions highlighted in the re-submitted files.
Comments: l. 64: added sentence doesn't add to meaning.
Response: Thank you for your comment. We would like to explain that this sentence is intended to improve the flow from the previous paragraph into the one explaining social interaction.
Comments: l. 86: Are the relationships between humans or between humans and systems?
Response: Thank you for pointing this out. We have added the text:
(page 2, line 90) “between humans and systems”
Comments: l. 95: you are not supposed to tell the conclusion in the introduction.
Response: Thank you for your comment. Since a brief mention of conclusions is recommended in the template for this manuscript, we would like to include the conclusion in the introduction section this time.
Comments: 2.Methods: Start this section with your overall strategy and the basis for it.
Response: Thank you for your comment. We would like to explain our manuscript structure. Our overall strategy and the basis are presented in the introduction section. The basis is in the literature review, and the overall strategy based on it is in the paragraph near the methods section. Then, the detailed methodology is described in the methods section. The methods section focuses on providing concrete descriptions that enable others to replicate and build on the published results.
Comments: You have no lit review on child development and your participants are all adults.
Response: Thank you for your comment. We would like to explain that previous research on child development is cited as Reference 20, and previous research on developmental robotics is cited as Reference 21 in the literature review of the introduction. We have added the following explanation based on this research:
(page 2, line 82) “Applying human-like social interaction behaviors in conversation to dialogue systems is expected to help build trustworthy relationships with humans. Particularly, the essential social behaviors, which are observed during the developmental stages of children, would be necessary for systems.”
In addition, our study examined interactions between humans and the dialogue system. The subjects were required to have normal communication ability as healthy adults, and they were confirmed to meet the requirement.
Comments: l. 200: What was the research basis for your questionnaire? Was it pilot-tested and verified?
Response: Thank you for your comment. We would like to explain that previous research on the attractiveness factors of a friend, cited as Reference 27, and previous research on the impression of the communication robot, cited as Reference 28, were the basis of our questionnaire. The development is described in Section 2.5, Questionnaire. We conducted pilot experiments to confirm that the interaction and the evaluation using the questionnaire pose no difficulty for adults.
The study is at the beginning stage of research on relationships between humans and emerging human-like systems. The verification of the evaluation items is also an ongoing issue in our studies. We consider that the evaluation results in this study from the questionnaire are reliable for discussing the interaction between humans and the system. If you are interested, we would be pleased if you could take a look at our previous paper, cited as Reference 12, which discusses the questionnaire items in detail. Additionally, we plan to include quantitative data from temporal and biometric measurements to enhance validity in future work and mention it in the conclusion section.
Sincerely,
Reviewer 4 Report (New Reviewer)
Comments and Suggestions for Authors
Accept
Author Response
Dear reviewer,
Thank you very much for continuously taking the time to review our manuscript. We appreciate your valuable comments throughout the revision process.
Sincerely,
This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Evaluating generative AI dialogue systems, especially in complex triadic interactions, is a valuable research topic. Focusing on subjective human evaluation provides crucial insights for advancing conversational AI.
The study is done with only a laptop and simple tools, and this is very good.
However, please take these points into consideration:
Form:
1- Figures should be clear and readable (high quality images).
2- Figure 8 comes in the middle of the Section 3.3 text. Please put it at the end to give the reader the ability to read and understand before looking at the figure.
Content:
3- The five levels of emotional expression in Figure 2 are a good idea for feedback, but the expression results are not used in the final conclusions.
4- The conclusions in the Discussion section (lines 295, 296 and 297) are good conclusions but not based on a scientific process. Please explain how you reached those conclusions based on the results that you obtained.
(When a user wants to be listened to for relaxation, a system should behave in dyadic setting as a casual listener following casual changes of topics. By switching interaction structures to adapt to users’ demands, the behavior of systems becomes more appropriate for users.)
Not applicable.
Reviewer 2 Report
Comments and Suggestions for Authors
Your introduction should be shorter and to the point; much of your intro could be in a separate literature review section.
Define dyadic and triadic interaction the first time you use those terms.
Please explain how the facial expressions are generated (e.g., AI guess as to appropriate emotion? programmed in?). Are all the emotions shown congruent with the text? (e.g., might a sad story have a smiley face?)
Or is the expression shown just in response to the person's voice and/or wording? How accurate is the expression?
If emotions were not the topic, why did you choose to show interaction through facial emotions?
How do you and your subjects define a "good dialogue"? Wouldn't it make more sense to base your criteria on research rather than on 13 healthy adults? (and how were they selected?)
How do you get the subject to respond to the starting interaction?
What were the 3 types of partners? I would think the subjects would want to know what you mean.
If the interaction continues for 5 minutes, might the reaction of the subject person change during those 5 minutes?
Technically, with a Likert scale you shouldn't do averages; you can do medians.
It doesn't seem that the smiley face had any positive impact, especially if there are just 5 options. It would be interesting to see the difference between a Zoom interaction with and without faces. Or having a realistic avatar with AI instead of the smiley face.
In the triadic situation, did the subjects know each other? Did they see each other? Did they like each other? What if they were in different spaces but could interact via voice with each other and the machine?
How much did the 2 subjects talk with each other rather than to the machine? What was the balance in interacting (did 1 person do most of the talking? - or do more initiating and the other person more responding?)
People as well as machines can lack social skills. AI tools are learning social protocols; consider the type of robot that is used to teach kids with ASD how to interact. But it is sort of obvious that a person would react differently with an AI tool than a human. By the way, what about the Japanese males who prefer an AI girlfriend to a human one?
Your conclusion isn't solidly based just on your findings.
Your title is misleading. I see nothing on developmental stages of children -- or the use of children in the study.
Reviewer 3 Report
Comments and Suggestions for Authors
Dear authors and editors, I will use the MDPI reviewer's suggestions format to evaluate and review this research work, as can be observed below. Please, carefully respond to each of the following comments that can be found within each question.
- What is the main question addressed by the research?
The authors study how different interaction structures, such as dyadic and triadic, affect the subjective evaluation of dialogues and the impressions of a generative AI-driven dialogue system. The authors emphasize that the structures are modeled after children's developmental stages in social communication. The study explores whether the application of these interaction structures can enhance user perceptions and trust in the system (a laptop PC).
- What parts do you consider original or relevant for the field? What specific gap in the field does the paper address?
The authors combine psychology and human-computer interaction by using dyadic and triadic interaction structures (inspired by children’s social communication cues) to design and evaluate dialogue systems. This part is interesting from the point of view of human-computer interaction.
- What does it add to the subject area compared with other published material?
The gap that the authors address is related to understanding how interaction structures influence user perceptions and trust in dialogue systems. Other works have studied dialogue system design, user social cues, user experience, and cognitive factors, among others. However, the authors emphasize communication based on dyadic vs. triadic structures using subjective evaluations, which is an important contribution to psychology and HCI.
- What specific improvements should the authors consider regarding the methodology? What further controls should be considered?
Comment 1: The state of the art is very short and must be improved.
Comment 2: Please, add a proposed methodology figure, in which each stage of the proposed approach is shown in detail in a single image.
Comment 3: Please indicate statistically why using only 13 subjects is good enough for this study. What is the confidence level of the analysis at that sample size? Moreover, I think that using subjects aged 35.8 ± 14.4 years can bias this study. Since the authors’ contribution is not to present a novel artificial intelligence architecture, their contribution should be the user experiments, which may need more than 13 participants with a more extended age range.
Comment 5: More detailed information about the participants is required. Are they students, academics, or field workers? What are the cultural variability and individual differences, among others? A detailed description of the participants is required to improve this work.
Comment 6: The figures’ quality needs to be improved. It seems that all the images were copied from other textbooks.
Comment 7: There is no formal statistical method or machine learning approach to analyze the proposed method scientifically with high standards of rigor. Moreover, the experiment seems to need more subjects to make inferences, and more social, cognitive, and psychological cues to be analyzed to present a substantial contribution.
- Please describe how the conclusions are or are not consistent with the evidence and arguments presented. Please also indicate if all main questions posed were addressed and by which specific experiments.
In general, the conclusions fit the methodology used in this work. The authors emphasize that well-constructed interactions can improve dialogues and impressions of the generative system, improving trustworthiness. The authors evaluated dyadic and triadic settings, and demonstrated how interaction structures affect dialogue evaluation and system impressions. However, the lack of subjects for the experiments, as well as the lack of methodological rigor, makes this article’s contribution hard to see. The limited generalizability and the lack of a real-world context or application are problems that are evident in this work. The variability of the subjects, in particular in the triadic setting evaluations, suggests that the results might be influenced by the subjects. The high-rating results obtained for single subjects in the dyadic experiment are hard to generalize with this small sample of data.
- Are the references appropriate?
The references are only a few. This paper needs more references.
- Please include any additional comments on the tables and figures and quality of the data.
The figures’ quality needs to be improved. It seems that all the images were copied from other textbooks.
Reviewer 4 Report
Comments and Suggestions for Authors
Dear Authors!
You have chosen an interesting topic for your study, but it is highly turbulent at the same time. Consider the fact that the first version of gpt-3.5-turbo became available around the end of 2022 and gpt-4 around March 2023, and these have had several versions as well. Therefore, mentioning the versions of the models is highly important, which you mostly did well.
However, the study has several critical shortcomings that, in my opinion, prevent it from being suitable for publication in a high-impact journal. The main reasons are as follows:
* The study relies on the outdated gpt-3.5-turbo-0613, while more advanced multimodal models like gpt-4o are now available. This significantly reduces the relevance and impact of the findings.
* The sample of 13 participants is too small for meaningful conclusions.
* The study contains frequent grammatical errors and unclear phrasing, making it difficult to follow key arguments. A thorough English revision is necessary.
Please find my remarks below.
L. 40: The paper cited states the exact GPT-4 version ("ChatGPT-4 (from June 2023)"). This should be included in the paper as well.
In which language was the experiment conducted? This should appear in Section 2.1.
What was the reason for using such an old and outdated generative model?
Which program(s) did you use to perform the speech-to-text and text-to-speech conversions?
In my opinion, by using an old generative model and speech-to-text and text-to-speech converters, the results are highly likely to underperform currently available multimodal models like gpt-4o.
L. 128: Previous experiments are mentioned; cite them if possible.
L. 141-143, 177-178, 185-186, 199-200, 222, 226-227: Please add commas between each category.
I am not sure if I understand correctly: according to Table 1 and the Dyadic interaction (with topic) setting, did you not apply any generative models? The Processing part does not seem to be a prompt at all, but mentions that answers were created beforehand. I understood this by the end of the paper, but it was not easily understandable.
Comments on the Quality of English Language
The quality of English makes it often hard to understand the points of the authors.
L. 72: "received with good impressions" → "received good feedback"
L. 156: "asks the subject"