Article
Peer-Review Record

Synthesizing a Talking Child Avatar to Train Interviewers Working with Maltreated Children

Big Data Cogn. Comput. 2022, 6(2), 62; https://doi.org/10.3390/bdcc6020062
by Pegah Salehi 1,*, Syed Zohaib Hassan 1, Myrthe Lammerse 2, Saeed Shafiee Sabet 1, Ingvild Riiser 2, Ragnhild Klingenberg Røed 2, Miriam S. Johnson 2, Vajira Thambawita 1, Steven A. Hicks 1, Martine Powell 3, Michael E. Lamb 4, Gunn Astrid Baugerud 2, Pål Halvorsen 1,2 and Michael A. Riegler 1,5
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 30 April 2022 / Revised: 20 May 2022 / Accepted: 21 May 2022 / Published: 1 June 2022
(This article belongs to the Special Issue Multimedia Systems for Multimedia Big Data)

Round 1

Reviewer 1 Report

The manuscript presents three subjective studies investigating and comparing various state-of-the-art methods for achieving multiple aspects of child avatars: 1) evaluating the overall system and showing that the system is well-received by experts and emphasizing realism; 2) the affective component and how it integrates with video and audio; 3) the authenticity of the auditory and visual components of avatars created by different methods. Insights and feedback from these studies have led to improvements and improved architectures for children's avatar systems. This is a well-structured and well-written paper. A large amount of related past work is given. The results are meaningful.

Author Response

May 19, 2022.

Dear Editor of the Big Data and Cognitive Computing Journal,

We have revised the manuscript according to the referees' comments and uploaded the revised version. This cover letter goes through the specifics of the manuscript amendments and our responses to the referees' comments.

 

 

Response to Reviewer 1 Comments:

 

The manuscript presents three subjective studies investigating and comparing various state-of-the-art methods for achieving multiple aspects of child avatars: 1) evaluating the overall system and showing that the system is well-received by experts and emphasizing realism; 2) the affective component and how it integrates with video and audio; 3) the authenticity of the auditory and visual components of avatars created by different methods. Insights and feedback from these studies have led to improvements and improved architectures for children's avatar systems. This is a well-structured and well-written paper. A large amount of related past work is given. The results are meaningful.

 

 

Response to Reviewer 2 Comments:

 

This paper is titled – “Synthesizing Talking Child-Avatar to Train Interviewers Working with Maltreated Children”. In the paper, the authors have presented three subjective studies that investigate and compare various state-of-the-art methods for implementing multiple aspects of the child avatar. The first user study evaluates the whole system and shows that the system is well received by the expert and highlights the importance of realism. The second user study investigates the emotional component and how it can be integrated with video and audio, and the third user study investigates the realism in the auditory and visual components of the avatar created by different methods. The work deals with an interesting research problem in this field and has the potential for impact to the discipline. However, the presentation of certain parts should be improved. It is suggested that the authors make the necessary changes/updates to their paper as per the following comments:

 

Point 1:  The authors mention that their chatbot was developed specifically to practice an investigative interviewing methodology following best practice guidelines and inspired by [11]. Briefly explain these guidelines and outline why they are considered best practices.

 

 

The best practice guidelines are explained in the “Introduction” section; in addition, we added a brief explanation in the form of an example.

 


“These best practice investigative interview guidelines provide interviewers with clear instructions on how child witnesses should be questioned and supported in a non-suggestive way during the interview to maximize the value of their testimony [10, 11]. In particular, this includes providing children with open-ended prompts and avoiding forced-choice and suggestive question types, as free recall encourages more accurate and longer responses from children [12, 13] (e.g., "What happened next?", where the child is given no information that could influence their answer). Interviewers should also avoid suggestive or leading questions (e.g., "The person touched your private part. Is it true?" or any other question that could lead the child to tell a specific story). Unfortunately, however, most interviewers do not adhere to such best practice guidelines [14].”

 

 

Point 2: In line 498, the authors mention about 39 participants being involved in the experiments. Details are missing here. Please provide details about how these participants were recruited. What were the inclusion/exclusion criteria? Did the diversity characteristics of the participants have any impact on the results?

 

Addressed by adding the following paragraph to the paper:

 

“The study was conducted through crowdsourcing, with Microworkers used as the recruiting platform; users were referred to a questionnaire tool hosted on a separate server containing the videos. To ensure the validity and reliability of the collected data, only highly rated crowdworkers who performed best on a qualification test were invited to participate. Overall, 39 crowdworkers provided valid results in this study: 10 women, 27 men, and two participants of other genders. The crowdworkers, aged 19 to 54 years (Median = 28, Mean = 29.58, SD = 8.36), were geographically evenly distributed across Europe, Asia, and the Americas (North and South).”

 

Regarding the impact of human influencing factors: the number of samples within each characteristic is insufficient to scientifically investigate its impact (e.g., only two participants of other genders and ten women). Human influencing factors have been studied before, and it has been shown that they are complex and strongly interrelated. These characteristics can describe the demographic and socio-economic background, the physical and mental constitution, or the user’s emotional state, and can have an invariant character [QoE Whitepaper]. Therefore, investigating the impact of these characteristics requires a very high number of samples within each category, which was out of the scope of the current work.

We also added the following paragraph to the “Discussions and future work” section.


“Further, future work will explore human influencing factors and their impact on the user experience. Human influencing factors can describe demographics such as age, gender, and socio-economic background, the physical and mental constitution, or the user’s emotional state [102], all of which can affect the user experience.”

 

 

Point 3: Discussion of results needs improvement: Please compare the findings (both quantitatively and qualitatively) with prior works in this field to discuss how either the results obtained have never been obtained before or how the findings outperform prior works in this field.

 

The proposed system is entirely different from previous works and is not directly comparable to them. In the “Child interview training avatars” subsection of the related work, we describe two avatars that were created to support interviewer training.

 

“Empowering Interviewer Training (EIT) [82] is an investigative interview training program. Child responses are pre-defined in the system and selected using a rule-based algorithm. Based on the selected response, prerecorded videos of children showing different emotions are chosen by a human operator and shown to the user.

In Sweden, Linnæus University and AvBIT Labs have also introduced an online interview training system. It likewise uses prerecorded audio responses and videos of a child avatar together with a human operator: the user is shown an appropriate video response, with suitable emotion, controlled through Wirecast software via the Skype interface [83, 84].

Even though the development and testing of these systems have shown that they can help teach the skills needed to interview abused children, they are not dynamic in their response generation and require human input during the response selection phase. This makes them rigid and harder to operate.”

 

Also, addressed by adding the following paragraph to the “Discussions and future work” section.

 

“The proposed system responds dynamically to questions, provides a higher level of realism during training interviews, and is completely independent of human input. Unlike previous systems, which were rigid in their response generation, lacked generality, and required human input, our system generates responses without a human operator. It is therefore more flexible and cost-effective, since it lowers expenditure on human resources.”

 

 

Point 4: Multiple old references in Section 2.4: In this section several of the references that have been cited to support the fact-based statements are old. For instance, in this sentence – “With the recent advances in function of CNNs [45] and recurrent neural networks (RNNs)……” the paper cited (reference [45]) is 8 years old. Update the old references in this section wherever applicable.

 

  • Reference [45] was replaced by a newer paper (Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention), and we also added a more recent reference for RNNs (Speech recognition using recurrent neural networks).
  • The reference to WaveNet was replaced by two newer references (models such as WaveNet [55, 56] and ...).
  • The reference to LibriSpeech was replaced by a newer paper (benchmarks such as LibriSpeech [48] and ...).

 

 

Point 5: From Figure 10, it can be observed that the quality ratings of “Sarah” are considerably less compared to the ratings of the rest of the characters. Please explain this observation including possible reasons for the same.

 

As shown in the paper, there is no significant difference between Sarah and the other three avatars (Talking p = .12, Appearance p = .11, and Overall experience p = .30). However, inspection of the source videos suggests that the lip-sync generation for Sarah was not as good as for the other three avatars, which might explain the slight differences in ratings; again, these differences were not statistically significant.

For clarification, the following sentence has been added to the paper.


“Sarah's character was rated slightly lower than the others, which might be due to weaker lip synchronization, but there was no statistically significant difference.”

 

 

Point 6: There are a few spelling mistakes throughout the paper. For instance, in line number 106, “feedback” is spelled as “feedbake”.

 

We have proofread the entire manuscript and corrected the spelling errors, including the one noted in line 106.

 

 

 

 

Yours Sincerely,

Pegah Salehi

Simula Metropolitan Center for Digital Engineering,

Tel. +4792097694

Email: [email protected]

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper is titled – “Synthesizing Talking Child-Avatar to Train Interviewers Working with Maltreated Children”. In the paper, the authors have presented three subjective studies that investigate and compare various state-of-the-art methods for implementing multiple aspects of the child avatar. The first user study evaluates the whole system and shows that the system is well received by the expert and highlights the importance of realism. The second user study investigates the emotional component and how it can be integrated with video and audio, and the third user study investigates the realism in the auditory and visual components of the avatar created by different methods. The work deals with an interesting research problem in this field and has the potential for impact to the discipline. However, the presentation of certain parts should be improved. It is suggested that the authors make the necessary changes/updates to their paper as per the following comments:

  1. The authors mention that their chatbot was developed specifically to practice an investigative interviewing methodology following best practice guidelines and inspired by [11]. Briefly explain these guidelines and outline why they are considered best practices.
  2. In line 498, the authors mention about 39 participants being involved in the experiments. Details are missing here. Please provide details about how these participants were recruited. What were the inclusion/exclusion criteria? Did the diversity characteristics of the participants have any impact on the results?
  3. Discussion of results needs improvement: Please compare the findings (both quantitatively and qualitatively) with prior works in this field to discuss how either the results obtained have never been obtained before or how the findings outperform prior works in this field.
  4. Multiple old references in Section 2.4: In this section several of the references that have been cited to support the fact-based statements are old. For instance, in this sentence – “With the recent advances in function of CNNs [45] and recurrent neural networks (RNNs)……” the paper cited (reference [45]) is 8 years old. Update the old references in this section wherever applicable.
  5. From Figure 10, it can be observed that the quality ratings of “Sarah” are considerably less compared to the ratings of the rest of the characters. Please explain this observation including possible reasons for the same.
  6. There are a few spelling mistakes throughout the paper. For instance, in line number 106, “feedback” is spelled as “feedbake”.

Author Response


Author Response File: Author Response.pdf
