Article
Peer-Review Record

InterviewBot: Real-Time End-to-End Dialogue System for Interviewing Students for College Admission

Information 2023, 14(8), 460; https://doi.org/10.3390/info14080460
by Zihao Wang 1,*, Nathan Keyes 2, Terry Crawford 2 and Jinho D. Choi 1,*
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 6 July 2023 / Revised: 31 July 2023 / Accepted: 8 August 2023 / Published: 15 August 2023
(This article belongs to the Special Issue Feature Papers in Information in 2023)

Round 1

Reviewer 1 Report

InterviewBot dynamically integrates conversation history and customized topics into a consistent embedding space for 10-minute hybrid (open- and closed-domain) conversations.

To build an end-to-end neural dialogue model, 7,361 audio recordings of human interviews were automatically transcribed, of which 440 were manually corrected for fine-tuning and evaluation. To overcome the input/output size limitation of a transformer-based encoder-decoder model, two new methods, namely context attention and topic storing, are proposed to enable the model to interact in a relevant and consistent manner.

The final model was tested both statically, by comparing its responses with the interview data, and dynamically, through live interactions with professional interviewers and students.

The integration of conversation history and topic customization enables the InterviewBot to provide a highly effective and targeted interview experience. The model demonstrates remarkable fluency in its responses and a good understanding of context. In addition, the context attention and topic storing mechanisms help ensure that interactions are relevant and consistent.

I recommend that you consider this paper in your background: 2076-3417/13/2/677.

Both static and dynamic tests demonstrate respondents' high satisfaction with the InterviewBot.

Author Response

Thank you for your comment!

Reviewer 2 Report

The authors proposed InterviewBot for conducting hybrid-domain conversations with foreign students applying to U.S. colleges. They also proposed two new methods, context attention and topic storing, to address input/output size limitations in transformer-based models. The authors clearly show the diversity and details of the interviewees in the dataset and the solution for diarization errors. The results of the InterviewBot are also promising, even though there are still limitations.

In principle, the approach presented merits a favorable review, but a couple of major issues would need to be resolved before it could be accepted.

Major comments:

·      For static evaluation, the authors should consider some alternatives to capture the model's capability in conducting long contextualized interviews.

·      For real-time evaluation, the authors should provide more information about the 5 experts who evaluate the results. Although over half of the evaluators are satisfied, the authors should provide details of the ratings of experts and students rather than vaguely combining the results into one table.

·      The authors should also compare the InterviewBot with other state-of-the-art chatbots.

·      There were previous attempts to incorporate memory into transformer models, for example, Memory Transformer (https://arxiv.org/abs/2006.11527), Recurrent Memory Transformer (https://proceedings.neurips.cc/paper_files/paper/2022/file/47e288629a6996a17ce50b90a056a0e1-Paper-Conference.pdf#page10), and Memory transformer with hierarchical attention for long document processing (https://ieeexplore.ieee.org/document/9681776). These methods are similar in principle to the topic-storing method proposed in the paper, in that they store previous contexts to improve performance. Please explain the novelty of the proposed method compared to these previous works.

·      Dialogue Act Classification with Context-Aware Self-Attention (https://aclanthology.org/N19-1373/) and Context and Knowledge Enriched Transformer Framework for Emotion Recognition in Conversations (https://ieeexplore.ieee.org/document/9533452) also consider previous utterances in classification. Is the context-attention method in the paper novel compared to the previous work?

·      Please explain whether there is any significant contribution, improvement, or novelty in the speaker diarization model. Using a transformer to create a context embedding seems essential, but creating a context-aware mechanism is not new, and performance did not increase much.
Minor Comments:

·      Is it possible to share the code and data of the paper?

·      Grammar and writing should be checked throughout the whole paper.

Author Response

Thank you for your comments! Here are our responses:

Question: For static evaluation, authors should consider some alternatives to capture the model's capability in conducting long contextualized interviews.

Great point. We have surveyed current metrics. Metrics other than the BLEU score typically require extensive annotation covering different perspectives of the generated text. However, the main purpose of this paper is to explore how fluent the dialogue flow can be rather than to dive into the different perspectives of a conversation. We opted for the BLEU metric because it is a conventional measure extensively employed in prior research on machine translation and text generation. Nevertheless, we acknowledge that in hybrid-domain conversations, metrics such as the BLEU score may not provide a comprehensive evaluation for these tasks. As a result, we have incorporated human evaluations to supplement and enhance our assessment process.
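For reference, a minimal sketch of how a response-level BLEU score can be computed with NLTK (illustrative only; the example strings are made up and this is not our exact evaluation script):

```python
# Minimal BLEU sketch using NLTK (illustrative only). Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical generated response and its reference from the interview data.
reference = "tell me about a project you are proud of".split()
candidate = "tell me about a project you are most proud of".split()

# Smoothing avoids zero scores when higher-order n-grams have no overlap,
# which is common for short dialogue responses.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```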

Question: For real-time evaluation, authors should provide more information about the five experts who evaluated the results. Although over half of the evaluators are satisfied, authors should provide details of the ratings of experts and students rather than vaguely combining the results into one table.

This is a good point. We’ve updated the table in the paper. The detailed evaluation scores from professional interviewers and students are listed below:

 

Group         5    4    3    2    1    Average
Interviewer   1    1    2    1    0    3.4
Student       2    3    4    0    1    3.5
Total         3    4    6    1    1    3.5

(Each cell is the number of evaluators who gave that rating score; the last column is the per-group average.)
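The per-group averages in the table are the weighted means of the rating counts; a short sketch of that arithmetic (for verification only):

```python
# Weighted-mean check for the rating table above (ratings 5..1 with counts of evaluators).
def weighted_average(counts_by_rating):
    total = sum(counts_by_rating.values())
    return sum(rating * count for rating, count in counts_by_rating.items()) / total

interviewer = {5: 1, 4: 1, 3: 2, 2: 1, 1: 0}
student = {5: 2, 4: 3, 3: 4, 2: 0, 1: 1}
total = {r: interviewer[r] + student[r] for r in interviewer}

print(round(weighted_average(interviewer), 1))  # 3.4
print(round(weighted_average(student), 1))      # 3.5
print(round(weighted_average(total), 1))        # 3.5
```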

Question: Authors should also compare the InterviewBot with other state-of-the-art chatbots.

We’ve added a discussion of a couple of LLMs at the end of the related work section. The next advancement beyond language models like BlenderBot is large language models. However, there are still limitations associated with input lengths, how to integrate extra context into the models, etc. The main purpose of our paper is to use context to create a better conversation flow while covering the possible topics restricted by an interview setup and, at the same time, overcoming the input length limitation.

Question: There were previous attempts to incorporate memory into transformer models, for example, Memory Transformer, Recurrent Memory Transformer, and Memory transformer with hierarchical attention for long document processing. These methods are similar in principle to the proposed topic-storing method in the paper, in that they store previous contexts to improve performance. Please explain the novelty of the proposed method compared to these previous works.

Question: Dialogue Act Classification with Context-Aware Self-Attention and Context and Knowledge Enriched Transformer Framework for Emotion Recognition in Conversations also consider previous utterances in classification. Is the context-attention method in the paper novel compared to the previous work?

Great question. We’ve included descriptions comparing our context-based model with some previous memory-based models in the related work. The main point is that simply utilizing the previous context is not enough, because an ongoing conversation may refer back to keywords or sentences scattered earlier in the dialogue. We are more focused on the perspective of dialogue flow, where the conversation may either go deeper into a topic or transition to a different one. Topic storing tries to avoid going back to topics that have already been discussed, even if they contain concepts related to the current topic, and also tries to transition to a different topic when needed.
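To make that intuition concrete, here is a minimal sketch (hypothetical Python, not our actual implementation; the helper names and the relevance function are assumptions) of how storing covered topics steers the next topical question away from material that has already been discussed:

```python
# Illustrative sketch of the topic-storing intuition (hypothetical code).
from typing import Callable, Dict, Optional, Set

def next_topical_question(
    topical_questions: Dict[str, str],     # topic id -> topical question text
    covered_topics: Set[str],              # topics already discussed in this interview
    relevance: Callable[[str], float],     # scores a question against the current context
) -> Optional[str]:
    # Exclude topics that were already covered, even if they relate to the current one.
    candidates = {t: q for t, q in topical_questions.items() if t not in covered_topics}
    if not candidates:
        return None                         # every topic covered -> wrap up the interview
    best_topic = max(candidates, key=lambda t: relevance(candidates[t]))
    covered_topics.add(best_topic)          # store the topic so it is not revisited later
    return candidates[best_topic]
```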

Question: Please explain if there is any significant contribution, improvement, or novelty in the speaker diarization model. Using a transformer to create a context embedding seems essential, but creating a context-aware mechanism is not new, and performance did not increase much.

This is a good point. We decided to include this work for two reasons. First, speaker diarization has mostly been approached by analyzing audio, or audio combined with text, whereas we approach the problem using only the transcription text. Second, speaker diarization is a necessary procedure for the later generation task.

Reviewer 3 Report

Summary

The authors describe a system, InterviewBot, designed to conduct 10-minute conversations with foreign students looking to apply to US colleges. InterviewBot is a neural-based dialogue model performing two basic tasks: diarization of audio streams into discrete segments for each user, and generation of the interviewer's utterances. To accomplish the latter task, an existing chatbot (BlenderBot, based on a Transformer architecture) is used, augmented by the authors with context attention and topic storing, and fine-tuned using manually diarized interviews. Different aspects of the system's performance are tested (diarization and BLEU score against a test dataset, and a performance rating by several interviewers and students).

Pros

* The paper is well written and, in general, easy to understand (with some caveats, see below). The English used is mostly correct, as far as I can see.

* The topic is interesting and the solutions proposed by the authors are in line with current research. The research design seems appropriate and test results indicate improvement over the base model's performance in the measured areas.

Cons

* I tend to be somewhat skeptical about the potential of language models based on transformer architectures to do an adequate job of mimicking understanding and interaction, given that there is no real understanding on their part (as one would expect from a genuine general AI). But a lot of research is being devoted to testing the limits of this approach, and my skepticism could very well be unwarranted.

* Section 2 (related work) is a little short on detail. While most of the references are devoted to current research, there is not a lot of explanation of what differentiates these approaches from this work. In order to make the paper a little more self-contained, I would like to see short descriptions of the previous works referenced.

* I think it would be beneficial to the reader to see the technical contributions of this work (topic storing, context attention, dataset) listed at the end of the introduction after the description of the chatbot, along with a short description of what is going to be covered in the next sections, as is customary in this kind of text.

* I am not sure what the logic is behind the context model in section 3.2.3 (the part of the architecture generating the O_u). If I understand correctly, it is a binary classification model predicting whether there is a diarization error. If I am right, could you explain in the text your motivation for generating that output instead of just correcting diarization errors observed in the training/test datasets? And if I am wrong, could you explain what the context model really adds to the diarization system?

* If I understand correctly, the authors are using a sliding window with an overlap between consecutive positions (covering words w_{m+1} to w_{n}) in order to process utterances longer than the maximum segment length of the transformer architecture. I think it would be beneficial to the reader for this to be stated in the text in this (or a similar) way instead of just relying on the mathematical representation of E^1 and E^2 and Fig. 6.

* The discussion in subsubsection 3.3.4 (Topic storing) is confusing to me. What is the topical question set? An enumeration of conversation topics specific to the college interview task? Where do they come from? Are the interviewer utterances in the training data annotated with them? How? In addition, the text uses the phrase "first token" to designate the flag in each utterance signalling the beginning/end of a conversation, a topical question, or an utterance from the interviewee. I found that slightly confusing because it implies that the flag is part of the utterance, when it is really not.

* How are the 1xd embeddings of topical questions generated? As the average embedding of the tokens in the question? Could you please explain it in the text?

* I think there is an error in line 164. It says that the context model produces O_w, when it should be o_u (if my understanding of the paper is correct).

* I would like to see a little more technical information about the characteristics of the model, especially hyperparameters (type of optimizer, learning rate, dropout, etc.) and how they have been chosen (hyperparameter search or based on similar works). That kind of information could be useful for other researchers.

* In section 4.2.1 there should be a reference to the previous work the authors are following when they conduct a static evaluation.

* If I am not mistaken, the results of Table 5 are computed using the chatbot on the (automatically) diarized test set from the interview data. Could you please state it unequivocally in subsection 4.2?

* I am skeptical of how well n-gram similarity measures like BLEU serve as indicators of performance in this kind of task, which is why I am not giving a lot of importance to the metrics in subsubsection 4.2.1. But, since this kind of evaluation has been used in previous works, it makes sense to include it.

Author Response

Thank you for your comments!

Question: Section 2 (related work) is a little short on detail. While most of the references are devoted to current research, there is not a lot of explanation of what differentiates these approaches from this work. In order to make the paper a little more self-contained, I would like to see short descriptions of the previous works referenced.

We’ve updated the related work with descriptions of some previous work.

Question: I think it would be beneficial to the reader to see the technical contributions of this work (topic storing, context attention, dataset) listed at the end of the introduction after the description of the chatbot, along with a short description of what is going to be covered in the next sections, as is customary in this kind of text.

Great point. We’ve updated the paper with the list of contributions in the introduction as well as the organization of the paper.

Question: I am not sure what the logic is behind the context model in section 3.2.3 (the part of the architecture generating the O_u). If I understand correctly, it is a binary classification model predicting whether there is a diarization error. If I am right, could you explain in the text your motivation for generating that output instead of just correcting diarization errors observed in the training/test datasets? And if I am wrong, could you explain what the context model really adds to the diarization system?

Great point. We’ve updated the description with some explanation. The main point is that we want to enhance the correction of diarization errors by first making sure the correction is applied to the target sentence.
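As an illustration of that binary decision (hypothetical PyTorch-style code; the encoder, tensor shapes, and module names are assumptions, not our actual architecture):

```python
# Hypothetical sketch: a context-aware binary classifier predicting whether
# the target sentence contains a diarization error (illustrative only).
import torch
import torch.nn as nn

class DiarizationErrorClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int):
        super().__init__()
        self.encoder = encoder                         # any transformer encoder returning (batch, seq, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, context_and_target_ids: torch.Tensor) -> torch.Tensor:
        # Encode the target sentence together with its surrounding context,
        # then classify from the representation at the first position.
        hidden = self.encoder(context_and_target_ids)  # (batch, seq, hidden_dim)
        logit = self.classifier(hidden[:, 0])          # (batch, 1)
        return torch.sigmoid(logit)                    # probability of a diarization error
```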

Question: If I understand correctly, the authors are using a sliding window with an overlap between consecutive positions (covering words w_{m+1} to w_{n}) in order to process utterances longer than the maximum segment length of the transformer architecture. I think it would be beneficial to the reader for this to be stated in the text in this (or a similar) way instead of just relying on the mathematical representation of E^1 and E^2 and Fig. 6.

Good point. We added a brief description of what we do before the mathematical formulation.
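For reference, a small self-contained sketch of the overlapping sliding-window idea (the window and overlap sizes below are placeholders, not the settings used in the paper):

```python
# Overlapping sliding-window chunking sketch (illustrative; window and
# overlap sizes are placeholders, not the paper's settings).
from typing import List

def sliding_windows(tokens: List[str], window: int = 128, overlap: int = 32) -> List[List[str]]:
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break                                   # the last window already reaches the end
    return chunks

# Consecutive chunks share `overlap` tokens, so an utterance longer than the
# transformer's maximum segment length can be processed without losing
# context at chunk boundaries.
```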

Question: The discussion in subsubsection 3.3.4 (Topic storing) is confusing to me. What is the topical question set? An enumeration of conversation topics specific to the college interview task? Where do they come from? Are the interviewer utterances in the training data annotated with them? How? In addition, the text uses the phrase "first token" to designate the flag in each utterance signalling the beginning/end of a conversation, a topical question, or an utterance from the interviewee. I found that slightly confusing because it implies that the flag is part of the utterance, when it is really not.

We’ve highlighted the part that explains the topical questions. They are annotated by the data provider. The first tokens are used to indicate the speaker as well as the role of the sentence, such as a topical question or the beginning or end of the conversation.
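As a small hypothetical illustration of such role-marking first tokens (the actual special tokens used in the paper may differ):

```python
# Hypothetical first tokens marking the speaker and the role of each sentence
# (the token names here are invented for illustration only).
examples = [
    "<interviewer> <topic> Tell me about your favorite extracurricular activity.",
    "<interviewee> I have been the captain of my school's debate team for two years.",
    "<interviewer> <end> Thank you for your time today.",
]
```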

Question: How are the 1xd embeddings of topical questions generated? As the average embedding of the tokens in the question? Could you please explain it in the text?

Good point. After the encoding process, the 1xd embeddings are produced by an abstraction step through feedforward layers, so that the resulting vectors can be stacked easily.
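A minimal sketch of that abstraction step (hypothetical PyTorch code; the pooling choice, layer sizes, and names are assumptions rather than our exact implementation):

```python
# Hypothetical sketch: each topical question is encoded, abstracted into a
# single 1 x d vector by feedforward layers, and the vectors are stacked
# (illustrative only, not the paper's code).
from typing import List

import torch
import torch.nn as nn

class TopicAbstractor(nn.Module):
    def __init__(self, encoder: nn.Module, d: int):
        super().__init__()
        self.encoder = encoder                                    # returns (seq_len, d) per question
        self.ff = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, question_token_ids: List[torch.Tensor]) -> torch.Tensor:
        vectors = []
        for ids in question_token_ids:
            hidden = self.encoder(ids)                            # (seq_len, d)
            pooled = hidden.mean(dim=0)                           # collapse the sequence to (d,)
            vectors.append(self.ff(pooled))                       # abstracted 1 x d vector
        return torch.stack(vectors)                               # (num_questions, d), ready to attend over
```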

Question: I think there is an error in line 164. It says that the context model produces O_w, when it should be o_u (if my understanding of the paper is correct).

Good catch. Thank you. Corrected.

Question: I would like to see a little more technical information about the characteristics of the model, especially hyperparameters (type of optimizer, learning rate, dropout, etc.) and how they have been chosen (hyperparameter search or based on similar works). That kind of information could be useful for other researchers.

We basically followed the hyperparameter settings of the original BlenderBot setup and did not spend effort optimizing them further.

Question: In section 4.2.1 there should be a reference to the previous work the authors are following when they conduct a static evaluation.

Good point. We’ve added a reference to previous work that used BLEU scores for this kind of evaluation.

Question: If I am not mistaken, the results of Table 5 are computed using the chatbot on the (automatically) diarized test set from the interview data. Could you please state it unequivocally in subsection 4.2?

We used the output of the diarization model as input for the text generation.

Question: I am skeptical of how well n-gram similarity measures like BLEU serve as indicators of performance in this kind of task, which is why I am not giving a lot of importance to the metrics in subsubsection 4.2.1. But, since this kind of evaluation has been used in previous works, it makes sense to include it.

This question has been raised both by ourselves and by other reviewers. We opted for the BLEU metric because it is a conventional measure extensively employed in prior research on machine translation and text generation. Nevertheless, we acknowledge that in hybrid-domain conversations, metrics such as the BLEU score may not provide a comprehensive evaluation. As a result, we have incorporated human evaluations to supplement and enhance our assessment process.

Round 2

Reviewer 2 Report

The authors have addressed our questions.

Reviewer 3 Report

All my concerns have been addressed. In my opinion, this paper is ready for publication.
