InterviewBot: Real-Time End-to-End Dialogue System to Interview Students for College Admission

We present the InterviewBot, which dynamically integrates conversation history and customized topics into a coherent embedding space to conduct 10-minute hybrid-domain (open and closed) conversations with foreign students applying to U.S. colleges, in order to assess their academic and cultural readiness. To build a neural end-to-end dialogue model, 7,361 audio recordings of human-to-human interviews are automatically transcribed, of which 440 are manually corrected for finetuning and evaluation. To overcome the input/output size limit of a transformer-based encoder-decoder model, two new methods are proposed, context attention and topic storing, allowing the model to make relevant and consistent interactions. Our final model is tested both statically, by comparing its responses to the interview data, and dynamically, by inviting professional interviewers and various students to interact with it in real time, finding it highly satisfactory in fluency and context awareness.


Introduction
With the latest advancement of Conversational AI, end-to-end dialogue systems have been extensively studied [1][2][3]. One critical requirement is context awareness; robust dialogue systems must consider relevant parts in conversation history to generate pertinent responses [4][5][6][7][8]. However, these systems still suffer from issues such as hallucination, inconsistency, or a lack of commonsense [9], hindering their adoption in real applications.
Numerous admission interviews are given every year to students located in 100+ countries applying to colleges in the U.S., where the interviews are often conducted online. Those interviews are usually unscripted, with an emphasis on asking the applicants thought-provoking questions based on their interests and experiences. The main objective is to provide decision-makers (e.g., admissions officers, faculty members) with an unfiltered look at those students in a daily academic environment.
Building an interview chatbot, called InterviewBot, will save time and effort for the interviewers and provide foreign students with a cost-efficient way of practicing interviews when native speakers are unavailable. Nonetheless, there are a few hurdles to developing an end-to-end InterviewBot. First, it is hard to collect a sufficient amount of data covering dialogues crossing open & closed domains (Section 3.1). Second, most transformer-based encoder-decoder models adopted by current state-of-the-art systems are not designed to handle long contexts; thus, they often repeat or forget previously discussed topics (Section 3.3). Third, it is demanding to find appropriate people to interactively test such a dialogue system with the professional objective (Section 4).
This paper presents an end-to-end dialogue system that interacts with international applicants to U.S. colleges. The system questions critical perspectives, follows up on the interviewee's responses for in-depth discussions, and makes natural transitions from one topic to another until the interview ends, which lasts about 30 turns (5 mins for text-based, 10 mins for spoken dialogues). To the best of our knowledge, it is the first real-time system using a neural model, completely unscripted, conducting such long conversations for admission interviews. Our technical contributions are summarized as follows:
• We have developed a contextualized neural model designed to perform diarization tasks on text transcripts alone.
• We have integrated a sliding window technique to overcome the input token limit and restore the completeness of the input in the latent space.
• We have integrated extracted topics from the conversation to address issues related to topic repetition, off-topic discussions, and premature endings in conversations.
The remaining sections are organized as follows: Section 2 reviews current dialogue models, their applications, and limitations. Section 3 describes datasets and our speaker diarization and InterviewBot model architectures in detail. Section 4 gives experiment results on diarization and InterviewBot dialogue generation. Section 5 and Section 6 discuss the results and conclude the paper.

Related Work
Dialogue systems can be categorized into closed- and open-domain systems [10]. Closed-domain systems require efficient access to domain knowledge [11] and serve specific professions such as education [12], healthcare [13,14], or customer service [15,16]. Open-domain systems converse across multiple domains with natural transitions [2] and conduct interactions in a broader horizon [17][18][19]. For admission interviews, however, the conversation is often a mixture of closed (job-related questions) and open-domain (general aspects of the applicant) dialogues, which makes it more challenging to build an end-to-end system. Several dialogue systems have been developed to communicate with humans for information exchange or elicitation across multiple domains [20][21][22]. [19] built a conversational system to converse proactively on popular topics with Alexa users by providing them with the requested information as well as pre-crafted transitions. [23] established a virtual interviewer to study the effect of personality on confiding in and listening to virtual agents. [24] studied the role of a chatbot in a survey setup. Although these dialogue systems have shown their effectiveness in achieving their goals, they all rely heavily on design templates. Conversational agents for interviews have been experimented with for law enforcement [25], healthcare [26], job application [27], and psychology [28], among which most are proofs of concept. A few interview bots have been developed on commercial platforms such as Google Dialogflow and IBM Watson Assistant, with the limitation of pre-scripted interviews; thus, they cannot proactively follow up on the user content.
Context and memory have been studied as key factors affecting model performance in context-heavy settings. [29] proposed a memory transformer that employs memory hierarchically to improve translation performance. However, in a more complex conversation setup, dialogue flow is not only about semantic correlations between sentences or words but also about how conversations proceed in depth on a topic and transition to other topics. Other models such as [30] and [31] have proposed context- and external-knowledge-based models for conversation-related tasks. Although these efforts were shown to improve specific metrics, they are still not sufficient to improve the overall dialogue flow of conversations.
Deep language models, such as Blenderbot [3] and Bart [32], have taken context into consideration. However, the limitation on the length of input tokens as well as conversation history has bottlenecked their applications in the real world. Recent surges of large language models, such as ChatGPT [33] and LLaMa [34], have shown strong evidence of improvement with respect to context integration. Nevertheless, there are always limitations on the input length, as well as on effective ways of integrating different contexts into a language model.

Interview Dataset
Audio recordings of 7,361 interviews are automatically transcribed with speaker identification by the online tool RevAI, where 440 are manually corrected on speaker ID assignment for finetuning and evaluation of our models (Table 1). Each recording contains a dialogue of about 15 minutes on average between an interviewer and an interviewee. The interviews were conducted by 67 professionals in 2018-2022. The largest age group of interviewees is 18-year-olds with 59.3%, followed by 17-year-olds with 29.4%. The male-to-female ratio is 1.2:1. The major country of origin is China with 81.4%, followed by Belgium with 10.5%, alongside 37 other countries. Table 1 provides detailed demographics of the interviewees.
All recordings are transcribed into text and speakers are identified automatically. For speech recognition, three tools from Amazon, Google, and RevAI are assessed on 5 recordings for speaker diarization, achieving F1-scores of 66.3%, 50.1%, and 72.7%, respectively. Figure 1 shows the distribution of the ages of applicants. Most interviewees are between 17 and 19, which is an accurate reflection of the ages of high school students applying to colleges. Figure 2 shows the distribution of the applicants' countries of origin. There are 38 countries in total. The majority of applicants come from China. Other major countries are Belgium, Bangladesh, Canada, India, and Belarus. The gender distribution of applicants is shown in Figure 3. The numbers of male and female applicants are close, excluding applicants who did not provide gender information.

Speaker Diarization
Speaker diarization is the task of segmenting an audio stream into utterances according to the speaker's identity and is considered critical in automatic transcription [35]. Conversation data with diarization errors can lead to a major failure in building robust dialogue models. Our most accurate transcriber, RevAI, still gives 27.3% errors for speaker diarization (Section 3.1). The main reason is that audios from the interviewer (S1) and the interviewee (S2) are recorded in one channel, so that they are saved in a single waveform, while no clear pauses exist between S1 & S2's speeches or their speeches often overlap. The following example illustrates a case where the speech of S2 (in brackets) is not recognized as a separate utterance:
S1: Hi , it 's nice to meet you . [Nice to meet you .]
S2: Um , can you tell me what is a topic that um , you cannot stop talking about ?
Thus, speaker diarization models are developed to provide clean data to our dialogue model (Sec. 3.3). Figure 4 depicts the distributions of different types of diarization errors found in 100 dialogues. Most errors are caused by filler words and arbitrary concatenation (joining multiple utterances as one with no apparent patterns, not caused by filler words).

Manual Annotation
440 dialogues are sampled, in which every token is annotated 1 if it is one of the last two tokens of an utterance before the speaker is switched, and 0 otherwise. For the above example, the 8-9'th tokens are the last two tokens of the utterance before it switches to S2 and so are the 13-14'th tokens before switching to S1; thus, they are annotated 1 (shown as token/label):
Hi/0 ,/0 it/0 's/0 nice/0 to/0 meet/0 you/1 ./1 Nice/0 to/0 meet/0 you/1 ./1
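The labeling rule above can be sketched in a few lines of Python. This is a minimal illustration and not the annotation pipeline used in this work: the function name `boundary_labels` and the `(speaker, tokens)` input format are assumptions made for the example, and the input is assumed to be the correctly segmented dialogue.

```python
from typing import List, Tuple


def boundary_labels(utterances: List[Tuple[str, List[str]]], k: int = 2) -> List[int]:
    """Label each token 1 if it is among the last k tokens of an utterance
    that is followed by a speaker switch, and 0 otherwise.

    `utterances` is a hypothetical input format: (speaker, tokens) pairs
    taken from the correctly segmented dialogue.
    """
    labels: List[int] = []
    for idx, (speaker, tokens) in enumerate(utterances):
        # A switch happens only if a next utterance exists with another speaker.
        has_switch = idx + 1 < len(utterances) and utterances[idx + 1][0] != speaker
        current = [0] * len(tokens)
        if has_switch:
            for pos in range(max(0, len(tokens) - k), len(tokens)):
                current[pos] = 1
        labels.extend(current)
    return labels


dialogue = [
    ("S1", "Hi , it 's nice to meet you .".split()),
    ("S2", "Nice to meet you .".split()),
    ("S1", "Um , can you tell me".split()),
]
print(boundary_labels(dialogue))
```

With the example dialogue, the 8-9'th and 13-14'th tokens receive label 1, matching the annotation described above; the final utterance receives only 0 because no speaker switch follows it.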
Doccano is used as the annotation tool [36], and ELIT is used for the tokenization [37]. To measure the inter-annotator agreement, ten dialogues are double-annotated, showing a high kappa score of 84.4%.
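For readers unfamiliar with the agreement measure, the sketch below computes Cohen's kappa for two annotators over token-level labels. The source only states "kappa", so the choice of Cohen's variant is an assumption here, and the toy label sequences are invented for illustration.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two annotators labeling the same items.

    Assumption: the two-annotator (Cohen) variant; the source does not
    specify which kappa statistic was used.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's label distribution.
    expected = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy token-level labels from two hypothetical annotators.
annotator_1 = [0, 0, 1, 1, 0, 0, 1, 1]
annotator_2 = [0, 0, 1, 1, 0, 0, 1, 0]
print(round(cohen_kappa(annotator_1, annotator_2), 2))
```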

Pseudo Annotation
Because our annotated data are relatively small, a larger dataset is pseudo-created for this task using 2,400 dialogues in the Switchboard [38] and 6,808 dialogues in the BlendedSkillTalk [39] datasets (thus, a total of 9,208 dialogues). These two datasets are chosen because their dialogues sound more speech-originated than others, having an adequate amount of filler words. Among the four types of diarization errors (Figure 4), the ones caused by filler words (33%) can be simulated on dialogues that do not contain such errors using statistical heuristics. The errors associated with filler words are pseudo-inserted into dialogues from the two datasets by finding an utterance either beginning or ending with a filler word and concatenating it with the utterance before or after it. A global search is made over the entire dialogues to find such utterances and mimic the distributions in Table 2, such that about 40.4% of the dialogues in the pseudo-created data would contain two utterances with diarization errors, where 46.7% of them are caused by the filler word okay, and so on. It is possible that more than two utterances get joined; in our case, up to 8 utterances are concatenated. Table 3 includes the statistics of our pseudo-created dataset for transfer learning.
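The concatenation step described above can be illustrated as follows. This is a minimal sketch under stated assumptions, not the procedure used to build the dataset: the `FILLERS` set (beyond "okay", which appears in Table 2), the `inject_error` helper, and the rule that an utterance beginning with a filler is merged into the previous utterance while one ending with a filler is merged with the next are all hypothetical choices. Sampling to match the distributions in Table 2 is omitted.

```python
from typing import List, Tuple

Utterance = Tuple[str, List[str]]  # (speaker, tokens)

# Hypothetical filler list; only "okay" is named in the source.
FILLERS = {"okay", "um", "uh", "yeah"}


def inject_error(dialogue: List[Utterance], i: int) -> List[Utterance]:
    """Return a copy of `dialogue` where utterance i is concatenated with a
    neighbor if it begins or ends with a filler word (assumed merge rule)."""
    speaker, tokens = dialogue[i]
    words = [t.lower() for t in tokens if t.isalpha()]
    if not words:
        return list(dialogue)
    if words[0] in FILLERS and i > 0:
        # Begins with a filler: absorb it into the previous utterance.
        prev_speaker, prev_tokens = dialogue[i - 1]
        merged = (prev_speaker, prev_tokens + tokens)
        return dialogue[: i - 1] + [merged] + dialogue[i + 1:]
    if words[-1] in FILLERS and i + 1 < len(dialogue):
        # Ends with a filler: absorb the next utterance into it.
        _, next_tokens = dialogue[i + 1]
        merged = (speaker, tokens + next_tokens)
        return dialogue[:i] + [merged] + dialogue[i + 2:]
    return list(dialogue)


clean = [
    ("S1", ["I", "major", "in", "math", "."]),
    ("S2", ["Okay", "."]),
    ("S1", ["Why", "is", "education", "important", "?"]),
]
for speaker, tokens in inject_error(clean, 1):
    print(speaker, " ".join(tokens))
```

Applying the helper repeatedly to the same position would join more than two utterances, which corresponds to the longer concatenations mentioned above.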

Joint Model
The joint model consists of two parts. First, we establish a binary classification task that forces the model to learn to differentiate utterances that have diarization errors. The second part is a diarization model to tackle the problem specifically. The intention behind this design is that the binary classification task could enhance the embedding representation on a higher level to perform the diarization task better.
Figure 5 shows an overview of our speaker diarization model. Let $U_i = \{w_i^{\bullet}, w_{i1}, \ldots, w_{in}\}$ be the $i$'th utterance to be handled, where $w_i^{\bullet}$ is the special token representing $U_i$ and $w_{ij}$ is the $j$'th token in $U_i$. $U_i$ is fed into the encoder $E$ that generates the embeddings $\{e_i^{\bullet}, e_{i1}, \ldots, e_{in}\}$. The previous utterances $\{U_{i-k}, \ldots, U_{i-1}\}$ are also fed into $E$, which generates $\{e_{i-k}^{\bullet}, \ldots, e_{i-1}^{\bullet}\}$ (in our case, the context window is $k = 5$). These embeddings are fed into a transformer layer for utterance-level weighting, which creates the context embedding $e_c$. Finally, $e_c \oplus e_i^{\bullet}$ is fed into a softmax layer that outputs $o_u$ to make a binary decision of whether or not $U_i$ includes any error. Jointly, each $e_c \oplus e_{ij}$ is fed into another softmax that outputs $o_j$ to decide whether or not $w_{ij}$ is one of the last two tokens of an utterance.
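The two output heads described above can be written compactly as follows. This is a sketch under assumptions: the weight matrices $W_u$, $W_w$ and biases $b_u$, $b_w$ are notation introduced here for illustration (reading each "softmax layer" as a linear projection followed by a softmax), and the source does not state how the two objectives are weighted during training.

```latex
\begin{aligned}
o_u &= \mathrm{softmax}\big(W_u\,(e_c \oplus e_i^{\bullet}) + b_u\big) \in \mathbb{R}^{2}, \\
o_j &= \mathrm{softmax}\big(W_w\,(e_c \oplus e_{ij}) + b_w\big) \in \mathbb{R}^{2}, \qquad j = 1, \ldots, n .
\end{aligned}
```

Under this reading, $o_u$ is the utterance-level decision (does $U_i$ contain an error) and each $o_j$ is the token-level decision (is $w_{ij}$ one of the last two tokens of an utterance), with both heads sharing the same context embedding $e_c$.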

Figure 5. The overview of our diarization model.

Sliding Window
The sliding window technique aims to overcome the limitation of input length by splitting a long utterance into multiple sections. The mathematical formulations are described below.
Let $n = m + e$ be the max-number of tokens that $E$ and $D$ accept ($e < m < n$). Every utterance $U$ whose length is greater than $n$ is split into $U_1$ and $U_2$ as follows ($w_i$ is the $i$'th token in $U$):

For utterances whose lengths are less than or equal to $n$, zero-padding is used to transform $E$'s output from $\mathbb{R}^{n \times d}$ to $\mathbb{R}^{(n+m) \times d}$.
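One way to realize such a split is sketched below. The exact split is not reproduced in this text, so the sketch assumes one plausible reading consistent with the stated shapes: $U_1$ covers tokens $1..n$, $U_2$ covers tokens $m+1..n+m$, the $e = n - m$ overlapping positions are averaged, and shorter utterances are zero-padded to $(n+m) \times d$. The function `encode_long` and the toy encoder are hypothetical and stand in for the pretrained encoder $E$.

```python
from typing import Callable, List

Vector = List[float]


def encode_long(
    tokens: List[str],
    encode: Callable[[List[str]], List[Vector]],
    n: int,
    m: int,
    d: int,
) -> List[Vector]:
    """Return an (n + m) x d embedding matrix for an utterance of up to n + m tokens.

    Assumed scheme: two windows overlapping on n - m tokens, with the
    overlapping embeddings averaged and the result zero-padded.
    """
    if len(tokens) > n + m:
        raise ValueError("utterance longer than n + m tokens")
    if len(tokens) <= n:
        rows = encode(tokens)
    else:
        first = encode(tokens[:n])   # window 1: tokens 1..n
        second = encode(tokens[m:])  # window 2: tokens m+1..end
        rows = first[:m]
        for k in range(n - m):       # average the overlapping positions
            rows.append([(x + y) / 2 for x, y in zip(first[m + k], second[k])])
        rows.extend(second[n - m:])
    padding = [[0.0] * d for _ in range(n + m - len(rows))]
    return rows + padding


def toy_encoder(window: List[str]) -> List[Vector]:
    """Stand-in encoder: (token length, position within the window)."""
    return [[float(len(t)), float(i)] for i, t in enumerate(window)]


matrix = encode_long(["a", "bb", "ccc", "dddd", "ee", "f"], toy_encoder, n=4, m=3, d=2)
for row in matrix:
    print(row)
```

The position feature of the toy encoder differs between the two windows, so the averaged row for the overlapping token makes the merge step visible in the output.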

Context Attention
Let $U_i$ be the $i$'th utterance to be generated as output. Let $C \in \mathbb{R}^{\ell \times d}$ be the context matrix stacking the embedding matrices of the previous utterances $\{E_{i-k}, \ldots, E_{i-1}\}$, where $k$ is the number of previous utterances to be considered and $\ell = k(n + m)$. The transpose of $C$ is multiplied by the attention matrix $A \in \mathbb{R}^{\ell \times n}$ such that $C^T \cdot A \rightarrow S^T \in \mathbb{R}^{d \times n}$. Thus, $S \in \mathbb{R}^{n \times d}$ represents the context summary of $U_{i-k}, \ldots, U_{i-1}$, which is fed into the decoder $D$.
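The shape bookkeeping of this step can be checked with a small sketch. It only illustrates the product $C^T \cdot A$ on toy values; how $A$ is obtained is not shown here, and the helper names are introduced for the example.

```python
from typing import List

Matrix = List[List[float]]


def transpose(x: Matrix) -> Matrix:
    return [list(col) for col in zip(*x)]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    if len(x[0]) != len(y):
        raise ValueError("inner dimensions do not match")
    y_t = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in y_t] for row in x]


def context_summary(c: Matrix, a: Matrix) -> Matrix:
    """Compute S (n x d) from C (l x d) and A (l x n) via S^T = C^T . A."""
    s_t = matmul(transpose(c), a)  # d x n
    return transpose(s_t)          # n x d


# Toy sizes: l = 2 context rows, d = 2 embedding dims, n = 3 output rows.
C = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
S = context_summary(C, A)
print(len(S), len(S[0]))
```

Whatever the length $\ell$ of the stacked context, the summary $S$ always has $n$ rows, which is what lets the decoder consume a fixed-size input.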

Topic Storing
Even with the context attention, the model still has no memory of contexts prior to $U_{i-k}$, leading it to repeat the same topics that it has already initiated. To overcome this issue, topic storage is introduced to remember key topics derived by the interviewer. Every interview in our data came with 8-16 questions by the interviewer, annotated after each interview by the data provider, who used those questions during the interview and thought they led to assessing crucial aspects of the interviewee. Our final model considers these questions the "key topics" and dynamically stores them as the dialogue progresses. During training, these questions are converted into embeddings and stored dynamically as a list of topics discussed in previous turns. During decoding, the model generates such topical questions with a specific flag and stores them in the same way. Let $Q = \{q_1, \ldots, q_h\}$ be the topical question set. During training, $D$ learns to generate Q instead of S1 as the first token of the interviewer's utterance that contains any $q_i \in Q$. In addition, it generates B/E if the interviewer begins/ends the current dialogue with that utterance (Table 7). Any utterance starting with Q is encoded by $E$ and feed-forward layers that create an abstract utterance embedding $v_i \in \mathbb{R}^{1 \times d}$ to represent topics. These embeddings get stacked as the interview goes on to create the topic matrix $V \in \mathbb{R}^{h \times d}$. If $|Q| < h$, then zero-padding is used to create $V$ (in our case, $h = 16$). Finally, $V$ is stacked with the context matrix $C$ (Sec. 3.3.2), and $(V \oplus C)^T \in \mathbb{R}^{d \times (h+\ell)}$ is multiplied by the attention matrix $A \in \mathbb{R}^{(h+\ell) \times n}$ to create the transpose of the context summary matrix $S \in \mathbb{R}^{n \times d}$.
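The bookkeeping behind the topic matrix can be sketched as below. The class `TopicStore` and the function `stack_with_context` are hypothetical names for this illustration; the topic embeddings are toy vectors rather than outputs of $E$ and the feed-forward layers, and raising an error when more than $h$ topics arrive is an assumption, since that case is not described.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


class TopicStore:
    """Keeps up to h topic embeddings and exposes them as a zero-padded h x d matrix."""

    def __init__(self, h: int = 16, d: int = 2) -> None:
        self.h = h
        self.d = d
        self.topics: Matrix = []

    def add(self, v: Vector) -> None:
        if len(v) != self.d:
            raise ValueError("topic embedding has the wrong dimension")
        if len(self.topics) >= self.h:
            # Assumption: overflow handling is not specified in the source.
            raise ValueError("topic store is full")
        self.topics.append(list(v))

    def matrix(self) -> Matrix:
        padding = [[0.0] * self.d for _ in range(self.h - len(self.topics))]
        return [list(v) for v in self.topics] + padding


def stack_with_context(v: Matrix, c: Matrix) -> Matrix:
    """Stack the topic matrix V (h x d) on the context matrix C (l x d)."""
    return v + c


store = TopicStore(h=16, d=2)
store.add([1.0, 2.0])   # embedding of a first topical question (toy values)
store.add([0.5, -1.0])  # embedding of a second topical question (toy values)
context = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
stacked = stack_with_context(store.matrix(), context)
print(len(stacked), len(stacked[0]))
```

Because $V$ always has $h$ rows, the stacked matrix has a fixed $h + \ell$ rows, so the attention matrix $A$ keeps the same shape throughout the interview.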

Speaker Diarization Results
Table 3 shows the distributions of the pseudo-created data (Section 3.2.2), as well as our interview data (Section 3.1) before and after the diarization, where errors in the train/dev/test sets are manually annotated (Section 3.2.1) and errors in the raw set are automatically corrected by the joint model (Section 3.2.3). For the encoder, the RoBERTa large model is used [40]. After diarization, S2's utterances with diarization errors get split such that the average length of S2's utterances decreases while the average length of dialogues slightly increases. Meanwhile, some parts of S2's utterances, incorrectly separated from S1's utterances by the transcriber, are recovered back to S1; thus, the average length of S1's utterances increases.
Table 4 shows results of three models: the baseline model taking $U_i$ and producing $O_w = \{o_1, \ldots, o_n\}$, the context model taking $U_c = \{U_{i-k}, \ldots, U_i\}$ and producing $O_w$, as well as the joint model taking $U_c$ and producing $O_u$ and $O_w$ (Figure 5). The baseline model does not create $e_c$, so $e_{i*}$ are directly fed to Softmax 2. Also, the baseline and context models do not use $e_i^{\bullet}$, so only Softmax 2 is used to produce the outputs. For evaluation, the F1-scores of the label 1 on the last two tokens are used.
When trained on only our data, all models perform similarly. The joint model slightly outperforms the others when transfer learning is applied. Although the improvement is marginal, the joint model has the benefit of identifying utterances with diarization errors, showing an F1 score of 93.6% for this task, while the transferred models generally show much higher performance on the other datasets than the non-transferred models. Thus, the joint transferred model is used to auto-correct all dialogues in RAW.
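For reference, the F1-score of label 1 can be computed from token-level gold and predicted labels as in the following sketch. The function name and the toy label sequences are introduced here for illustration only.

```python
from typing import Sequence


def f1_positive(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1-score of label 1 over aligned token-level labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = [0, 0, 1, 1, 0, 1, 1]
pred = [0, 0, 1, 1, 0, 0, 1]
print(round(f1_positive(gold, pred), 3))
```

Scoring only label 1 keeps the measure focused on the rare boundary tokens rather than on the far more frequent 0 labels.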

Dialogue Generation Results
For our experiments, the encoder and the decoder of BlenderBot 1.0 [3] are used on the diarized data from the diarization model. Three models are developed: the Blenderbot baseline (BB), Blenderbot with sliding window (SW), and Blenderbot with sliding window and concatenation of topic storing (CT). All models are first trained on RAW and finetuned on TRN (Table 1). We followed the training parameter setups in the original Blenderbot paper. To assess real-life performance, ten interviews are conducted per model, where each interview consists of exactly 30 turns. Qualitative analysis is performed on the top-3 most frequently occurring errors as follows:
• Repetitions (R): how often it repeats topics already covered in the previous utterances.
• Early Ending (EE): ending the interview without covering a sufficient amount of topics.
• Off Topic (OT): how often it makes utterances that are not relevant to the current topic.
Table 5 shows the error analysis results. The repetition rates are significantly reduced as the model gets more advanced. Compared to the baseline, the CT model conducts 3.5 times longer conversations before it attempts to end the interview while generating half as many off-topic utterances, which is very promising. Examples of these error types are provided in Appendix 5.3.
Table 5. The error analysis of all generation models. R: avg-% of repeated topics, EE: avg-% of the interview conducted before the model attempts to end (higher is better), OT: avg-% of off-topic utterances.

Static Evaluation
Following previous work [43], static evaluation is performed on the CT model, where the input is every batch of k-utterances and prior topics per interview, and its output is compared to the corresponding human response in TST (Table 1). The average BLEU score is 0.08 and the cosine similarity is 0.19, both of which are low. However, such static evaluation assesses each output independently and disrupts dialogue fluency by artificially inserting human utterances into the model's input; thus, it does not reveal the model's capability in conducting long contextualized interviews.
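To make the similarity measure concrete, the sketch below computes cosine similarity between a generated and a reference utterance using bag-of-words counts. This representation is an assumption made for a self-contained example; the vector representation actually used for the reported score is not specified here, and the two sentences are invented.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two utterances over bag-of-words counts.

    Assumption: whitespace tokenization and raw counts; the source does not
    state which vector representation was used.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * vb[word] for word, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


generated = "tell me about your leadership style"
reference = "what do you like to do in your free time"
print(round(cosine_similarity(generated, reference), 2))
```

Two questions can both be appropriate at the same point of an interview while sharing few words, which is one reason such overlap-based scores tend to be low for open-ended dialogue.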

Real-time Evaluation
The CT model is deployed to an online text-based platform in a public cloud. For real-time evaluation, five professional interviewers and ten students are invited to have conversations with our InterviewBot and give ratings from 1 to 5 to indicate their overall satisfaction. The average dialogue duration is 256 seconds. Almost half of the evaluators are satisfied (Scores 4 and 5) and another 40% indicate a positive attitude on the coverage of topics and discussions (Score 3), implying that it performs reasonably well in this realistic setting (Table 6). Overall, with an average score of 3.5, the InterviewBot has shown great potential for practical applications. Table 7 presents an example dialogue conducted by our chatbot, showcasing the utilization of sliding window and topic storing (CT) techniques. Overall, the chatbot demonstrates its ability to conduct a comprehensive interview by asking relevant follow-up questions, adapting to various conversation topics, and providing meaningful responses accordingly.

Examples of Diarization Errors
The following are examples to illustrate the sources of diarization errors (in brackets). In many cases, interviewers and interviewees overlap in speech or think out loud with or without filler words, which concatenates the two utterances. A small portion of diarization errors comes from speech recognition and word repetition errors.

• Filler Words
Oh, no, I'm going to make majoring mathematics. [Okay. Okay.] Now why, why do you think receiving an education is important?
We analyze speaker diarization errors by annotating 100 conversations with an average of 39 turns. The types of errors with their statistics are shown in Table 8. The major errors are caused by filler words and arbitrary concatenation. Errors are counted individually. Table 9 lists examples for each type of error, with the erroneous span in brackets:
Hunger-Free language school.
WR: I heard it said, so it's kind of like a DIY [community community].
FW: Oh, no, I'm going to make majoring mathematics. [Okay. Okay.] Now why, why do you think receiving an education is important?
From the table, there are different occasions for ASR errors, such as words not transcribed, words transcribed incorrectly, etc. Word repetitions happen rarely when non-filler words repeat in a sentence.Concatenations of two adjacent utterances take the largest portion of all errors.

Examples of Generation Limitations
The following are examples for each type of generation error. An early ending prevents the InterviewBot from covering critical aspects of applicants and further discussion. The InterviewBot cannot name mentions properly, either skipping over or generating the wrong entities in the conversations. Random Generation generates random words and punctuation and concatenates them incorrectly. Not following up on previous topics hinders the InterviewBot from becoming proactively engaged with interviewees.
• Repetition
InterviewBot: Now why, why do you think receiving an education is important?
... more turns ...
InterviewBot: Why do you think people value education so much?

Conclusion
Our InterviewBot is a model-based dialogue system equipped with contextual awareness and topic sensitivity that conducts college admission interviews. Questions covering diverse topics and extended follow-up discussions are carried through the conversations, which have been assessed by professional interviewers and student volunteers. The average satisfaction score of 3.5 suggests that the InterviewBot could be deployed for thousands of college applicants, especially international students. Despite these promising applications, the current version of the InterviewBot has two major limitations. First, the early ending in Table 5 still happens, where an ending utterance gets generated after an insufficient number of turns, in which case the interview may not cover critical aspects of the applicant. Second, the bot makes good follow-ups to various topics; however, it needs to derive deeper discussions with more details.
In future work, the main focus is to enrich the follow-up discussions on topics or sub-topics during interactions by training the InterviewBot with more structured topic-flow materials.This task would indirectly alleviate the early ending issue by deepening the discussions on certain topics.

Figure 6 depicts an overview of our dialogue generation model. Since inputs to the encoder E and the decoder D are limited by the total number of tokens that the pretrained language model accepts, sliding window (Sec. 3.3.1) and context attention (Sec. 3.3.2) are proposed to handle long utterances and contexts in the previous utterances, respectively. In addition, topic storing is used to remember user-oriented topics brought up during the interview (Sec. 3.3.3). The input to E and the output of D include, as the first token, the speaker ID S1 or S2, or a special token indicating the beginning B, the ending E, or a topical question Q, followed by an utterance from the interviewer or interviewee, respectively. Hyperparameters are finetuned by cross-validation.

Figure 6. The overview of our dialogue generation model.

• BB: Blenderbot Baseline Model
• SW: Blenderbot with Sliding Window
• CT: Blenderbot with Sliding Window and Concatenation of Topic Storing

Table 2. Distributions of filler words w.r.t. diarization errors. Dist: percentage of dialogues containing # utterances with errors caused by the filler words. filler_word: percentage of the filler word appearing in the corresponding dialogue group.

Table 3. Distributions of the pseudo-created datasets (Switchboard, BST) and our interview data (before and after diarization). D: number of dialogues, U: avg-number of utterances, S1/S2: avg-number of tokens per utterance by S1/S2. TRN/DEV/TST: training/development/evaluation (annotated) sets. RAW: unannotated set. Note that we follow the same splits suggested by the original papers of the Switchboard and BST datasets for comparability.

Table 4. Diarization model performance. Ours: trained on TRN of our Interview data (After) in Table 3. Transferred: trained first on the TRN mixture of Switchboard and BST, then finetuned on TRN of our data.

Table 7. An interview dialogue conducted by our best model (CT in Section 4). S1/S2: interviewer/interviewee (chatbot/human), B/E: beginning/ending utterance (chatbot), Q: topical question (chatbot).
Hello, and thank you so much for coming today. uh, Let us start by having you just to state your full legal name and also the school that you are currently attending?
S2: Sure, my name is David, and I go to Shanghai high school in China.
S2: Sure, when I was a junior, I was elected to be the Vice President in the Student Union and I organized many activities to help students to enrich their life on campus.
Okay, great. well, thank you very much for sharing that with me. um, and thank you for taking the time out of your day to interview with us.
S1: Okay, great. Thank you so much for sharing that. so, um, tell me a little bit about your leadership style. What do you think you bring to the student union?
S2: Sure, I am very open-minded, and with some of my colleagues, we together established a lot of new programs on top of our existing ones. Many of those programs became very popular.

Table 8. Speaker diarization error analysis in the text transcription from interview recordings. Notations: Misc: Miscellaneous; ASR: Automatic Speech Recognition; WR: Word Repetition; FW: Filler Words; AC: Adjacent Concatenation.

Table 9. Error type examples.