Do You Ever Get Off Track in a Conversation? The Conversational System’s Anatomy and Evaluation Metrics
Abstract
1. Introduction
- Interaction mode: The dialog system could receive input from the user in the form of text or voice.
- Knowledge domain: The system may be modeled to provide answers about a particular topic (closed domain) or a range of topics (open domain).
- Search-based: Search-based systems assist the user in fulfilling their information needs.
- Goals: The system may be designed for task completion (task-oriented), or carrying on conversations (non-task oriented).
- Design approach: The response may be selected by following a pre-defined rule (rule-based), retrieving pre-defined responses (retrieval-based), or generating new responses entirely (generative models).
2. Motivation
- What are the evaluation methods available for Dialog systems based on the structure of the dialog?
- What is the requirement of an automated evaluation method for testing the usability of dialog systems?
3. Structural Flow of the Article
4. Components of Dialog Systems
4.1. Conversational Interface
4.2. Natural Language Understanding
1. Intents: Intent classification is an important component of natural language understanding (NLU) which classifies and labels the input given by the user to assign it a specific goal [37]. It enables the dialog system to understand user requests. For example, the user may input a question such as “What time does the store open?”. The intent for this particular request is most likely “Opening Times”, which can then allow the system to answer with the opening hours of the store. An intent represents a mapping between what a user says and what action should be taken by the software.
2. Named entity recognition: Named entity recognition (NER) is the task of identifying named entities in text [38]. Entities are pre-defined categories such as names, locations, quantities, expressions of time, etc. Entity extraction enables the system to pull structured information out of the text and helps in organising it. For example, the name of a store, the location of a venue, and the fee of a particular service could all be considered entities. An entity represents concepts that are often specific to a domain, providing a way of mapping natural language phrases to canonical phrases that capture their meaning.
3. Pattern Matching: Pattern matching matches the input obtained from the user against the database in order to obtain an appropriate response [39]. Lee et al. [40] list the algorithms available for pattern matching, such as fuzzy string matching, regular expressions [41], rule-based matching, token-based matching, etc.
4. Parsing: Text parsing is the process of determining the syntactic structure of a text. It separates a given corpus into smaller components based on a set of rules [42]. Parsing algorithms analyse the text according to predefined rules, using strategies such as left-to-right and bottom-up parsing, and learn to recognize strings and assign syntactic structures to them [43]. Figure 5 shows an example of parsing in action: the syntactic structure of the sentence ’The message is the answer’ is obtained by dividing its constituent words according to their grammatical function.
5. TF-IDF: The term frequency-inverse document frequency (TF-IDF) weight measures the importance of a word to the document it belongs to within a corpus. The user input is converted into a vector and compared with the document vectors. The system returns the highest-priority results by ranking candidate responses according to the cosine similarity of their TF-IDF vectors. Cosine similarity measures the content-based similarity between two vectors, which here represent the candidate text and the reference text in the vector space. TF-IDF models can also serve as similarity metrics for the evaluation of dialogue [44,45].
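As a concrete illustration of the TF-IDF retrieval idea in item 5, the following is a minimal sketch, assuming scikit-learn is available; the candidate documents and query are invented for illustration. It vectorises candidate responses and ranks them by cosine similarity against the user input.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical knowledge-base entries the chatbot can answer from.
documents = [
    "The store opens at 9 am and closes at 6 pm.",
    "Delivery usually takes three to five working days.",
    "Refunds are processed within two weeks of the return.",
]
query = "What time does the store open?"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)   # fit TF-IDF on the corpus
query_vector = vectorizer.transform([query])        # reuse the same vocabulary

# Rank documents by cosine similarity to the query and return the best match.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
best = scores.argmax()
print(f"Best response (score {scores[best]:.2f}): {documents[best]}")
```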
4.3. Dialog Manager
- Dialogue modeling: It keeps track of the state of the dialogue.
- Dialogue control: It makes decisions about the next step to be taken by the system.
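A minimal sketch of these two functions, with the dialog state as a slot dictionary and a hand-written control policy; all slot names and rules are invented for illustration and are not a specific toolkit's API.

```python
# Dialogue modeling: the tracker holds the current state of the conversation.
state = {"intent": None, "location": None, "time": None}

def update_state(nlu_output):
    """Merge the intent and entities recognised by the NLU into the state."""
    state["intent"] = nlu_output.get("intent", state["intent"])
    state.update(nlu_output.get("entities", {}))

# Dialogue control: decide the next system action from the current state.
def next_action(state):
    if state["intent"] == "opening_times" and state["location"] is None:
        return "ask_location"            # a required slot is missing, ask for it
    if state["intent"] == "opening_times":
        return "inform_opening_times"    # all slots filled, answer the query
    return "ask_rephrase"                # fall back when the intent is unknown

update_state({"intent": "opening_times", "entities": {"location": "Dublin"}})
print(next_action(state))  # -> inform_opening_times
```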
4.4. Domain Specific Component
4.5. Response Generator
4.5.1. Rule-Based Systems
4.5.2. Corpus-Based Systems
5. Types of Dialog Systems
5.1. Task-Oriented Dialog Systems
Dialog Structure
5.2. Conversational Dialog Systems
Modelling Conversational Dialog Systems
- Generative models: Generative models are an unsupervised approach to generating responses for the dialog system. The response generation step provides the output to the user: natural language generation (NLG) transforms structured data into natural language. In order to successfully generate appropriate responses to the user query, it is important that the user intent and the context of the conversation are understood comprehensively. Generative models make use of deep neural networks. The dialog is generated by training the model on a large corpus of dialogues, and the most appropriate response to a given user utterance is returned. The experiment by Serban et al. [79] to create a generative model involved two architectures: recurrent neural networks (RNNs) [80] and the hierarchical recurrent encoder-decoder (HRED) architecture. Vinyals et al. [81] and Serban et al. [82] demonstrate the application of the LSTM and HRED architectures, respectively, to dialog response generation. Language models are probabilistic or statistical models that determine the probability of the occurrence of a word in a given corpus. Contrary to rule-based algorithms, language models attempt to understand the contextual relationships between different words in a sentence, so the responses they generate are more relevant in the given context. One of the most widely used language models is Bidirectional Encoder Representations from Transformers (BERT). Devlin et al. [83] introduce BERT, a large transformer-based masked language model. A masked language model (MLM) is trained by masking out words and having the model fill in the correct word (a short sketch of this follows the list below). Masked language models are useful because they produce contextual word embeddings, which provide different word representations for different contextual meanings of a word. The BERT architecture uses a stack of either 12 (BASE) or 24 (LARGE) encoders. It can be used as a general-purpose pre-trained model that is fine-tuned for specific tasks. BERT is a transferable model, so it can be used as input to smaller, task-specific models, and with successful fine-tuning it can achieve very high accuracy. Pre-trained BERT models are available in over 100 languages. There are several extensions of the BERT architecture, such as RoBERTa [84], DistilBERT [85], ALBERT [86], and more, as well as BERT models in different languages, such as CamemBERT (French) [87], AraBERT (Arabic) [88], and mBERT (multilingual) [89]. Its drawbacks are that the model is very large, slow to train, computationally expensive, and needs to be fine-tuned for downstream tasks.
- Utterance Selection: In utterance selection, the dialog system is modeled as an information retrieval task. Candidate responses are ranked according to their relevance, and the most appropriate response for a given utterance is retrieved from the database. Figure 6 shows the extraction of values from an utterance. A probability can be calculated which then ranks the candidate utterances according to their relevance. The utterances in a dialog database help define the dialog structure [90], as the system learns to map semantically relevant responses to the user utterances. An utterance selection model, as defined by Baxter et al. [91], is a set of rules that helps in filling slots for response generation. The similarity between the dialog history and candidate utterances can be measured with a similarity measure. Surface form similarity measures similarity at the token level; examples of such measures include METEOR [92] and the Term Frequency-Inverse Document Frequency (TF-IDF) models discussed in Section 4.2 [44,45]. Duplessis et al. [93] propose the recurrent surface text pattern approach, which involves a database of recurrent surface text patterns and utterance retrieval from that database through a generalised vector-space model.
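To illustrate the masked language modelling idea described above, here is a minimal sketch using the Hugging Face transformers library with a pre-trained BERT checkpoint; the model name and example sentence are chosen for illustration, the library must be installed, and the model is downloaded on first use.

```python
from transformers import pipeline

# Load a pre-trained BERT masked language model.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Ask the model to fill in the masked token from its bidirectional context.
predictions = unmasker("The store [MASK] at nine in the morning.")

# Print the three most likely fillers and their scores.
for p in predictions[:3]:
    print(f"{p['token_str']:>10}  score={p['score']:.3f}")
```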
5.3. Question Answering Dialog Systems
Evaluation
6. Classification Based on Logic
6.1. Rule-Based Systems
6.2. Intelligent Systems
6.2.1. Deep Learning and Machine Learning Based Systems
6.2.2. Prototypes
1. Dialogflow: Dialogflow [114] provides a conversational experience powered by artificial intelligence. Singh et al. [115] discuss the advantages of a conversation management framework such as Dialogflow in certain use cases. It can receive input in both text and voice form, and can be integrated across virtually any platform, such as speakers, home applications, wearable sensors, etc. The following components of Dialogflow are discussed in [116]: intent matching, entity extraction, and dialog control. The intent matching step recognizes what the user wants; an intent is created for anything the user might request, for example checking the price of a product, booking an appointment, or checking opening hours. A typical Dialogflow agent, which represents a single conversational experience, might have a few to a thousand intents, each trained to recognize a specific user need. Entity extraction extracts relevant information from the user query, and Dialogflow can extract information using system entities. For example, in Figure 8, the intent is identified as ’Opening_time’ and the ’time’ entity is extracted. Dialog control shapes the flow of the conversation: subsequent dialogues are interpreted in the context of the previous input.
2. Alexa: Amazon Alexa [18] is a virtual assistant incorporated with the Internet of Things (IoT). It can respond to a wide range of natural language queries. One of its many outstanding features is the developer tool: every query is parsed into a data structure and made available on AWS, where developers can write their own custom code around it. Lopatovska et al. [118] explore user interactions with Alexa. Analysis of the collected data suggested that Alexa serves well as a virtual assistant for actions such as playing music, checking weather conditions, etc.
3. RASA: Bocklisch et al. [112] introduce Rasa, an open source machine learning framework for building contextual AI assistants and conversational agents. The framework is transparent, meaning we can observe exactly what is happening under the hood and customize it precisely. It is a state-of-the-art, time-efficient tool for building complex dialog systems quickly. In RASA, an action is an operation that can be performed by the bot: replying with a message, querying a database, or any other task achievable in code. An action could be a hard-coded reply or an API-generated response. Stories are sample interactions between the user and the bot, defined in terms of the intents captured and the actions performed. Harms et al. [119] discuss the use of RASA for dialog management. RASA is made up of two components: natural language understanding (NLU) and dialog management. RASA NLU is an open-source natural language processing tool for intent classification and entity extraction; it helps the bot understand the user. Singh et al. [115] explain the implementation of RASA Core, the framework for machine learning-based contextual decision making, which learns by observing patterns in conversational data. The RASA architecture follows the steps below (a schematic sketch of this flow appears after the list):
(a) The message is received and passed to an interpreter, which converts it into a dictionary including the original text, the intent, and any entities that were found. This part is handled by the NLU.
(b) The Tracker is the object that keeps track of the conversation state. It receives the information that a new message has come in.
(c) The policy receives the current state of the tracker.
(d) The policy chooses which action to take next.
(e) The chosen action is logged by the tracker and a response is sent to the user.
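The following is a schematic Python sketch of the message-handling loop described in steps (a)-(e); it is a simplified illustration of that flow, not Rasa's actual internal API, and the interpreter and policy are stubbed out with invented intents and actions.

```python
class Tracker:
    """Keeps track of the conversation state (step b)."""
    def __init__(self):
        self.events = []

    def update(self, event):
        self.events.append(event)

    def current_state(self):
        return {"events": list(self.events)}

def interpreter(message):
    """NLU step (a): return the original text, the intent, and any entities found."""
    return {"text": message, "intent": "ask_opening_hours", "entities": {}}

def policy(state):
    """Steps (c)-(d): choose the next action given the tracker state."""
    last = state["events"][-1]
    if last.get("intent") == "ask_opening_hours":
        return "utter_opening_hours"
    return "utter_default"

def handle_message(message, tracker):
    parsed = interpreter(message)              # (a) parse the incoming message
    tracker.update(parsed)                     # (b) record that a message came in
    action = policy(tracker.current_state())   # (c)-(d) pick the next action
    tracker.update({"action": action})         # (e) log the chosen action
    return action                              # ... and send the response

tracker = Tracker()
print(handle_message("What time does the store open?", tracker))
```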
6.2.3. Reinforcement Learning Based System
7. Evaluation
7.1. Empirical Evaluation Metrics
7.1.1. Embedding-Based Metrics
- Greedy Matching: The tokens in two sequences are greedily matched based on the cosine similarity of their word embeddings [143]. The total score is then calculated by averaging over all words. The greedy matching approach generally favors responses with key words that are semantically similar to those in the ground truth response [144]. Rus and Lintean [145] compare greedy and optimal matching methods for two intelligent tutoring systems. The greedy method does not obtain the global maximum similarity score between the candidate and ground truth responses. A sketch of both embedding-based metrics follows this list.
- Embedding Average: The embedding average metric calculates sentence-level embeddings using additive composition, which computes the meaning of a phrase by averaging the vector representations of its constituent words [146,147,148]. The embedding average is defined as the mean of the word embeddings of each token in a sentence. The cosine similarity between the respective sentence-level embeddings is computed to compare a ground truth response and a retrieved response [25].
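A minimal sketch of both metrics follows; the word vectors and sentences are toy values invented for illustration, whereas a real system would use pre-trained embeddings such as word2vec or GloVe. Note that the full greedy matching metric is usually computed in both directions and averaged; only one direction is shown here.

```python
import numpy as np

# Toy word vectors standing in for pre-trained embeddings (assumption).
vectors = {
    "store": np.array([0.8, 0.1, 0.3]), "shop": np.array([0.7, 0.2, 0.3]),
    "opens": np.array([0.1, 0.9, 0.2]), "open": np.array([0.2, 0.8, 0.1]),
    "at":    np.array([0.1, 0.1, 0.1]), "nine": np.array([0.3, 0.2, 0.9]),
    "9":     np.array([0.3, 0.1, 0.8]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embedding_average(tokens):
    # Mean of the word embeddings of each in-vocabulary token in the sentence.
    return np.mean([vectors[t] for t in tokens if t in vectors], axis=0)

def greedy_match(candidate, reference):
    # Match each candidate token to its most similar reference token,
    # then average these maximal similarities.
    scores = [max(cosine(vectors[c], vectors[r]) for r in reference if r in vectors)
              for c in candidate if c in vectors]
    return sum(scores) / len(scores)

reference = ["the", "store", "opens", "at", "nine"]
candidate = ["the", "shop", "open", "at", "9"]

print("Embedding average similarity:",
      cosine(embedding_average(candidate), embedding_average(reference)))
print("Greedy matching score:", greedy_match(candidate, reference))
```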
7.1.2. BLEU
7.1.3. ROUGE
7.1.4. METEOR
7.1.5. Perplexity
7.2. User Evaluation Methods
7.2.1. Task-Oriented Dialog Systems
- User Satisfaction Modeling: The user satisfaction for any dialog system is a good indicator of its usability. User satisfaction modeling must satisfy three requirements [12]. The first is explainability, which quantifies the impact that different properties of the dialog system have on user satisfaction. The second is automation: the evaluation process has to be automated based on the properties of the dialog system. The third is differentiability: the models should make it possible to evaluate and compare different dialog strategies. Two factors need to be considered while modeling user satisfaction: the agent evaluating the dialog system and the granularity at which the system is being evaluated. The dialog could be evaluated by the user or by objective judges. The granularity of the evaluation lies between two extremes: it could take place at the dialog level or at the exchange level. Engelbrecht et al. [159] present a method to model user satisfaction with the use of hidden Markov models. The mean squared error in predicting the most probable state at each turn was calculated for the optimized model. This approach provided a way to analyze the models and features that affect the quality ratings, making it comparable to empirical ratings.
- User Simulation: User simulators are tools designed to simulate user behaviors. Georgila et al. [160] make use of n-gram user simulation models to evaluate spoken dialog systems. Schatzmann et al. [161] discuss the development of user simulation techniques with reinforcement learning. Two evaluation strategies can be deployed for user simulators: direct and indirect. Direct evaluation is performed on the basis of various metrics, such as precision and recall on the dialog acts, perplexity, etc. Indirect evaluation attempts to evaluate the trained dialog manager. It measures the utility of the user simulation. Kreyssig et al. [162] propose the neural user simulator (NUS), which is a neural network based evaluation approach.
- User Experience (UX): An important exploration was done by Holmes et al. [163] to test the applicability of conventional usability assessment methods to conversational user interfaces. The paper raises important questions, such as how well the results of such methods correlate with conversational agent usability. The usability of the WeightMentor app, studied by Holmes et al. [128], is measured using three metrics. The System Usability Scale (SUS) developed by Brooke [164] is one of the most popular and commonly used means of assessing usability. There are a total of 10 questions, 5 covering positive aspects and 5 covering negative aspects of the dialog system. Each question is scored out of five, and final scores are then calculated out of 100. The mean WeightMentor score places the system in the 96th-100th percentile. The second metric is the User Experience Questionnaire (UEQ) by Schrepp et al. [165], which thoroughly assesses the UX. The UEQ indicates to what extent the system meets user expectations and how it compares against other dialog systems; WeightMentor performed well on all UEQ scales. The final metric is the Chatbot Usability Questionnaire (CUQ). Participants were given 16 items relating to the positive and negative aspects of the system, each scored out of 5 on a scale from “Strongly Agree” to “Strongly Disagree”. The CUQ test yielded a mean score of 76.20 out of a maximum of 100. Sharma et al. [166] introduce Atreya Bot, which is developed to help chemistry students and researchers perform drug-related queries on the ChEMBL database. The study aims to simplify the process of performing a successful search, outlining the challenges present in fulfilling a query. User frustration and mental workload are relatively low when searching with a conversational agent compared to a traditional search engine.
- PARADISE Framework: The Paradigm for Dialog System Evaluation (PARADISE), proposed by Walker et al. [29], is one of the best-known evaluation frameworks for task-oriented systems. The PARADISE framework is based on user ratings at the dialog level and also allows for the evaluation of sub-dialogues. Figure 11 shows the structure of objectives in the evaluation of a dialog system according to the PARADISE framework. The utterances are compared with a reference answer to perform automatic evaluation. This method has certain limitations: the evaluation process cannot discriminate between different strategies, the approach does not always generalize well, and dialog performance cannot be attributed to system-specific properties [167]. The user interacts with the dialog system and then completes a questionnaire [168]. Responses in the questionnaire are used to compute a user satisfaction score, which serves as the target variable for a linear regression model. The regression model is trained with variables derived from the logged conversations as input and can then be used to predict user satisfaction for new inputs. Variables such as task success can be extracted automatically, whereas variables such as inappropriate repair utterances need to be annotated manually by experts. The approach also performs well with differing user populations and is good at making predictions for new systems.
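To make the regression step concrete, here is a minimal sketch assuming scikit-learn; the feature names and values are invented for illustration and are not data from the original PARADISE study. It fits questionnaire-derived satisfaction scores from logged interaction variables and predicts satisfaction for a new dialog.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical logged interaction features per dialog:
# [task_success (0/1), number_of_turns, elapsed_time_s, repair_utterances]
X = np.array([
    [1, 8,  62,  0],
    [1, 12, 95,  1],
    [0, 20, 180, 4],
    [1, 6,  40,  0],
    [0, 15, 130, 3],
])
# Questionnaire-derived user satisfaction scores (target variable).
y = np.array([4.5, 4.0, 2.0, 4.8, 2.5])

model = LinearRegression().fit(X, y)

# Predict satisfaction for a new, unseen dialog.
new_dialog = np.array([[1, 10, 75, 1]])
print("Predicted user satisfaction:", model.predict(new_dialog)[0])
print("Feature weights:", model.coef_)
```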
7.2.2. Evaluation of Conversational Dialogue Systems
- Appropriateness: Appropriateness is a coarse-grained concept of dialog evaluation; many fine-grained concepts, such as coherence, relevance, and correctness, are encapsulated within it. The main approach employed relies on word-overlap metrics, originally used in machine translation and summarization and discussed in Section 7.1. Word-overlap based scores such as BLEU and ROUGE can serve as an approximation of the appropriateness of an utterance (a small BLEU computation sketch follows this list). One drawback of these metrics is that they show little to no correlation with human judgements. Galley et al. [140] propose ΔBLEU, which incorporates human judgements into the BLEU score: the reference responses in the test set are rated by human judges for relevance to the context. In model-based evaluation approaches, such as ADEM [129], user behavior is modeled. A broad array of behavioral aspects needs to be considered for the model to prove effective: the model should explain the impact of using various dialog strategies and capture the different types of users and the typical errors they make while using the system.
- Human Likeness: The quality of a conversational agent can be measured by the Turing Test [76]. A conversational system is said to pass the Turing Test if it convinces a human judge that it, too, is human. The generative adversarial approach proposed by Xu et al. [169] can be used to evaluate dialog systems. The evaluation framework, based on Generative Adversarial Networks (GANs), is made up of a generator for generating data and a discriminator to distinguish between real data and artificially generated data. The naturalness of a generated dialog response can be estimated directly from the adversarial loss. As explained by Kannan and Vinyals [130], the encoder-decoder architecture employing recurrent neural networks has proven particularly helpful in dialog generation: the encoder takes the user query as input, and the decoder generates a response based on the final state of the first network.
- Fine-grained Metrics: Topic-based evaluation measures the ability of the conversational agent to coherently talk about different topics. Two dimensions are considered: topic breadth and topic depth [170]. Topic breadth measures the ability of the system to learn about a large variety of topics, whereas topic depth measures whether the system can sustain a long and cohesive conversation about one topic. A deep averaging network (DAN) can be trained to perform topic classification and detection of topic-specific keywords, using a large amount of conversational data and questions. Multiple utterances could be considered acceptable for a given context [171]. If at least one annotator marks the response as appropriate, the response is considered appropriate by the weak agreement metric [172]. Weak agreement has certain limitations: it relies heavily on human annotations and is not applicable to large amounts of data. Voted appropriateness takes into account the number of votes an utterance received for a given context, weighting each utterance uniquely and thus overcoming the limitations of weak agreement [172]. Voted appropriateness also has a higher correlation with human judgement than weak agreement.
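To make the word-overlap approximation of appropriateness mentioned above concrete, here is a minimal sketch computing a sentence-level BLEU score with NLTK; the reference and candidate responses are invented, and smoothing is applied because sentence-level BLEU on short responses otherwise often collapses to zero.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Ground-truth response from the test set and a generated candidate response.
reference = ["the", "store", "opens", "at", "nine", "in", "the", "morning"]
candidate = ["the", "shop", "opens", "at", "nine", "am"]

# Smoothing avoids zero scores when higher-order n-grams do not overlap.
smoothing = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smoothing)
print(f"BLEU: {score:.3f}")
```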
8. Evaluation Datasets and Challenges
8.1. Datasets for Task-Oriented Systems
8.2. Data for Question Answering Dialog Systems
8.3. Data for Conversational Dialog Systems
8.4. Evaluation Challenge
9. Discussion
- What are the evaluation methods available for Dialog systems based on the structure of the dialog? To answer this question, we performed a thorough search of evaluation methods and categorized them into empirical and user-based methods. We highlighted several evaluation methods for both task-oriented dialog systems and conversational agents, and drew a distinction between automated quantitative metrics [23,24] and user evaluation methodologies [124,129]. It is easier to evaluate task-oriented systems, as they need to complete a specific task whose efficiency can be measured. User satisfaction can also be modeled, as covered above, and can be used to automate the evaluation process. Conversational agents are trickier to evaluate, especially when reducing human effort: there is no particular set task or correct answer, so the focus is on the ’quality’ of the dialog, with human judges annotating the ’appropriateness’ of a dialog. The current state of the art for modelling appropriateness is by means of latent representations such as ADEM [129].
- What is the requirement of an automated evaluation method for testing the usability of dialog systems? To elaborate on this research question, it is necessary that empirical metrics show high correlation with human judgements, or the evaluation would be out of context. As these metrics do not correlate strongly with human judgements [25], the need for more comprehensive evaluation metrics became apparent. Scoring high on an empirical evaluation method does not guarantee that the system performs well if users cannot navigate the system easily. On the other hand, user-based evaluation is very subjective. Thus, a gap exists in the evaluation methods available for dialogue systems. There is a considerable amount of work on evaluating dialog systems with existing metrics. Several of the evaluation methods require human intervention to perform feature engineering, manually annotate data, and double-check the results provided by evaluation metrics. Human labor is too expensive to obtain in most cases. Other factors, such as the cognitive load on users and human annotators, also need to be considered while performing evaluation. An automated evaluation system can considerably reduce human effort, while also providing insight into system quality that can improve the usability of the system. Existing evaluation methods depend on the type of dialog system they are evaluating, which reduces generalization. There is also a lack of well-defined metrics for the evaluation of conversational search agents, and there is a long way to go in developing evaluation methods for conversational search systems. Current evaluation methods are either very vague or very specific. The future scope of this study is to develop a comprehensive evaluation system by finding some middle ground between empirical and user-based methods. In this study, we attempt to understand the process of performing comprehensive dialog evaluation, which requires automating the process, making it repeatable, and increasing correlation with human judgements. A semi-automated evaluation approach is provided in the study by Kaushik et al. [124], which presents a framework for an implicit evaluation method. The framework encompasses various information retrieval (IR) and user-based factors such as user satisfaction, knowledge gain, cognitive load, and so on. Such a standard framework could be accepted across different research paradigms by drafting it into an Application Programming Interface (API). The advantage of this approach is that all the data are collected together and a comparative study can be performed; the data can also be stored and used for further analysis in the future. The limitation of this approach is that some human effort is still required in the form of expert knowledge. Another approach is to develop an in-built evaluation tool within an interactive system: the system is evaluated with each user interaction and the results are sent directly to the analyser. This approach is time-saving, but it makes one-on-one comparison of dialog systems difficult. The system may be updated with each evaluation, which may introduce a bias and could be the subject of further investigation. Further exploration of the factors that affect the conversation along multiple dimensions would be interesting. The issues and challenges discussed in this paper can be taken up by the future research community to develop better dialog systems that provide an enhanced user experience.
10. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Xu, A.; Liu, Z.; Guo, Y.; Sinha, V.; Akkiraju, R. A new chatbot for customer service on social media. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, Denver, CO, USA, 6–11 May 2017; pp. 3506–3510. [Google Scholar]
- Quarteroni, S.; Manandhar, S. A chatbot-based interactive question answering system. In Proceedings of the 11th Workshop on the Semantics and Pragmatics of Dialogue 2007, Rovereto, Italy, 30 May–1 June 2007; pp. 83–90. [Google Scholar]
- Prochaska, J.J.; Vogel, E.A.; Chieng, A.; Kendra, M.; Baiocchi, M.; Pajarito, S.; Robinson, A. A Therapeutic Relational Agent for Reducing Problematic Substance Use (Woebot): Development and Usability Study. J. Med. Internet Res. 2021, 23, e24850. [Google Scholar] [CrossRef]
- Madhu, D.; Jain, C.N.; Sebastain, E.; Shaji, S.; Ajayakumar, A. A novel approach for medical assistance using trained chatbot. In Proceedings of the International Conference on Inventive Communication and Computational Technologies (ICICCT), Coimbatore, India, 10–11 March 2017; pp. 243–246. [Google Scholar]
- Følstad, A.; Nordheim, C.B.; Bjørkli, C.A. What makes users trust a chatbot for customer service? An exploratory interview study. In International Conference on Internet Science, Proceedings of the 5th International Conference, INSCI 2018, St. Petersburg, Russia, 24–26 October 2018; Springer: Cham, Switzerland, 2018; pp. 194–208. [Google Scholar]
- AbuShawar, B.; Atwell, E. ALICE chatbot: Trials and outputs. Comput. Sist. 2015, 19, 625–632. [Google Scholar] [CrossRef]
- Weizenbaum, J. ELIZA—A computer program for the study of natural language communication between man and machine. Commun. ACM 1966, 9, 36–45. [Google Scholar] [CrossRef]
- Csaky, R. Deep learning based chatbot models. arXiv 2019, arXiv:1908.08835. [Google Scholar]
- Shawar, B.A.; Atwell, E.S. Using corpora in machine-learning chatbot systems. Int. J. Corpus Linguist. 2005, 10, 489–516. [Google Scholar] [CrossRef]
- Haristiani, N. Artificial Intelligence (AI) chatbot as language learning medium: An inquiry. In Journal of Physics: Conference Series; IOP Publishing: Bristol, UK, 2019; Volume 1387, p. 012020. [Google Scholar]
- McTear, M.F.; Callejas, Z.; Griol, D. Toward a Technology of Conversation. In The Conversational Interface; Springer: Berlin/Heidelberg, Germany, 2016; Volume 6, pp. 25–50. [Google Scholar]
- Deriu, J.; Rodrigo, A.; Otegi, A.; Echegoyen, G.; Rosset, S.; Agirre, E.; Cieliebak, M. Survey on evaluation methods for dialogue systems. Artif. Intell. Rev. 2021, 54, 755–810. [Google Scholar] [CrossRef] [PubMed]
- Radlinski, F.; Craswell, N. A theoretical framework for conversational search. In Proceedings of the 2017 Conference on Conference Human Information Interaction and Retrieval, Oslo, Norway, 7–11 March 2017; pp. 117–126. [Google Scholar]
- Wei, Z.; Liu, Q.; Peng, B.; Tou, H.; Chen, T.; Huang, X.J.; Wong, K.F.; Dai, X. Task-oriented dialogue system for automatic diagnosis. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, 15–20 July 2018; pp. 201–207. [Google Scholar]
- Hoy, M.B. Alexa, Siri, Cortana, and more: An introduction to voice assistants. Med. Ref. Serv. Q. 2018, 37, 81–88. [Google Scholar] [CrossRef]
- Siri. Available online: https://www.apple.com/siri/ (accessed on 6 October 2021).
- Cortana. Available online: https://www.microsoft.com/en-us/cortana (accessed on 6 October 2021).
- Amazon Alexa. Available online: https://alexa.amazon.com (accessed on 3 October 2021).
- Kaushik, A.; Bhat Ramachandra, V.; Jones, G.J. An interface for agent supported conversational search. In Proceedings of the 2020 Conference on Human Information Interaction and Retrieval, Vancouver, British Columbia, 14–18 March 2020; pp. 452–456. [Google Scholar]
- Chandra, Y.W.; Suyanto, S. Indonesian chatbot of university admission using a question answering system based on sequence-to-sequence model. Procedia Comput. Sci. 2019, 157, 367–374. [Google Scholar] [CrossRef]
- Sreelakshmi, A.; Abhinaya, S.; Nair, A.; Nirmala, S.J. A question answering and quiz generation chatbot for education. In Proceedings of the 2019 Grace Hopper Celebration India (GHCI), Bangalore, India, 6–8 November 2019; pp. 1–6. [Google Scholar]
- Cui, L.; Huang, S.; Wei, F.; Tan, C.; Duan, C.; Zhou, M. Superagent: A customer service chatbot for e-commerce websites. In Proceedings of the ACL 2017, System Demonstrations, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 97–102. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
- Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
- Liu, C.W.; Lowe, R.; Serban, I.V.; Noseworthy, M.; Charlin, L.; Pineau, J. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv 2016, arXiv:1603.08023. [Google Scholar]
- Gunasekara, C.; Kim, S.; D’Haro, L.F.; Rastogi, A.; Chen, Y.N.; Eric, M.; Hedayatnia, B.; Gopalakrishnan, K.; Liu, Y.; Huang, C.W.; et al. Overview of the ninth dialog system technology challenge: Dstc9. arXiv 2020, arXiv:2011.06486. [Google Scholar]
- Hara, S.; Kitaoka, N.; Takeda, K. Estimation Method of User Satisfaction Using N-gram-based Dialog History Model for Spoken Dialog System. In Proceedings of the LREC, Valletta, Malta, 17–23 May 2010. [Google Scholar]
- Yang, Z.; Levow, G.A.; Meng, H. Predicting user satisfaction in spoken dialog system evaluation with collaborative filtering. IEEE J. Sel. Top. Signal Process. 2012, 6, 971–981. [Google Scholar] [CrossRef]
- Walker, M.A.; Litman, D.J.; Kamm, C.A.; Abella, A. PARADISE: A framework for evaluating spoken dialogue agents. arXiv 1997, arXiv:cmp-lg/9704004. [Google Scholar]
- Malchanau, A.; Petukhova, V.; Bunt, H. Multimodal dialogue system evaluation: A case study applying usability standards. In 9th International Workshop on Spoken Dialogue System Technology; Springer: Singapore, 2019; pp. 145–159. [Google Scholar]
- Arora, S.; Batra, K.; Singh, S. Dialogue system: A brief review. arXiv 2013, arXiv:1306.4134. [Google Scholar]
- Fraser, N.; Gibbon, D.; Moore, R.; Winski, R. Assessment of interactive systems. In Handbook of Standards and Resources for Spoken Language Systems; Mouton de Gruyter: Berlin, Germany, 1998; pp. 564–615. [Google Scholar]
- Oviatt, S. Multimodal interfaces. In The Human-Computer Interaction Handbook; CRC Press: Boca Raton, FL, USA, 2007; pp. 439–458. [Google Scholar]
- Klopfenstein, L.C.; Delpriori, S.; Malatini, S.; Bogliolo, A. The rise of bots: A survey of conversational interfaces, patterns, and paradigms. In Proceedings of the 2017 Conference on Designing Interactive Systems, Edinburgh, UK, 10–14 June 2017; pp. 555–565. [Google Scholar]
- McTear, M.; Callejas, Z.; Griol, D. The dawn of the conversational interface. In The Conversational Interface; Springer: Berlin/Heidelberg, Germany, 2016; pp. 11–24. [Google Scholar]
- Allen, J. Natural Language Understanding; Benjamin-Cummings Publishing Co., Inc.: San Francisco, CA, USA, 1988. [Google Scholar]
- Ravuri, S.; Stolcke, A. A comparative study of neural network models for lexical intent classification. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, AZ, USA, 13–17 December 2015; pp. 368–374. [Google Scholar]
- Nadeau, D.; Sekine, S. A survey of named entity recognition and classification. Lingvisticae Investig. 2007, 30, 3–26. [Google Scholar] [CrossRef]
- Aimin, F.C.H. Automatic recognition of natural language based on pattern matching. Comput. Eng. Appl. 2006. [Google Scholar]
- Lee, G.G.; Seo, J.; Lee, S.; Jung, H.; Cho, B.H.; Lee, C.; Kwak, B.K.; Cha, J.; Kim, D.; An, J.; et al. SiteQ: Engineering high performance QA system using lexico-semantic pattern matching and shallow NLP. In Proceedings of the TREC, Gaithersburg, MD, USA, 13–16 November 2001. [Google Scholar]
- Chatterjee, N.; Kaushik, N. RENT: Regular expression and NLP-based term extraction scheme for agricultural domain. In Proceedings of the International Conference on Data Engineering and Communication Technology; Springer: Singapore, 2017; pp. 511–522. [Google Scholar]
- Ranjan, N.; Mundada, K.; Phaltane, K.; Ahmad, S. A Survey on Techniques in NLP. Int. J. Comput. Appl. 2016, 134, 6–9. [Google Scholar] [CrossRef]
- Huyck, C.R.; Lytinen, S.L. Efficient heuristic natural language parsing. In Proceedings of the AAAI, Washington, DC, USA, 11–15 July 1993; pp. 386–391. [Google Scholar]
- Charras, F.; Duplessis, G.D.; Letard, V.; Ligozat, A.L.; Rosset, S. Comparing system-response retrieval models for open-domain and casual conversational agent. In Proceedings of the Second Workshop on Chatbots and Conversational Agent Technologies (WOCHAT@ IVA2016), Los Angeles, CA, USA, 20 September 2016. [Google Scholar]
- Duplessis, G.D.; Letard, V.; Ligozat, A.L.; Rosset, S. Purely corpus-based automatic conversation authoring. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portoroz, Slovenia, 23–28 May 2016; pp. 2728–2735. [Google Scholar]
- Goldberg, Y.; Levy, O. word2vec Explained: Deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv 2014, arXiv:1402.3722. [Google Scholar]
- McCormick, C. Word2vec Tutorial—The Skip-Gram Model. 2016. Available online: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model (accessed on 3 October 2021).
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS’13)-Volume 2, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 3111–3119. [Google Scholar]
- Schulert, A.J.; Rogers, G.T.; Hamilton, J.A. ADM—A dialog manager. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, San Francisco, CA, USA, April 1985; pp. 177–183. [Google Scholar]
- Williams, J.D.; Henderson, M.; Raux, A.; Thomson, B.; Black, A.; Ramachandran, D. The dialog state tracking challenge series. AI Mag. 2014, 35, 121–124. [Google Scholar] [CrossRef] [Green Version]
- Xu, P.; Hu, Q. An end-to-end approach for handling unknown slot values in dialogue state tracking. arXiv 2018, arXiv:1805.01555. [Google Scholar]
- McTear, M. The Role of Spoken Dialogue in User—Environment Interaction. In Human-Centric Interfaces for Ambient Intelligence; Elsevier: Amsterdam, The Netherlands, 2010; pp. 225–254. [Google Scholar]
- Kobayashi, M.; Takeda, K. Information retrieval on the web. ACM Comput. Surv. (CSUR) 2000, 32, 144–173. [Google Scholar] [CrossRef]
- Abdul-Kader, S.A.; Woods, J. Question answer system for online feedable new born Chatbot. In Proceedings of the Intelligent Systems Conference (IntelliSys), London, UK, 7–8 September 2017; pp. 863–869. [Google Scholar]
- Maroengsit, W.; Piyakulpinyo, T.; Phonyiam, K.; Pongnumkul, S.; Chaovalit, P.; Theeramunkong, T. A Survey on Evaluation Methods for Chatbots. In Proceedings of the 7th International Conference on Information and Education Technology, Aizu-Wakamatsu, Japan, 29–31 March 2019; pp. 111–119. [Google Scholar]
- Santhanam, S.; Shaikh, S. Towards best experiment design for evaluating dialogue system output. arXiv 2019, arXiv:1909.10122. [Google Scholar]
- Bartl, A.; Spanakis, G. A retrieval-based dialogue system utilizing utterance and context embeddings. In Proceedings of the 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico, 18–21 December 2017; pp. 1120–1125. [Google Scholar]
- Arora, P.; Kaushik, A.; Jones, G.J. DCU at the TREC 2019 Conversational Assistance Track. In Proceedings of the TREC, Gaithersburg, MD, USA, 13–15 November 2019. [Google Scholar]
- Kaushik, A.; Ramachandra, V.B.; Jones, G.J. DCU at the FIRE 2020 Retrieval from Conversational Dialogues (RCD) task. In Proceedings of the FIRE 2020: 12th meeting of Forum for Information Retrieval Evaluation, Hyderabad, India, 16–20 December 2020; pp. 788–805. [Google Scholar]
- Tetreault, J.; Filatova, E.; Chodorow, M. Rethinking grammatical error annotation and evaluation with the Amazon Mechanical Turk. In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use, Los Angeles, CA, USA, 5 June 2010; pp. 45–48. [Google Scholar]
- Satav, A.G.; Ausekar, A.B.; Bihani, R.M.; Shaikh, A. A proposed natural language query processing system. Int. J. Sci. Appl. Inf. Technol. 2014, 3. Available online: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.458.8145&rep=rep1&type=pdf (accessed on 26 October 2021).
- McDonald, D.D. Natural Language Generation. Handb. Nat. Lang. Process. 2010, 2, 121–144. [Google Scholar]
- Bateman, J.; Zock, M. Natural language generation. In The Oxford Handbook of Computational Linguistics; Oxford University Press: Cambridge, UK, 2003. [Google Scholar]
- Colby, K.M. Modeling a paranoid mind. Behav. Brain Sci. 1981, 4, 515–534. [Google Scholar] [CrossRef]
- Lemon, O.; Pietquin, O. Machine learning for spoken dialogue systems. In Proceedings of the European Conference on Speech Communication and Technologies (Interspeech’07), Antwerp, Belgium, 27–31 August 2007; pp. 2685–2688. [Google Scholar]
- Inui, N.; Koiso, T.; Nakamura, J.; Kotani, Y. Fully corpus-based natural language dialogue system. In Proceedings of the Natural Language Generation in Spoken and Written Dialogue, AAAI Spring Symposium, Stanford, CA, USA, 24–26 March 2003; pp. 1–3. [Google Scholar]
- Oh, A.; Rudnicky, A. Stochastic language generation for spoken dialogue systems. In Proceedings of the ANLP-NAACL 2000 Workshop: Conversational Systems, Washington, DC, USA, 4 May 2000; pp. 27–32. [Google Scholar]
- Zhang, Z.; Takanobu, R.; Zhu, Q.; Huang, M.; Zhu, X. Recent advances and challenges in task-oriented dialog systems. In Science China Technological Sciences; Springer: Berlin/Heidelberg, Germany, 16 September 2020; pp. 1–17. [Google Scholar]
- Chen, P.; Lu, Y.; Peng, Y.; Liu, J.; Xu, Q. Identification of Students’ Need Deficiency Through a Dialogue System. In International Conference on Artificial Intelligence in Education, Proceedings of the 21st International Conference, AIED 2020, Ifrane, Morocco, 6–10 July 2020; Springer: Cham, Switzerland, 2020; pp. 59–63. [Google Scholar]
- Wen, T.H.; Vandyke, D.; Mrksic, N.; Gasic, M.; Rojas-Barahona, L.M.; Su, P.H.; Ultes, S.; Young, S. A network-based end-to-end trainable task-oriented dialogue system. arXiv 2016, arXiv:1604.04562. [Google Scholar]
- Chiba, Y.; Nose, T.; Kase, T.; Yamanaka, M.; Ito, A. An analysis of the effect of emotional speech synthesis on non-task-oriented dialogue system. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, Melbourne, Australia, 12–14 July 2018; pp. 371–375. [Google Scholar]
- Niculescu, A.I.; Jiang, R.; Kim, S.; Yeo, K.H.; D’Haro, L.F.; Niswar, A.; Banchs, R.E. SARA: Singapore’s automated responsive assistant, a multimodal dialogue system for touristic information. In International Conference on Mobile Web and Information Systems, Proceedings of the 11th International Conference, MobiWIS 2014, Barcelona, Spain, 27–29 August 2014; Springer: Cham, Switzerland, 2014; pp. 153–164. [Google Scholar]
- Zhang, Y.; Chen, X.; Ai, Q.; Yang, L.; Croft, W.B. Towards conversational search and recommendation: System ask, user respond. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Torino Italy, 22–26 October 2018; pp. 177–186. [Google Scholar]
- Vtyurina, A.; Savenkov, D.; Agichtein, E.; Clarke, C.L. Exploring conversational search with humans, assistants, and wizards. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, Denver, CO, USA, 6–11 May 2017; pp. 2187–2193. [Google Scholar]
- Cahn, J. CHATBOT: Architecture, Design, & Development; University of Pennsylvania School of Engineering and Applied Science Department of Computer and Information Science: Philadelphia, PA, USA, 2017. [Google Scholar]
- Turing, A.M. Computing Machinery and Intelligence. Mind 1950, 59, 433–460. [Google Scholar] [CrossRef]
- Kenny, P.; Parsons, T.; Gratch, J.; Rizzo, A. Virtual humans for assisted health care. In Proceedings of the 1st International Conference on PErvasive Technologies Related to Assistive Environments, Athens, Greece, 16–18 July 2008; pp. 1–4. [Google Scholar]
- Tavarnesi, G.; Laus, A.; Mazza, R.; Ambrosini, L.; Catenazzi, N.; Vanini, S.; Tuggener, D. Learning with Virtual Patients in Medical Education. In Proceedings of the EC-TEL (Practitioner Proceedings), Leeds, UK, 3–6 September 2018. [Google Scholar]
- Serban, I.V.; Sordoni, A.; Bengio, Y.; Courville, A.; Pineau, J. Hierarchical neural network generative models for movie dialogues. arXiv 2015, arXiv:1507.04808. [Google Scholar]
- Medsker, L.R.; Jain, L. Recurrent neural networks. Des. Appl. 2001, 5, 64–67. [Google Scholar]
- Vinyals, O.; Le, Q. A neural conversational model. arXiv 2015, arXiv:1506.05869. [Google Scholar]
- Serban, I.; Sordoni, A.; Bengio, Y.; Courville, A.; Pineau, J. Building end-to-end dialogue systems using generative hierarchical neural network models (2015). arXiv 2016, arXiv:1507.04808. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv 2020, arXiv:cs.CL/1909.11942. [Google Scholar]
- Martin, L.; Muller, B.; Suárez, P.J.O.; Dupont, Y.; Romary, L.; de la Clergerie, É.V.; Seddah, D.; Sagot, B. Camembert: A tasty french language model. arXiv 2019, arXiv:1911.03894. [Google Scholar]
- Antoun, W.; Baly, F.; Hajj, H. AraBERT: Transformer-based model for Arabic language understanding. arXiv 2020, arXiv:2003.00104. [Google Scholar]
- Gonen, H.; Ravfogel, S.; Elazar, Y.; Goldberg, Y. It’s not Greek to mBERT: Inducing Word-Level Translations from Multilingual BERT. arXiv 2020, arXiv:2010.08275. [Google Scholar]
- Lee, C.; Jung, S.; Kim, S.; Lee, G.G. Example-based dialog modeling for practical multi-domain dialog system. Speech Commun. 2009, 51, 466–484. [Google Scholar] [CrossRef]
- Baxter, G.J.; Blythe, R.A.; Croft, W.; McKane, A.J. Utterance selection model of language change. Phys. Rev. E 2006, 73, 046118. [Google Scholar] [CrossRef] [Green Version]
- Denkowski, M.; Lavie, A. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, Scotland, UK, 30–31 July 2011; pp. 85–91. [Google Scholar]
- Duplessis, G.D.; Charras, F.; Letard, V.; Ligozat, A.L.; Rosset, S. Utterance retrieval based on recurrent surface text patterns. In European Conference on Information Retrieval, Proceedings of the 39th European Conference on IR Research, ECIR 2017, Aberdeen, UK, 8–13 April 2017; Springer: Cham, Switzerland, 2017; pp. 199–211. [Google Scholar]
- Bouziane, A.; Bouchiha, D.; Doumi, N.; Malki, M. Question answering systems: Survey and trends. Procedia Comput. Sci. 2015, 73, 366–375. [Google Scholar] [CrossRef] [Green Version]
- Yang, Y.; Yih, W.t.; Meek, C. Wikiqa: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 19–21 September 2015; pp. 2013–2018. [Google Scholar]
- Oniani, D.; Wang, Y. A qualitative evaluation of language models on automatic question-answering for COVID-19. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, Virtual Event, USA, 21–24 September 2020; pp. 1–9. [Google Scholar]
- Piccinini, G. Turing’s rules for the imitation game. Minds Mach. 2000, 10, 573–582. [Google Scholar] [CrossRef]
- Sethi, F. FAQ (Frequently Asked Questions) ChatBot for Conversation. Authorea Prepr. 2020, 8. [Google Scholar] [CrossRef]
- Rahman, J. Implementation of ALICE Chatbot as Domain Specific Knowledge Bot for BRAC U (FAQ Bot). Ph.D. Thesis, BRAC University, Dhaka, Bangladesh, 2012. [Google Scholar]
- Lee, K.; Jo, J.; Kim, J.; Kang, Y. Can Chatbots Help Reduce the Workload of Administrative Officers?-Implementing and Deploying FAQ Chatbot Service in a University. In International Conference on Human-Computer Interaction, Proceedings of the 21st International Conference, HCII 2019, Orlando, FL, USA, 26–31 July 2019; Springer: Cham, Switzerland, 2019; pp. 348–354. [Google Scholar]
- Van Rousselt, R. Natural language processing bots. In Pro Microsoft Teams Development; Springer: Berlin/Heidelberg, Germany, 2021; pp. 161–185. [Google Scholar]
- Nagarhalli, T.P.; Vaze, V.; Rana, N. A Review of Current Trends in the Development of Chatbot Systems. In Proceedings of the 6th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 6–7 March 2020; pp. 706–710. [Google Scholar]
- Su, M.H.; Wu, C.H.; Huang, K.Y.; Hong, Q.B.; Wang, H.M. A chatbot using LSTM-based multi-layer embedding for elderly care. In Proceedings of the International Conference on Orange Technologies (ICOT), Singapore, 8–10 December 2017; pp. 70–74. [Google Scholar]
- Kuligowska, K. Commercial chatbot: Performance evaluation, usability metrics and quality standards of embodied conversational agents. Prof. Cent. Bus. Res. 2015, 2, 1–16. [Google Scholar] [CrossRef]
- Baby, C.J.; Khan, F.A.; Swathi, J. Home automation using IoT and a chatbot using natural language processing. In Proceedings of the Innovations in Power and Advanced Computing Technologies (i-PACT), Vellore, India, 21–22 April 2017; pp. 1–6. [Google Scholar]
- Lee, C.H.; Chen, T.Y.; Chen, L.P.; Yang, P.C.; Tsai, R.T.H. Automatic question generation from children’s stories for companion chatbot. In Proceedings of the IEEE International Conference on Information Reuse and Integration (IRI), Salt Lake City, UT, USA, 6–9 July 2018; pp. 491–494. [Google Scholar]
- Pichponreay, L.; Kim, J.H.; Choi, C.H.; Lee, K.H.; Cho, W.S. Smart answering Chatbot based on OCR and Overgenerating Transformations and Ranking. In Proceedings of the Eighth International Conference on Ubiquitous and Future Networks (ICUFN), Vienna, Austria, 5–8 July 2016; pp. 1002–1005. [Google Scholar]
- D’silva, G.M.; Thakare, S.; More, S.; Kuriakose, J. Real world smart chatbot for customer care using a software as a service (SaaS) architecture. In Proceedings of the International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud)(I-SMAC), Mobile, Tamil Nadu, India, 10–11 February 2017; pp. 658–664. [Google Scholar]
- Choi, H.; Hamanaka, T.; Matsui, K. Design and implementation of interactive product manual system using chatbot and sensed data. In Proceedings of the IEEE 6th Global Conference on Consumer Electronics (GCCE), Nagoya, Japan, 24–27 October 2017; pp. 1–5. [Google Scholar]
- Latif, S.; Cuayáhuitl, H.; Pervez, F.; Shamshad, F.; Ali, H.S.; Cambria, E. A Survey on Deep Reinforcement Learning for Audio-Based Applications. arXiv 2021, arXiv:2101.00240. [Google Scholar]
- Kaushik, A.; Loir, N.; Jones, G.J. Multi-view conversational search interface using a dialogue-based agent. In European Conference on Information Retrieval, Proceedings of the 43rd European Conference on IR Research, ECIR 2021, Virtual Event, 28 March–1 April 2021; Springer: Cham, Switzerland, 2021; pp. 520–524. [Google Scholar]
- Bocklisch, T.; Faulkner, J.; Pawlowski, N.; Nichol, A. Rasa: Open source language understanding and dialogue management. arXiv 2017, arXiv:1712.05181. [Google Scholar]
- Krasakis, A.M.; Aliannejadi, M.; Voskarides, N.; Kanoulas, E. Analysing the effect of clarifying questions on document ranking in conversational search. In Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval, Virtual Event, Norway, 14–17 September 2020; pp. 129–132. [Google Scholar]
- Google Dialogflow. Available online: https://dialogflow.cloud.google.com/ (accessed on 3 October 2021).
- Singh, A.; Ramasubramanian, K.; Shivam, S. Introduction to Microsoft Bot, RASA, and Google Dialogflow. In Building an Enterprise Chatbot; Springer: Berlin/Heidelberg, Germany, 2019; pp. 281–302. [Google Scholar]
- Dialogflow. Available online: https://dialogflow.com/docs (accessed on 3 October 2021).
- Intent. Available online: https://cloud.google.com/dialogflow/es/docs/intents-overview (accessed on 26 October 2021).
- Lopatovska, I.; Rink, K.; Knight, I.; Raines, K.; Cosenza, K.; Williams, H.; Sorsche, P.; Hirsch, D.; Li, Q.; Martinez, A. Talk to me: Exploring user interactions with the Amazon Alexa. J. Librariansh. Inf. Sci. 2019, 51, 984–997. [Google Scholar] [CrossRef]
- Harms, J.G.; Kucherbaev, P.; Bozzon, A.; Houben, G.J. Approaches for dialog management in conversational agents. IEEE Internet Comput. 2018, 23, 13–22. [Google Scholar] [CrossRef] [Green Version]
- Li, J.; Monroe, W.; Ritter, A.; Galley, M.; Gao, J.; Jurafsky, D. Deep reinforcement learning for dialogue generation. arXiv 2016, arXiv:1606.01541. [Google Scholar]
- Zhao, T.; Eskenazi, M. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. arXiv 2016, arXiv:1606.02560. [Google Scholar]
- Scheffler, K.; Young, S. Automatic learning of dialogue strategy using dialogue simulation and reinforcement learning. In Proceedings of the HLT, San Diego, CA, USA, 24–27 March 2002; Volume 2. [Google Scholar]
- Dhingra, B.; Li, L.; Li, X.; Gao, J.; Chen, Y.N.; Ahmed, F.; Deng, L. Towards end-to-end reinforcement learning of dialogue agents for information access. arXiv 2016, arXiv:1609.00777. [Google Scholar]
- Kaushik, A.; Jones, G.J. A Conceptual Framework for Implicit Evaluation of Conversational Search Interfaces. arXiv 2021, arXiv:2104.03940. [Google Scholar]
- Jurcıcek, F.; Keizer, S.; Gašic, M.; Mairesse, F.; Thomson, B.; Yu, K.; Young, S. Real user evaluation of spoken dialogue systems using Amazon Mechanical Turk. In Proceedings of the INTERSPEECH, Florence, Italy, 27–31 August 2011; Volume 11. [Google Scholar]
- Bradeško, L.; Mladenić, D. A survey of chatbot systems through a loebner prize competition. In Proceedings of the Slovenian Language Technologies Society Eighth Conference of Language Technologies, Ljubljana, Slovenia, 8–12 October 2012; Institut Jožef Stefan Ljubljana: Ljubljana, Slovenia, 2012; pp. 34–37. [Google Scholar]
- Simpson, A.; Fraser, N.M. Black box and glass box evaluation of the SUNDIAL system. In Proceedings of the Third European Conference on Speech Communication and Technology, Berlin, Germany, 21–23 September 1993. [Google Scholar]
- Holmes, S.; Moorhead, A.; Bond, R.; Zheng, H.; Coates, V.; McTear, M. WeightMentor: A new automated chatbot for weight loss maintenance. In Proceedings of the 32nd International BCS Human Computer Interaction Conference 32, Belfast, UK, 4–6 July 2018; pp. 1–5. [Google Scholar]
- Lowe, R.; Noseworthy, M.; Serban, I.V.; Angelard-Gontier, N.; Bengio, Y.; Pineau, J. Towards an automatic turing test: Learning to evaluate dialogue responses. arXiv 2017, arXiv:1708.07149. [Google Scholar]
- Kannan, A.; Vinyals, O. Adversarial evaluation of dialogue models. arXiv 2017, arXiv:1701.08198. [Google Scholar]
- Lowe, R.; Serban, I.V.; Noseworthy, M.; Charlin, L.; Pineau, J. On the evaluation of dialogue systems with next utterance classification. arXiv 2016, arXiv:1605.05414. [Google Scholar]
- Li, J.; Galley, M.; Brockett, C.; Gao, J.; Dolan, B. A diversity-promoting objective function for neural conversation models. arXiv 2015, arXiv:1510.03055. [Google Scholar]
- Li, B.; Han, L. Distance weighted cosine similarity measure for text classification. In International Conference on Intelligent Data Engineering and Automated Learning, Proceedings of the 14th International Conference, IDEAL 2013, Hefei, China, 20–23 October 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 611–618. [Google Scholar]
- Peng, B.; Li, C.; Zhang, Z.; Zhu, C.; Li, J.; Gao, J. RADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems. arXiv 2020, arXiv:2012.14666. [Google Scholar]
- Tao, C.; Mou, L.; Zhao, D.; Yan, R. Ruber: An unsupervised method for automatic evaluation of open-domain dialog systems. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
- Xu, X.; Dušek, O.; Konstas, I.; Rieser, V. Better conversations by modeling, filtering, and optimizing for coherence and diversity. arXiv 2018, arXiv:1809.06873. [Google Scholar]
- Chen, S.F.; Beeferman, D.; Rosenfeld, R. Evaluation Metrics for Language Models; Carnegie Mellon University: Pittsburgh, PA, USA, 1980. [Google Scholar]
- Ritter, A.; Cherry, C.; Dolan, B. Unsupervised modeling of twitter conversations. In Proceedings of the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, CA, USA, 2–4 June 2010; pp. 172–180. [Google Scholar]
- Sordoni, A.; Galley, M.; Auli, M.; Brockett, C.; Ji, Y.; Mitchell, M.; Nie, J.Y.; Gao, J.; Dolan, B. A neural network approach to context-sensitive generation of conversational responses. arXiv 2015, arXiv:1506.06714. [Google Scholar]
- Galley, M.; Brockett, C.; Sordoni, A.; Ji, Y.; Auli, M.; Quirk, C.; Mitchell, M.; Gao, J.; Dolan, B. deltaBLEU: A discriminative metric for generation tasks with intrinsically diverse targets. arXiv 2015, arXiv:1506.06863. [Google Scholar]
- Almeida, F.; Xexéo, G. Word embeddings: A survey. arXiv 2019, arXiv:1901.09069. [Google Scholar]
- Rudkowsky, E.; Haselmayer, M.; Wastian, M.; Jenny, M.; Emrich, Š.; Sedlmair, M. More than bags of words: Sentiment analysis with word embeddings. Commun. Methods Meas. 2018, 12, 140–157. [Google Scholar] [CrossRef] [Green Version]
- Corley, C.; Mihalcea, R. Measures of text semantic similarity. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence, Ann Arbor, MI, USA, 30 June 2005. [Google Scholar]
- Lintean, M.; Rus, V. Measuring semantic similarity in short texts through greedy pairing and word semantics. In Proceedings of the Twenty-Fifth International FLAIRS Conference, Marco Island, FL, USA, 23–25 May 2012. [Google Scholar]
- Rus, V.; Lintean, M. An optimal assessment of natural language student input using word-to-word similarity metrics. In International Conference on Intelligent Tutoring Systems, Proceedings of the 11th International Conference, ITS 2012, Chania, Crete, Greece, 14–18 June 2012; Springer: Cham, Switzerland, 2012; pp. 675–676. [Google Scholar]
- Foltz, P.W.; Kintsch, W.; Landauer, T.K. The measurement of textual coherence with latent semantic analysis. Discourse Process. 1998, 25, 285–307. [Google Scholar] [CrossRef]
- Landauer, T.K.; Dumais, S.T. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol. Rev. 1997, 104, 211. [Google Scholar] [CrossRef]
- Mitchell, J.; Lapata, M. Vector-based models of semantic composition. In Proceedings of the ACL-08: HLT, Columbus, OH, USA, 16–18 June 2008; pp. 236–244. [Google Scholar]
- Forgues, G.; Pineau, J.; Larchevêque, J.M.; Tremblay, R. Bootstrapping dialog systems with word embeddings. In Proceedings of the Nips, Modern Machine Learning and Natural Language Processing Workshop, Montreal, QC, Canada, 12–13 December 2014; Volume 2. [Google Scholar]
- Hardalov, M.; Koychev, I.; Nakov, P. Machine Reading Comprehension for Answer Re-Ranking in Customer Support Chatbots. Information 2019, 10, 82. [Google Scholar] [CrossRef] [Green Version]
- Dhyani, M.; Kumar, R. An intelligent Chatbot using deep learning with Bidirectional RNN and attention model. Mater. Today Proc. 2021, 34, 817–824. [Google Scholar] [CrossRef]
- Liu, Q.; Huang, J.; Wu, L.; Zhu, K.; Ba, S. CBET: Design and evaluation of a domain-specific chatbot for mobile learning. Univers. Access Inf. Soc. 2020, 19, 655–673. [Google Scholar] [CrossRef]
- Callison-Burch, C.; Osborne, M.; Koehn, P. Re-evaluating the role of BLEU in machine translation research. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy, 5–6 April 2006. [Google Scholar]
- Lin, C.Y.; Och, F. Looking for a few good metrics: ROUGE and its evaluation. In Proceedings of the Ntcir Workshop, Tokyo, Japan, 2–4 June 2004. [Google Scholar]
- Dutta, S.; Klakow, D. Evaluating a neural multi-turn chatbot using BLEU score. Univ. Saarl. 2019, 10, 1–12. [Google Scholar]
- Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
- Adiwardana, D.; Luong, M.T.; So, D.R.; Hall, J.; Fiedel, N.; Thoppilan, R.; Yang, Z.; Kulshreshtha, A.; Nemade, G.; Lu, Y.; et al. Towards a human-like open-domain chatbot. arXiv 2020, arXiv:2001.09977. [Google Scholar]
- Jena, G.; Vashisht, M.; Basu, A.; Ungar, L.; Sedoc, J. Enterprise to computer: Star trek chatbot. arXiv 2017, arXiv:1708.00818. [Google Scholar]
- Engelbrecht, K.P.; Gödde, F.; Hartard, F.; Ketabdar, H.; Möller, S. Modeling user satisfaction with hidden Markov models. In Proceedings of the SIGDIAL 2009 Conference, London, UK, 11–12 September 2009; pp. 170–177. [Google Scholar]
- Georgila, K.; Henderson, J.; Lemon, O. User simulation for spoken dialogue systems: Learning and evaluation. In Interspeech; Citeseer: Pittsburgh, PA, USA, 2006; pp. 1065–1068. [Google Scholar]
- Schatzmann, J.; Weilhammer, K.; Stuttle, M.; Young, S. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. Knowl. Eng. Rev. 2006, 21, 97–126. [Google Scholar] [CrossRef] [Green Version]
- Kreyssig, F.; Casanueva, I.; Budzianowski, P.; Gasic, M. Neural user simulation for corpus-based policy optimisation for spoken dialogue systems. arXiv 2018, arXiv:1805.06966. [Google Scholar]
- Holmes, S.; Moorhead, A.; Bond, R.; Zheng, H.; Coates, V.; McTear, M. Usability testing of a healthcare chatbot: Can we use conventional methods to assess conversational user interfaces? In Proceedings of the 31st European Conference on Cognitive Ergonomics, Belfast, UK, 10–13 September 2019; pp. 207–214. [Google Scholar]
- Lewis, J.R.; Sauro, J. The factor structure of the system usability scale. In International Conference on Human Centered Design, Proceedings of the First International Conference, HCD 2009, Held as Part of HCI International 2009, San Diego, CA, USA, 19–24 July 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 94–103. [Google Scholar]
- Schrepp, M. User experience questionnaire handbook. In All You Need to Know to Apply the UEQ Successfully in Your Project; UEQ: Weyhe, Germany, 2015. [Google Scholar]
- Sharma, M.; Kaushik, A.; Kumar, R.; Rai, S.K.; Desai, H.H.; Yadav, S. Communication is the universal solvent: Atreya bot—An interactive bot for chemical scientists. arXiv 2021, arXiv:2106.07257. [Google Scholar]
- Hajdinjak, M.; Mihelič, F. The PARADISE evaluation framework: Issues and findings. Comput. Linguist. 2006, 32, 263–272. [Google Scholar] [CrossRef]
- Peras, D. Chatbot evaluation metrics. In Proceedings of the 36th International Scientific Conference on Economic and Social Development: Book of Proceedings, Zagreb, Croatia, 14–15 December 2018; pp. 89–97. [Google Scholar]
- Xu, Q.; Huang, G.; Yuan, Y.; Guo, C.; Sun, Y.; Wu, F.; Weinberger, K. An empirical study on evaluation metrics of generative adversarial networks. arXiv 2018, arXiv:1806.07755. [Google Scholar]
- Guo, F.; Metallinou, A.; Khatri, C.; Raju, A.; Venkatesh, A.; Ram, A. Topic-based evaluation for conversational bots. arXiv 2018, arXiv:1801.03622. [Google Scholar]
- DeVault, D.; Leuski, A.; Sagae, K. Toward learning and evaluation of dialogue policies with text examples. In Proceedings of the SIGDIAL 2011 Conference, Portland, OR, USA, 17–18 June 2011; pp. 39–48. [Google Scholar]
- Gandhe, S.; Traum, D. A semi-automated evaluation metric for dialogue model coherence. In Situated Dialog in Speech-Based Human-Computer Interaction; Springer: Berlin/Heidelberg, Germany, 2016; pp. 217–225. [Google Scholar]
- Serban, I.V.; Lowe, R.; Henderson, P.; Charlin, L.; Pineau, J. A survey of available corpora for building data-driven dialogue systems: The journal version. Dialogue Discourse 2018, 9, 1–49. [Google Scholar] [CrossRef]
- Gasic, M.; Breslin, C.; Henderson, M.; Kim, D.; Szummer, M.; Thomson, B.; Tsiakoulis, P.; Young, S. POMDP-based dialogue manager adaptation to extended domains. In Proceedings of the SIGDIAL 2013 Conference, Metz, France, 22–24 August 2013; pp. 214–222. [Google Scholar]
- Qu, C.; Yang, L.; Croft, W.B.; Trippas, J.R.; Zhang, Y.; Qiu, M. Analyzing and characterizing user intent in information-seeking conversations. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 989–992. [Google Scholar]
- Choi, E.; He, H.; Iyyer, M.; Yatskar, M.; Yih, W.T.; Choi, Y.; Liang, P.; Zettlemoyer, L. Quac: Question answering in context. arXiv 2018, arXiv:1808.07036. [Google Scholar]
- Reddy, S.; Chen, D.; Manning, C.D. Coqa: A conversational question answering challenge. Trans. Assoc. Comput. Linguist. 2019, 7, 249–266. [Google Scholar] [CrossRef]
- Kim, S.; D’Haro, L.F.; Banchs, R.E.; Williams, J.D.; Henderson, M. The fourth dialog state tracking challenge. In Dialogues with Social Robots; Springer: Berlin/Heidelberg, Germany, 2017; pp. 435–449. [Google Scholar]
- Pavlopoulos, J.; Thain, N.; Dixon, L.; Androutsopoulos, I. Convai at semeval-2019 task 6: Offensive language identification and categorization with perspective and bert. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, MN, USA, 6–7 June 2019; pp. 571–576. [Google Scholar]
- Ram, A.; Prasad, R.; Khatri, C.; Venkatesh, A.; Gabriel, R.; Liu, Q.; Nunn, J.; Hedayatnia, B.; Cheng, M.; Nagar, A.; et al. Conversational ai: The science behind the alexa prize. arXiv 2018, arXiv:1801.03604. [Google Scholar]
| Evaluation Method | Model Used | Results | Evaluation Criteria |
|---|---|---|---|
| Survey of WeightMentor app [128] | SUS, UEQ, and CUQ questionnaires | Median SUS score, participant scores above +0.8, and CUQ mean score | Usability study of WeightMentor, implemented through Dialogflow |
| Automatic Dialogue Evaluation Model (ADEM) [129] | RNN | Pearson’s correlation of 0.41 at the utterance level and 0.954 at the system level | Quality of dialog responses |
| Adversarial evaluation [130] | GAN-style adversarial training on real data and data produced artificially by the generator; the model learns to discriminate whether the data is real or artificial | Discriminator accuracy of 62.5% | Naturalness of dialog |
| Evaluation by next utterance classification [131] | Datasets: Ubuntu Corpus, SubTle Corpus, Twitter Corpus. Dual-Encoder (DE) models built from RNNs and LSTMs; an ANN trained to correlate with human judgement | R@1 and R@2 were calculated; the highest values of both were obtained on the Twitter Corpus | Dialog strategies and next utterance |
| Lexical diversity [132] | Maximum Mutual Information (MMI) models that generate more diverse and appropriate responses | Improved quality as measured by BLEU and human evaluation | Diversity of responses in conversational systems |
| Distance-weighted cosine similarity measure [133] | Vector Space Model (VSM) with a centroid classification algorithm; datasets: 20 Newsgroups, Reuters52c, Sector | Micro-averaged F1 score (MicF1) is highest for Reuters52c (90.3816) | Cosine similarity for text classification |
| RADDLE [134] | Fine-tuned GPT-2, Domain Aware Multi-Decoder (DAMD), SOLOIST | SOLOIST performs best, scoring about 10 more average points than GPT-2 | Generalization ability of task-oriented dialog systems |
| RUBER [135] | Seq2Seq with attention and retrieval-based models, compared against human judgements | The RUBER metric shows near-human correlation | Correlation with human judgement for open-domain dialog systems |
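As a concrete illustration of the Recall@k (R@1, R@2) criterion summarized in the next-utterance-classification row above [131], the following is a minimal sketch, not the authors' implementation: given a context, a model scores a set of candidate responses, and the metric counts how often the ground-truth response lands in the top-k ranked candidates. The `recall_at_k` function and the toy word-overlap scorer (a stand-in for a trained dual-encoder's matching score) are illustrative assumptions.

```python
# A minimal sketch of Recall@k scoring for next-utterance classification.
# The scorer below is a toy stand-in for a learned model such as a dual encoder.

from typing import Callable, List, Sequence


def recall_at_k(
    contexts: Sequence[str],
    candidate_lists: Sequence[List[str]],
    true_indices: Sequence[int],
    score_fn: Callable[[str, str], float],
    k: int,
) -> float:
    """Fraction of contexts for which the true response is ranked in the top-k."""
    hits = 0
    for context, candidates, true_idx in zip(contexts, candidate_lists, true_indices):
        scores = [score_fn(context, cand) for cand in candidates]
        # Rank candidate indices by descending model score.
        ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
        if true_idx in ranked[:k]:
            hits += 1
    return hits / len(contexts)


if __name__ == "__main__":
    # Toy scorer: word overlap between context and candidate.
    def overlap_score(context: str, candidate: str) -> float:
        return len(set(context.lower().split()) & set(candidate.lower().split()))

    contexts = ["what time does the store open"]
    candidates = [["the store opens at nine", "i like turtles", "no idea"]]
    true_indices = [0]
    print(recall_at_k(contexts, candidates, true_indices, overlap_score, k=1))  # 1.0
```

In practice, the scoring function would be the trained response-selection model itself, and the candidate pool would mix the ground-truth response with distractors sampled from the corpus, so R@1 and R@2 directly measure how reliably the model ranks the true next utterance first or within the top two.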