Do You Ever Get Off Track in a Conversation? The Conversational System’s Anatomy and Evaluation Metrics

: Conversational systems are now applicable to almost every business domain. Evaluation is an important step in the creation of dialog systems so that they may be readily tested and prototyped. There is no universally agreed upon metric for evaluating all dialog systems. Human evaluation, which is not computerized, is now the most effective and complete evaluation approach. Data gathering and analysis are evaluation activities that need human intervention. In this work, we address the many types of dialog systems and the assessment methods that may be used with them. The beneﬁts and drawbacks of each sort of evaluation approach are also explored, which could better help us understand the expectations associated with developing an automated evaluation system. The objective of this study is to investigate conversational agents, their design approaches and evaluation metrics. This approach can help us to better understand the overall process of dialog system development, and future possibilities to enhance user experience. Because human assessment is costly and time consuming, we emphasize the need of having a generally recognized and automated evaluation model for conversational systems, which may signiﬁcantly minimize the amount of time required for analysis.


Introduction
A dialog system is any conversational agent whose purpose is communicate with a human through a text or voice interface. Dialog systems are extensively used in a number of industries for a number of different sectors such as customer service [1], information seeking [2], conversational therapy [3], diagnosis [4], and more. Dialog systems are beneficial as they can support multiple users without any decrement in performance. They are also robust and easily adaptable. Providing an automated response reduces the chances of human error. Users may prefer to interact with a dialog system as it does not judge or misuse the information disclosed by the user [5]. Dialog systems also eliminate the factor of human frustration in repeated use of the system. Thus, they prove beneficial for both the business and the end user in terms of increasing full-time availability. Rule-based dialog systems such as ALICE [6] and ELIZA [7] make of use of matching pairs to output dialog. Both machine learning and deep learning tools are also widely used in natural language processing to implement such systems [8,9]. Dialog systems are considered valuable because they are approachable, improve customer experience, manage large number of customers, and reduce operational costs. Companies are investing a lot into dialog systems because of this reason. Dialog systems are available round the clock to improve overall customer experience. People also usually prefer to interact with dialog systems over humans [10]. McTear et al. [11] explain that dialog systems usually construct the dialog in turns. Each turn can be defined by one or more responses from each user. Two consecutive turns between each user can be referred to as an exchange, and multiple exchanges could be referred to as a dialog. Each turn taken by a user contribute a part to the dialog. A turn can consist of a single word, as well as multiple sentences. Depending on the task at hand, the dialogues may be short or long. For example, shorter dialogues may be used with mobile assistants, longer dialogues may be needed where a lot of information is required to be verified, such as travel management and financial agents.
Some studies provide a broad classification of dialog systems [12] as task-oriented dialog systems and non-task oriented dialog systems. Task-oriented dialog systems assist the user in completing a task or achieving a goal. Conversational search-based systems [13], which provide an agent to assist in search tasks, are a subcategory of task-oriented systems. Non-task oriented dialog systems emulate a more human-like conversational flow. Question-answering systems are trained on a large corpus of domain-specific knowledge so that they can provide responses [2]. Task-oriented dialog systems are employed by businesses for purposes such as interactive self-service, automatic diagnosis [14], scheduling appointments, ordering food, technical support, educational help, etc. Hoy [15] mention popular examples of widely available task-oriented virtual assistants such as Apple Siri [16], Microsoft Cortana [17], Alexa [18], and more. Conversational systems can aim to answer user queries on websites or carry out 'chats' with the users. The focus rests on providing a natural conversational pattern so the system appears human-like. Kaushik et al. [19] introduce a conversational search system with a dialog agent that assists users in complex information seeking processes. QA systems can be used for tasks such as answering questions about university admissions [20], quiz generation [21], customer service [22], and more. Search-based conversational systems provide assistance to the user in completing a search task. Figure 1 shows classification of dialog systems. The categorization is done on several features of the system such as: 1.
Interaction mode: The dialog system could receive input from the user in the form of text or voice.

2.
Design approach: The system may be modeled to provide answers about a particular topic (closed domain) or a range of topics (open-domain).

3.
Search-based: Search-based systems assist the user in fulfilling their information needs. 4. Goals: The system may be designed for task completion (task-oriented), or carrying on conversations (non-task oriented). 5.
Knowledge domain: The response may be selected by following a pre-defined rule (rule-based), retrieving pre-defined responses (retrieval based), or generating new responses entirely (generative models). Irrespective of the type of dialog system, the most important step in its development is evaluation. Evaluation provides feedback on factors of the system such as success rate, user satisfaction, contextual validity, etc. thus allowing us to grow and develop a better system to provide adaptability. Evaluation of dialog systems is a crucial step in their development, deployment, and improvement. There is no standard way to define the efficiency of a dialog system. As discussed in the previous paragraph, task-oriented dialog systems may score efficiency by successful completion of tasks. Non-task oriented systems aim to provide a human-like conversations, so the naturalness of dialogues may be prioritised. A reliable factor to measure the efficiency of a dialog system is user satisfaction. A system may perform well on empirical evaluation, but might ultimately not meet the user's expectations. Higher user satisfaction will result in higher acceptance and use of the system. The metrics for evaluating dialog systems also vary based on different application domains. Chatbots that provide services such as reservations, bookings, etc. can easily be evaluated with quantitative metrics such as task success rate [12], BLEU [23], and ROUGE [24]. Conversational agents, such as therapy bots, may require human evaluation [3].
A broad classification of evaluation metrics is empirical evaluation metrics and usability-based evaluation metrics. Empirical evaluation metrics are quantitative and calculated on factors such as correctness of the response based on a ground truth response [25]. Examples of empirical evaluation metrics are provided by Gunasekara et al. [26] and include Success rate, Precision, Recall, BLEU [23], ROUGE [24], etc. Usability metrics, on the other hand calculate the efficiency of the system based on user satisfaction. Hara et al. [27] propose an estimation method of user satisfaction for a spoken dialog system. The model was trained on a huge dataset and user evaluation was conducted in a real environment. The model achieved a classification accuracy of 94.7%. Yang et al. [28] make use of collaborative filtering to predict user satisfaction in spoken dialog systems. The evaluation process varies in accordance with the type of dialogue system. Task oriented systems can be evaluated with empirical metrics. QA systems can also be evaluated with quantitative metrics such as precision and recall. Conversational search systems require both qualitative and quantitative evaluation methods. This study aims to discuss the various methods available for evaluation of dialog systems. We discuss the various criteria that are considered for evaluation and an approach for automation of the evaluation process.

Motivation
The motivation of this study is to highlight evaluation methods for dialogue systems as they vary according to their task. As there is no widely accepted evaluation metric, the evaluation task is designed specially for the particular projects. This is a time-consuming process and the need for a well-designed evaluation metric which generalizes well to unknown systems is apparent. As mentioned in the previous section, evaluation is categorized into empirical methods and usability metrics. There is a significant trade-off between the application of empirical metrics and the metrics based on user satisfaction. Empirical metrics are quick and easy to use, with very little human effort required. However, they do not provide an accurate estimation of human judgement of the system. Empirical evaluation also involves training large models on large datasets, thus increasing the computational use and costs. User evaluation, on the contrary, provides a better idea of the user satisfaction with the system. However, it is very time-intensive and costly. Evaluation of task-oriented chatbots can be performed using empirical metrics and human evaluation. Empirical evaluation methods are usually automated and do not require human intervention. Only an initial evaluation script is required. However, to be able to provide an accurate picture of the performance of the chatbot, the empirical metric must provide results that correlate strongly to human judgement. This has shown not be the case for several metrics, thus greatly impacting their usability.
Human evaluation is a more reliable alternative to empirical methods. However, there are certain other shortcomings with human evaluation. These include the costs associated with obtaining human judgements and lack of a standardized protocol for human evaluation. The development of datasets for benchmark evaluation and analysis of the evaluation are tasks that depend on human effort. These responses can then be used for comparative analysis with model-generated responses. Manually annotating responses takes up a lot of time and human effort, thereby increasing costs. It may also be difficult to find specialists for the field in which the dialog system is applicable. The lack of standardization makes it difficult to compare different systems. There is also a gap between the usability of empirical evaluation metrics available for task-oriented and conversational systems. An evaluation metric that might work well for task oriented systems may not translate well for conversational agents [29,30]. Empirical evaluation metrics provide an objective and quantitative measure of performance of dialog systems. They represent the various functions of a system through mathematical equations. They are straightforward and easy to implement, but do not approximate well to human judgements [25]. Human evaluation provides a more subjective interpretation of the system's performance. The evaluation is carried out from the user's perspective by collecting responses about the various aspects of the dialogue system. In this paper, we explore the options in evaluation methods which do not require too much human interaction but provide a better picture of the system's performance. This approach attempts to find some middle ground which satisfies both evaluation types.
For this study, Google Scholar was used to perform searches of relevant keywords. The following keywords were searched: Dialogue systems, Evaluation metrics, Discourse models, Conversational AI, Chatbots, User evaluation dialog systems, Task-oriented dialog system, Search-based dialog system, Conversational dialog system, Dialog metric, Conversational search, Human centered computing. The citations for the top five results of various keywords is shown in Figure 2. We noted and averaged the citations for each keyword, and selected the results with a citation number greater than the average. The subsequent searches were done pertaining to the specific evaluation methods. The following research questions aim to be answered by this review study:

1.
What are the evaluation methods available for Dialog systems based on the structure of the dialog? 2.
What is the requirement of an automated evaluation method for testing the usability of dialog systems?

Structural Flow of the Article
This review paper aims to encompass the various types of dialog systems and the evaluation methods that can be used to better quantify and improve the systems. A brief overview of structure of the article is given in Figure 3. The article is structured as follows: The components of dialog systems are discussed in Section 4. Sections 5-7 provide a brief discussion of task-oriented, conversational, and question-answering dialog systems respectively. Section 8 classifies dialog systems on the basis of the logic component, that is, rule-based systems and AI based systems. Section 9 covers the evaluation of dialog systems. The section is divided into empirical metrics and user-based evaluation. The usability based metrics are discussed separately for task-oriented and conversational agents. Section 10 provides an overview of the various datasets and evaluation challenges currently being conducted in the field of dialog systems.

Components of Dialog Systems
Dialogue systems serve as an interface with which the user can interact in a natural manner. There are several components in the dialog system architecture that help model the dialog as a sequence of actions. It is very helpful to understand the functions of the various components of a dialog system, as the impact of making changes in the components can be mapped to the final product. Arora et al. [31] propose the following components of a dialogue system: Input interface, Natural language understanding, Dialogue manager, Domain specific component, and Response generator. Some of these components vary for different dialogue systems, depending on application area. Each component in the system provides an input for the next one. The query provided by the user serves as the input of the NLU unit. The NLU unit attempts to understand the input and corresponds with the dialogue manager. The response generator corresponds with the domain-specific component to provide an output to the user query. The conceptual workflow of a dialogue system and its various components is shown in Figure 4.

Conversational Interface
The input to the dialog system could be in the form of text or speech. A conversational interface serves to provide the user with the ability to interact with the dialogue system in a natural manner. A dialog system may be text-based, speech-based or multimodal. In a text-based system, an interface with a chat option will be available. Spoken dialog systems accept input in the form of speech over one or multiple turns [32]. A multimodal dialog system will accept input in more than one form, such as speech, text, pen, and body movement [33]. A conversational user interface (CUI) is a digital interface that facilitates interaction with a dialog system in a convenient manner. Klopfenstein et al. [34] demonstrate several areas in which conversational interfaces are being studied, such as natural language processing, human-computer interaction, usability of the interface, and so on. A CUI should be as user-friendly as possible, as it follows the principles of human to human conversation [11]. The user does not need to worried about the model used or other functionality. The user is provided a simple interface to answer and ask questions. There are several methods available to develop a user interface. Some of the most relevant parameters from an evaluation perspective would be user experience, time taken, and user frustration while using the system [35].

Natural Language Understanding
Natural Language Understanding (NLU) is a tool-kit to help an assistant to understand the input. It helps to extract structured data in the form of intents and entities from unstructured human language. Intents can be understood as labels that represent the overall goal of the user's input. Entities are pieces of information that an assistant may need in a certain context. NLU consists of three sub-tasks which are intent classification, domain classification, and slot identification [36]. Slot-pair pairs are extracted from the current user utterance. Some of the common functions associated with NLU are listed as follows:

1.
Intents: Intent classification is an important component of natural language understanding (NLU) which classifies and labels the input given by the user to assign it a specific goal [37]. It enables the dialog system to understand user requests. For example, the user may input a question such as "What time does the store open?". The intent for this particular request is most likely "Opening Times", which can then allow the system to answer with the opening hours of the store. An intent represents a mapping between what a user says and what action should be taken by the software.

2.
Named entity recognition: Named entity recognition (NER) is the task of identifying named entities in the text [38]. Entities are pre-defined categories like names, locations, quantities, expressions of times, etc. Entity extraction enables the system to extract information from the text and helps in organising in text. For example, the name of a store, location of a venue, and the fee of a particular service could be considered entities. An entity represents concepts that are often specific to domain as a way of mapping natural language phrases to canonical phrases that capture their meaning.

3.
Pattern Matching: Pattern matching helps to match the input obtained from the user with the database and try to obtain an appropriate response [39]. Lee et al. [40] list the algorithms available for pattern matching such as fuzzy string matching, regular expressions [41], rule-based matching, token-based matching, etc.

4.
Parsing: Text parsing is the process of determining the syntactic structure of a text. It separates a given corpus into smaller components based on some rules [42]. Parsing algorithms parse the text in accordance to the predefined rule of algorithms such as left-right and bottom-up algorithms. These algorithms learn to recognize strings and assign syntactic structures to the strings [43]. Figure 5 shows an example of a parsing in action. The syntactic structure of the sentence 'The message is the answer' is obtained by dividing its constituent words according to their grammatical function. 5.
TF-IDF: Term frequency-inverse document frequency (TF-IDF) weight measures the importance of a word to the document it belongs to in a corpus. It converts user input into vectors and finds similarity with documents. The system returns high priority results from the chatbot by comparing higher cosine similarity with TF-IDF. Cosine Similarity measures the content-based similarity between two vectors which represent the text summary and reference system summary in the vector space. TF-IDF models can also serve as similarity metrics for evaluation of dialogue [44,45]. 6.
Word2Vec: Word2Vec is a two-layered neural network used to produce word embedding [46][47][48]. It is a similar approach to TF-IDF as it also processes the text corpus into vectors. The text corpus is turned into a numerical form and plotted into a vector space to create a knowledge base.

Dialog Manager
A dialog is a written or spoken conversational exchange between two or more entities. A dialog manager keeps track of history in memory with states. It is also responsible for generating responses and providing a conversational flow [49]. Dialog planning enables the system to be able to manage the conversation by understanding the context. The system should be able to manage a conversation even when there is a switch of context in the dialog. The dialog manager also takes into account dialog history while selecting responses that feel more natural and user-friendly. For example, the user might say "I need to order a copy of ' To kill a mockingbird' in e-book" and the system might record the order. The user might then say "change it to hard copy". The system should be able to correctly interpret from context that the user requires a hard copy of the same book. The dialog manager receives its input from the natural language understanding components, corresponds with the logic or knowledge component, and passes the output to the natural language generation component [31]. Dialog state tracking (DST) is used to estimate the current belief system of a dialog taking the preceding conversation into account [50]. The current belief state encodes the user's intent and the preceding dialogues to better handle misunderstandings. Xu and Hu [51] explain the use of end-to-end models to better handle unknown slot values in dialog state tracking. A higher probability may be assigned to an entity based on the dialogues in the previous turns. The dialog manager learns the strategy. The input of the strategy module is the current belief as computed by the DST module. The dialog manager then generates the next dialog or action in the system. The dialog control decides which action needs to be taken.
McTear [52] identify two main steps in the dialogue management process: 1. Dialogue modeling: It keeps track of the state of the dialogue.

2.
Dialogue control: It makes decision about the next step to be taken by the system.

Domain Specific Component
This component is the external database or expert system that contains the domain knowledge. Conversational search agents can work with search and information retrieval from the internet [53]. Abdul-Kader and Woods [54] propose the development of a dialog system which does not require a knowledge base, but searches the world wide web, with the help of chatterbot Python library and NLTK. Text-matching is performed to find the answers for queries and this new knowledge is then added to the structured database. Maroengsit et al. [55] thoroughly explain several methods for information extraction and processing of semantic information. The processing step makes extensive use of natural language processing (NLP). The data that is used to train the system is collected externally or from conversational dialogues, and then converted to a suitable format.
A publicly available corpus of data could be also be used to train a conversational system. The dialog system may vary significantly depending on the data on which it is trained. Knowledge that the dialog system is trained on is the heart of a dialog system. Some of the largest datasets present are the Reddit Corpus [56], with over 1.7 billion comments. The Twitter corpus [25] consists of 1.3 million conversations. The Ubuntu Corpus could be used to train a task-oriented or information retrieval system [57]. Datasets from several evaluation challenges are also made available [58,59]. However, creating a knowledge base for a very specific task would require a lot of human time and effort in engineering and annotation. The corpus might need to be annotated by human judges that could be crowd-sourced at Amazon mechanical Turk [60]. An external component may also interface with natural language query processing to extract SQL queries from natural queries [61].

Response Generator
The response generation step provides the output to the user. Natural language generation (NLG) transforms the structured data into natural language to provide an output to the user [62,63]. In order to successfully generate appropriate responses to the user query, it is important that the user intent and context of the conversation is understood comprehensively.

Rule-Based Systems
Rule-based systems follow a set of pre-defined rules or a tree-like structure to generate responses. One of the earliest rule-based systems is ELIZA [7] created in 1986 at MIT. ELIZA used pattern matching and keywords to process a set of pre-programmed rules to allow the system to give proper responses. PARRY [64] on the other hand attempts to model the mind of an actual paranoid patient with results that clear the Turing test. Section 6.1 discusses rule-based systems in more detail.

Corpus-Based Systems
Corpus based systems usually work with employing information retrieval (IR) to datasets of human-human interaction. IR can copy a user's response from a previous conversation to the current conversation. Machine learning and neural network models can also be trained to map a user utterance to a system response [65]. There is no strict adherence to hand-written rules and the functions of the system depend on the corpora [66]. Several available corpora, such as movie dialogues, conversations on chat platforms, and tweets, can serve as the training set for the model. Corpus based systems are also known as response generation systems [67], as the primary focus is the generation of a response appropriate to the user's query.

Types of Dialog Systems
In this section, we discuss the various types of dialog systems with their characteristics. The section provides an overview of task-oriented, conversational and questionanswering systems.

Task-Oriented Dialog Systems
Task-oriented systems are conversational agents developed for solving a particular task in the most efficient way possible. Task-oriented systems have a narrow focus of performing a certain task specified by the user. Zhang et al. [68] illustrate that dialog systems could be designed for specific tasks such as making reservations, inquiring about a particular business, customer support [22], teaching aid [69], etc. Task-oriented dialog systems could be text-based [70], speech-based [71], or multi-modal [72]. Task-oriented systems are trained on restricted domain knowledge and do not handle trivia questions well. The system could be trained with knowledge specific to any domain and follows a clear structure.

Dialog Structure
Task-oriented systems are usually based on domain ontology, which defines the terms in a specific area, the relationships between the terms, and an expression of the relationships in an hierarchical structure. Task-oriented systems usually encompass a narrow domain. The models for developing these bots are usually retrieval-based systems. As we discussed in Section 5.1, retrieval-based models use a repository of predefined responses.

Conversational Dialog Systems
Conversational agents are a type of non-task oriented systems which are developed either for normal chit-chatting or for performing conversational search [73,74]. Conversational agents are designed to have extended and unstructured conversations, which try to replicate the 'naturalness' of human to human interactions. Conversational dialog systems follow the structure of human conversations instead of focusing on the completion of specific tasks. These systems could be used for entertainment and other purposes such as providing psychological counseling. There is no specific task to solve so the dialog is more open-domain. The dialogues are usually longer as in case of human interactions. Cahn [75] recommend that it is preferable for a social conversational system to have a personality as they attempt to mimic human behavior. The Turing test [76] is a good judge of the effectiveness of conversational agents in imitating humans. The solution proposed by Kenny et al. [77] makes use of virtual humans as a solution for monitoring and providing healthcare for the elderly. The virtual humans are endowed with a personality, speech recognition, and other NLP functionalities. Vision and other sensors could also be added to absorb certain variables from the person's environment. This could provide beneficial assisted healthcare for the person. Tavarnes et al. [78] introduced the idea of Virtual Patient Learning (VPL), which is used in the providing training aid for healthcare students and professionals.

Modelling Conversational Dialog Systems
There are currently two main types of modelling approaches for conversational dialog systems: rule-based systems and corpus-based systems. Rule-based systems are further discussed in logic based classification Section 6.1. We will further discuss the corpusbased approaches used to model conversational agents. Corpus-based approaches could either be based on a retrieval model or a generative model. Retrieval-based models try to extract the appropriate response from its knowledge corpus. Generative models attempt to generate appropriate responses during the conversation. Intelligent conversational agents are usually build using generative models. Generative models can generate new responses from scratch. They are more likely to make mistakes, as the responses are longer. These dialog systems often attempt to pass various versions of the Turing test.

1.
Generative models: Generative models are an unsupervised approach to generate responses for the dialog system. The response generation step provides the output to the user. Natural language generation (NLG) transforms the structured data into natural language to provide an output to the user. In order to successfully generate appropriate responses to the user query, it is important that the user intent and context of the conversation are understood comprehensively. Generative models make use of deep neural networks. The dialog is generated by training the model on a large corpus of dialogues, and the most appropriate response to a given user utterance is returned. The experiment done by Serban et al. [79] to create a generative model involved two models which are Recurrent neural networks (RNNs) [80] and Hierarchical recurrent encoder decoded architecture (HRED). Vinyals et al. [81] and Serban et al. [82] demonstrate the application in dialog response generation for both LSTM and HRED architectures respectively.
Language models are probabilistic or statistical models that determine the probability of the occurrence of a word from a given corpus. Contrary to rule-based algorithms, language models attempt to understand the contextual relationships between different words in a sentence. Responses generated from a language model are thus more relevant in the given context. One of the most popularly used language models is the Bidirectional encoder representations from transformers (BERT). Devlin et al. [83] introduce BERT, which is a specific, large transformer masked language model. For a masked language model (MLM), you train the model by removing words and having the model fill in the correct word. Masked language models are useful because they are a type of contextual word embedding. Contextual word embedding allows you to have different word representations for different contextual meanings of the word. The BERT architecture uses a stack of either 12 (BASE) or 24 (LARGE) Encoders. It can be used as a general-purpose pre-trained model that is fine-tuned for specific tasks. BERT is a transferable model, thus it can be used as input to smaller, task specific models. With successful fine tuning of the model, we can achieve a very high accuracy. There are pre-trained BERT models available in over 100 languages. There are certain extensions of the BERT architecture such as RoBERTa [84], DistilBERT [85], AlBERT [86], and more. BERT models are also present in different languages such as CamemBERT (French) [87], AraBERT (Arabic) [88] and mBERT (multilingual) [89].
Drawbacks are that the model is very big, slow to train, computationally expensive, and needs to be fine-tuned for downstream tasks.

2.
Utterance Selection: The modeling of the dialog system is done as an information retrieval task in utterance selection. Candidate responses are ranked according to their relevance. The most appropriate response is retrieved from the database according to a given utterance. Figure 6 shows the extraction of values from an utterance. A probability could be calculated which then ranks the candidate utterances according to their relevance. The utterances in a dialog database help define the dialog structure [90], as the system learns to map the semantically relevant responses to the user utterances. An utterance selection model as defined by Baxter et al. [91], is a set of rules that help in filling slots for response generation. The similarity between the dialog history and candidate utterances can be measured by a similarity measure. Surface form similarity measures the similarity based on token level. Examples of these measures include METEOR [92] and Term Frequency-Inverse Document Frequency (TF-IDF) models, discussed in Section 4.2 [44,45]. The recurrent surface text pattern approach is proposed by Duplessis et al. [93] which involves a database of recurrent surface text patterns and the utterance retrieval from the database through a generalised vector-space model.

Question Answering Dialog Systems
Question answering systems are used to answer a specific question asked by a user in natural language [2,94]. Unlike rule-based systems which can be used to perform a multitude of tasks, question answering systems do not have to be particularly designed for a domain [95]. The system is usually initiated by a user prompt. The system can take several turns to gather all the information needed to answer the question. QA systems are not very strictly structured, but the answers depend on the knowledge domain the system is trained on. QA systems can be build to provide answers for multiple domains. Unlike conversational agents, QA systems do not attempt to mimic human-like interaction. The main focus lies on correctly answering the question in a short amount of time. The responses provided by the system are generally short and follow a natural language format. Chandra and Suyanto [20] present a question-answering dialog system which is trained on a dataset of conversations about university admission. The chatbot was designed using a sequenceto-sequence and an attention mechanism technique. Sreelakshmi et al. [21] propose a QA system which performs both question-answering and quiz-generation. Neural networks are used for the question-answering module and the quiz generation module generates question-answer pairs. The approach takes input in form of a book which is converted to the knowledge base, making it adaptable to a variety of subjects.

Evaluation
Due to the nature of multi-turn QA systems, it is quite difficult to design an accurate and automated evaluation metric. A significant amount of human intervention is required in the evaluation process. Evaluation metrics used in information retrieval systems are often employed in the evaluation of QA systems as well. Information retrieval systems respond to a user's query turn with a reply retrieved from the corpus it is trained on. The relevance of the dialog system depends on the sector and the corpus it is trained on. In addition to the training corpus, the system can also append the conversational data it obtains from interacting with users. There are several algorithms available to develop a model for an IR system. The response can be generated by using words from the prior response or using full IR ranking algorithms. Non-dialog text can also serve as the training data for IR systems. For example, dialog systems can be trained on articles to generate informative responses. Oniani et al. [96] perform subjective evaluation of language models for a COVID-19 question-answering system. The responses generated by the four chosen models were annotated by two medical experts on different relevance scores. The bidirectional encoder representations from transformers (BERT) model provided the best results out of the four.

Classification Based on Logic
Dialogue systems can be of the following two types: Rule-based and Intelligent. Rulebased systems follow a set of predefined rules and cannot answer questions that are not hard-coded. They map out conversations through the implementation of a decision tree. Intelligent systems, on the other hand are able to learn from interacting with users. They implement artificial intelligence (AI) techniques such as machine learning, deep learning, and reinforcement learning. There are certain advantages and disadvantages to both of these. Rule-based systems are quick and easy to train. The evaluation process is also simpler and they are usually more reliable. The downside is that they have a very restricted conversational range and do not evolve with interactions. AI-based dialog agents attempt to understand the meaning of the text so they are able to generate relevant responses. This requires a lot of data for training. Evaluation also becomes more difficult.

Rule-Based Systems
Piccinini [97] discusses the "imitation game" developed by Alan Turing. It is a test between human and machine subjects and the task is to differentiate between the two. If a machine is able to fool the "interrogator", it is said to have passed the Turing Test and thus has the ability to think. The first conversational system to even get close to passing the Turing Test was ELIZA introduced by Weizenbaum [7] and developed in MIT in 1960. It was trained on 'scripts'. The most popular script was DOCTOR, which was modeled as a Rogerian psychotherapist. ELIZA follows keyword identification and pattern matching to generate rule-based responses. The next to come in line was PARRY introduced by Colby [64], which was modeled as a mind of a paranoid schizophrenic. Another significant rule-based system is ALICE, introduced by AbuShawar and Atwell [6]. ALICE works on a knowledge base to generate responses based on <pattern> and <template> pairs. The knowledge base was created in AIML (Artificial Intelligence Mark-up Language), which is an extension of XML. Figure 7 provides a snippet of AIML code. FAQ agents could also be developed as a rule-based system. FAQ agents are a type of task-oriented dialog system that help businesses automate answering of frequently asked questions by users. Sethi [98] present a prototype for a simple FAQ agent that provides an easy user interface. FAQ agents are trained on domain-specific knowledge, making them highly efficient and knowledgeable. Rahman [99] introduce an ALICE chatbot presented as an FAQ agent for providing assistance to users looking for information about a university. Lee et al. [100] discuss the usefulness of chatbots in reducing administrative load by using the NASA-TLX questionnaire. FAQ agents prove to be better at providing responses as they can easily extract entities and understand intents. Van Rousselt [101] mention FAQ agents which are very commonly used now-a-days. They allow users to ask a simple questions and get a response. Follow-up questions may also be added. The industry standard to build FAQ assisted agents is by using a set of rules or a state machine.

Intelligent Systems
Intelligent systems, or AI systems, make use of natural language processing and artificial intelligence tools to generate responses. Machine learning, deep learning, and reinforcement learning tools could be used to implement intelligent systems. Intelligent systems attempt to understand the intent behind the user input, rather than keyword matching. Conversations with intelligent systems is more natural and human-like, as the system can be trained to accommodate basic human errors such as typing errors and grammatical mistakes. AI-based systems provide an advantage over rule-based systems as they are better at extracting entities and understanding intents. With the use of natural language processing, it is possible to understand the context of the user input. For example, the word 'bank' may have a different meaning depending on the context of the sentence. It could be used in the context of a river bank or a financial institution. Intelligent systems should be able keep a context of what has been said before, handle unexpected conversation turns, drive the conversation when path is diverted, and improve over time. The current state-of-the-art dialog systems include Apple's Siri [16], Amazon's Alexa [18], and Microsoft Cortana [17].

Deep Learning and Machine Learning Based Systems
Nagarhalli et al. [102] present the current trends in the development of dialog system systems. Conversational artificial intelligent bots are one of the most important application of natural language processing. Su et al. [103] propose a dialog system to interact with elderly people, with the use of a structured database containing over 2000 message and response pairs, making use of LSTMs. Kuligowska [104] explain how to get a robot to have a conversation on a story narrated by it. The robot can be developed by using a CKIB toolkit for word segmentation, parts of speech tagging, and term frequency inverse document frequency for noun identification. Artificial intelligence markup language (AIML) is used for development of the dialog system. Legacy techniques are used for the dialog system, which may not be the most efficient technique. Baby et al. [105] propose to use a dialog system for home automation with the help of IoT domain. The paper proposes to use NLTK to identify keywords which are then passed on to the micro-controller which will control home appliances. Lee et al. [106] suggest the development of a companion robot for children. Question-answer pairs are generated and ranked from about 100 student's tales with the help of logistic regression (LR). Pichponreay et al. [107] propose to use the ApachePDFBCK for optical character recognition of text present in PDF documents. The questions can be generated and string matching can be used to generate answers. Dialog systems have been showed to be efficient for use in customer service time and time again. D'silva et al. [108] show the use of dialog systems for customer service, as the problems faced by customers are not always resolved by the companies.
Dialog systems are also being widely used in the health care sector. Madhu et al. [4] propose a dialog system which takes in symptoms as user input and provides disease predictions, along with the suggested medication and dosage. There are several limitations of using a dialog system for medical purposes. There is a lack of scalability as the diseases and symptoms have to be hard-coded. Choi et al. [109] propose the development of dialog system that helps users with queries about newly purchased electronic gadgets. Understanding the features of a new product by going through a manual can be a daunting task. The dialog system can be helpful to keep pace with the user's learning speed. Kuligowska [104] propose a dialog system for answering queries about banking procedures. Customer service may not be always available to answer user's queries, and this may be particularly important if a large sum of money is involved. The dialog system is trained on a dataset comprised of frequently asked questions from various banking platforms. The paper makes use of NLTK and bag-of-words for converting words into vectors. Classification is performed through different machine learning algorithms, and the question is mapped to the answer using cosine similarity. The accuracy is claimed to reach 87%.
Models such as Sequence to Sequence (Seq2Seq) demonstrated by Latif et al. [110] and Bidirectional encoder representations from Transformers (BERT) as demonstrated by Devlin et al. [83] are the current industry best practices. Kaushik et al. [111] introduce a multi-view conversational search interface with a dialogue-based agent which assists users in refining their searches to obtain relevant results. The system was enabled with image search and implemented with the help of RASA [112]. The dialog structure for the agent is defined in the following three steps: identification of information need of user, formation and presentation of results in the chat, and successful completion of the search task. The system proves to be an effective method of lowering cognitive load and frustration of the user while performing searches. The study by Aliannejadi et al. [113] highlights different conversational search strategies used by an agent that guide the search by providing query suggestions and clarifications. Kaushik et al. [19] introduce a conversational agent to assist users in performing search tasks. The multi-view search interface includes a graphical interface and a conversational agent to help the user to perform searches without increasing their cognitive load. The information retrieval bot provides assistance to the user to develop their query in a series of interactions.

1.
Dialogflow: Dialogflow [114] provides a conversational experience powered by Artificial intelligence. Singh et al. [115] discuss the advantages of a conversation management framework such as Dialogflow in certain use cases. It can receive responses in both text and voice form, and can be integrated across virtually any platform such as speakers, home applications, wearable sensors, etc. The following components of Dialogflow are discussed in [116]: Intent matching, Entity extraction, and Dialog control. The intent matching step recognizes what the user wants. An intent is created for anything the user might request. For example, checking the price of a product, booking an appointment, checking opening hours, etc. A typical Dialogflow agent, which represents a single conversational experience, might have a few to a thousand intents, which are each trained to recognize a specific user need. Entity extraction extracts relevant information from the user query. Dialogflow can extract information using system entities. For example, in Figure 8, the intent is identified as 'Opening_time' and 'time' entity is extracted. Dialog control shapes the flow of the conversation. The subsequent dialogues are interpreted in the context of the previous input.

2.
Alexa: Amazon Alexa [18] is a virtual assistant incorporated with the Internet of Things (IoT). It can respond to any natural language query. One of the many outstanding features of Alexa is the developer tool. Every query given to it is parsed into a data structure and given to the user on AWS, where the user can then write their own custom code around it. Lopatovska et al. [118] explore user interactions with Alexa. Analysis of the data collected suggested that Alexa serves well as a virtual assistant for actions such as playing music, checking weather conditions, etc. 3.
RASA: Bocklisch et al. [112] introduce Rasa, which is an open source machine learning framework for building contextual AI assistants and conversational agents. The model is transparent, which means we can observe exactly what is happening under the hood and customize things precisely. It is a state-of-the-art model and is the most effective and time efficient tool to build complex dialog systems quickly. In RASA, an action is an operation which can be performed by the bot. It could be replying something in return, querying a database, or any other task possible by code. Action could be just a hard coded reply or some API generated response. Stories are a sample interaction between the user and bot, defined in terms of intents captured and actions performed. Harms et al. [119] discuss the use of RASA for dialog management. RASA is made of two components: Natural language understanding (NLU) and dialog management. Natural language understanding is an open-source natural language processing tool for intent classification and entity extraction. It thus helps the bot to attempt to understand the user. Singh et al. [115] explain the implementation of RASA Core. The core is the framework for machine learning-based contextual decision making. It learns by observing the patterns from conversational data. The RASA architecture follows the steps given: The message is received and passed to an interpreter, which converts it into a dictionary including the original text, the intent, and any entities that were found. This part is handled by the NLU.
The Tracker is the object which keeps track of conversation state. It receives the info that a new message has come in. (c) The policy receives the current state of the tracker. (d) The policy chooses which action to take next. (e) The chosen action is logged by a tracker and a response is sent to the user. Figure 9 shows the functionality of RASA.

Reinforcement Learning Based System
Reinforcement learning based systems learn from interacting with users and consequently develop their own control systems. The dialog exchange can be assumed to be a set of actions, which may be chatting or completing tasks. Previous dialog states are taken into account while performing next action, which ensures contextual awareness. Li et al. [120] apply a neural reinforcement learning method, which simulates conversation between two virtual agents, in order to explore all possible actions that can be taken while maximizing reward. The approach is a combination of Sequence to Sequence (Seq2Seq) model and reinforcement learning, which enables to it generate semantically appropriate responses while also optimizing future rewards. Zhao and Eskenazi [121] propose a novel end-to-end framework for a spoken task-oriented dialog system using deep reinforcement learning. A combination of reinforcement learning and supervised learning was used to provide better learning rates. The proposed model resulted in better results compared to a traditional spoken dialog system pipeline. Scheffler and Young [122] propose a method to generate human-computer dialog strategies using reinforcement learning. A simulation tool trained on a corpus of real user data is used to model user behavior. The results were promising as the learned dialog policies outperformed handcrafted policies given the same state space. Dhingra et al. [123] introduce KB-InfoBot, a search agent for searching knowledge bases without the use of complex queries. The study uses a soft retrieval process from the external database. The reinforcement learning process leads to higher success rate against both real world and simulated users.

Evaluation
Evaluation of dialog systems is a cumbersome task as there are no clear metrics on which the performance can be measured. Dialog systems exist for a number of application sectors, and each system needs to be evaluated accordingly. Usually human evaluations are considered to be more appropriate, but they can be very costly and time consuming. The effort for developing an automatic evaluation system is ongoing [25,124]. Evaluation of task-oriented and question answering systems can be relatively easy, as we have a clearly defined and unambiguous task to be performed. For example, answering a frequently asked question correctly and booking a reservation are both measurable tasks. For more complex system, a user satisfaction rating could prove to be useful. Deriu et al. [12] define three important characteristics of an efficient evaluation metric. The evaluation process needs to be automatic, as human labour is time and cost intensive. The process should be repeatable, that is it should yield similar results when used on a dialog system under similar circumstances. The ratings yielded by the evaluation should correlate to human judgements. There is also a need for explainability to figure out which features of the dialog system show correlation to the quality of the dialog. There is a trade-off of comparability to real world scenarios because the environment is very controlled. A large number of users can be recruited via crowd sourcing, for example, Amazon Mechanical Trunk [125]. The quality of the evaluation is shown to be comparable to laboratory conditions, as there is a high variability of user behavior present. In the Loebner prize contest [126], participants develop dialog systems which compete on their ability to fool the judge in a restricted chat session. The contest also evaluates dialog systems in terms of naturalness. Evaluation of spoken dialog systems (SLDs) can be done using the glass-box and black-box dialog quality metrics along with user satisfaction feedback [127]. The glass box metrics evaluate individual components such as sentence understanding and sentence recognition. Black box metrics evaluate system as whole based on user satisfaction and acceptance, evaluation of performance of the system in terms of achieving the task in terms of time taken, and number of turns. Table 1 lists a number of evaluation models implemented with the specific criteria evaluated and corresponding results. For this study, we have classified the available metrics into two categories: automated empirical metrics and user-experience evaluation. User-based evaluation tools are discussed separately for task-oriented and conversational dialogue systems. Figure 10

Empirical Evaluation Metrics
Empirical evaluation metrics can be calculated easily through the use of a mathematical formula or algorithmic script. They are quantitative in nature and do not require much human intervention to calculate. However, they do not provide a very accurate picture of the system's performance. Empirical metrics can be used for evaluating certain aspects of both task-oriented and conversational dialogue systems. For task-oriented systems, the task completion success can be evaluated by gauging the correctness of the solution provided. Efficiency costs are measured by testing the system's ability to help users. Factors such as total turns taken, time elapsed in a dialog, and the total number of queries can be used to measure the efficiency cost. These metrics provide feedback on various aspects of system responses such as similarity to ground truth, understanding of context, diversity in response, etc. Metrics such as BLEU [23], Coherence [136], ROUGE [24], Perplexity [137] have also been used in dialog system research [138][139][140]. The limitation is that in the context of dialog systems, more than one response may be considered valid. Some of the commonly used metrics are elaborated in this section.

Embedding-Based Metrics
Word embeddings are commonly used to represent the words in a document [141]. They are vector representations of a particular word and help to capture the syntactic and contextual functions of the word. Liu et al. [25] highlight several available metrics for evaluation of machine translation responses. The responses generated by a dialog system can also be thought of as an MT response, thereby allowing the use of the same evaluation metrics. Common word-embedding techniques include Word2Vec [46], Bag-of-words [142] and TF-IDF. Methods like Word2Vec approximate the meaning of a word by examining the frequency of its co-occurence with other words in the corpus. Word-embeddings are thereby calculated using distributional semantics. Vectors of individual words in a sentence are then approximated into sentence-level embeddings using some heuristic. The sentence level embedding of candidate and target response are then compared using a measure such as cosine distance. Cosine similarity is the cosine of the angle between the two vectors to approximate how similar the two vectors are.

1.
Greedy Matching: Based on the cosine similarity of their word embeddings, the tokens in two sequences are greedily matched [143]. The total score is then calculated by taking an average of all words. The greedy matching approach generally favors responses with key words that are semantically similar to those in the ground truth response [144]. Rus and Lintean [145] compare greedy and optimal matching methods for two intelligent tutoring systems. The greedy method does not obtain the global maximum similarity score between the candidate and ground responses.

2.
Embedding Average: Embedding average metric calculates sentence-level embeddings using additive composition. Additive composition computes the meanings of phrases by averaging the vector representations of their constituent words [146][147][148].
The embedding average is defined as the mean of the word embeddings of each token in a sentence. The cosine similarity between the respective sentence level embeddings is computed to compare a ground truth response and retrieved response [25].

3.
Vector Extrema: Vector extrema is another method to calculate sentence-level embeddings [149,150]. For each dimension of the word vectors, take the most extreme value amongst all word vectors in the sentence, and use that value in the sentence-level embedding [25].

BLEU
Papineni et al. [23] introduce Bilingual evaluation understudy (BLEU), which is developed from a precision method that automatically evaluates machine translation to measure the quality of the translation. A good quality corpus of human translation is used as a reference. The BLEU score ranges the machine translation from 0 to 1. An n-gram is simply a sequence of N words. Precision score works better with shorter responses. For example, the dialog system may provide two different responses to "Hello": "Hi" or "How are you?". The response with the highest BLEU score is considered to be more appropriate and the dialog system should ideally return this response. The assumption in evaluation through BLEU is that there is only one ground truth response. Dhyani and Kumar [151] made use of a bidirectional recurrent neural network to implement an assistant conversational agent and the BLEU score was calculated for the agent. Li et al. [152] analyze the performance of an information retrieval chatbot using the BLEU-N algorithm. BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores were calculated for the system. Suppose there is a single candidate ground truth response and is denoted as r and the proposed response is denoted asr. The j token in the ground truth response r is denoted by w j .
where k indexes all possible n-grams of length n and h(k, r) is the number of n-grams k in r. BLEU has several shortcomings when it comes to evaluating dialog systems. Callison et al. [153] explore the effectiveness of a BLEU score in machine translation. There is no correlation to human judgements in machine translated responses. Two different replies generated by the system may be equally appropriate but have no n-gram overlap. It is also biased and incapable of considering the semantic similarity between responses.

ROUGE
Recall-Oriented Understudy for Gisting Evaluation (ROUGE), as explained by Lin et al. [24], performs evaluation by comparing ideal summaries created by a user in order to determine the quality as a summary [154]. The precision and recall is calculated using the overlap of words between the system generated summary and the reference summary. The recall in this case refers to how much the system summary captures the reference summary. Dutta and Klakow [155] explain that the ROUGE metric can be used in dialog systems to measure the similarity of the system generated output with the reference response. The precision determines how much of the system summary was relevant. Considering the number of individual words that overlap, precision can be calculated as: where m n output is the number of n-grams in the system output, and m n overlap is the number of overlapping n-grams between system response and reference response.
where m n re f is the number of n-grams in the reference response, and m n overlap is the number of overlapping n-grams between system response and reference response.
ROUGE scores are then calculated as the granularity of texts being compared between the system summaries and reference summaries. The ROUGE-N measures the unigram, bigram, trigram and higher order n-gram overlaps. It is computed as:

METEOR
Banerjee and Lavie [156] introduce Metric for Evaluation of Translation with Explicit ORdering (METEOR), which is an automatic metric which evaluates machine translation results by conducting unigram matching of machine produced and human reference translations. METEOR attempts to overcome some limitations of the BLEU score. Unlike BLEU, METEOR takes into account the recall, which is the proportion of matched n-grams out of the total number of n-grams in the reference translation. METEOR works by aligning the candidate and target responses. Denkowski et al. [92] explain the application of the METEOR score in the evaluation of dialog system.
First, the harmonic mean of the precision (P) and recall (R) of the responses, F mean , is calculated: To account for longer matches of a given sequence of n-grams, a penalty if calculated as follows: The METEOR score is then calculated as:

Perplexity
Perplexity is the inverse likelihood of predicting the responses of the test set, that is, how accurately can the model predict the next dialogue. Chen et al. [137] discuss the use of perplexity in evaluating spoken dialogue systems. It is shown to have poor correlation to word-rate error and thus is not a good metric to evaluate language models. Adiwardana et al. [157] compare perplexity and a human evaluation metric, Sensibleness and Specificity Average (SSA). SSA is a human evaluation metric proposed by Google. Perplexity and SSA are shown to have a very strong correlation for the evaluation of an open-domain chatbot. Jena et al. [158] evaluate the chatbot with perplexity and overlap between the domain-specific and human conversation dialogues.

User Evaluation Methods
User-based evaluation methods are focus on the usability aspect of the dialog system. They provide a more comprehensive picture of the system's performance in real world scenarios. In this section, usability evaluation methods are discussed for task-oriented and conversational dialog systems.

Task-Oriented Dialog Systems
Task-oriented systems have highly structured dialogues and quantifiable tasks. For example, If the venue of an event was requested and the correct information was provided by the dialog system, the task can be considered complete. Thus, it becomes easier to evaluate these systems based on the efficiency of the task completion. User evaluation may also be implemented to make the system more adaptable to real-world scenarios. Common evaluation methodologies used for task-oriented systems are discussed in this section:

1.
User Satisfaction Modeling: The user satisfaction for any dialog system is a good indicator of it's usability. User satisfaction modeling can be conducted in three steps [12].
There is a requirement for explainability, which quantifies the impact that different properties of the dialog system have on user satisfaction. The evaluation process also has to be automated based on the properties of the dialog system. Differentiability requirement evaluates different dialog strategies using models. Two factors need to be considered while user satisfaction modeling. The agent evaluating the dialog system and the granularity at which the system is being evaluated are the major factors. The dialog could be evaluated by the user or by objective judges. The granularity of the dialog lies on two extremes: the evaluation could take place at the dialog level or exchange level. Engelbrecht et al. [159] present a method to model user satisfaction with the use of hidden markov models. Mean squared error in predicting the most probable state at each turn was calculated for the optimized model. This approach provided a way to analyze the models and features that affect the quality ratings, making it comparable to empirical ratings.

2.
User Simulation: User simulators are tools designed to simulate user behaviors. Georgila et al. [160] make use of n-gram user simulation models to evaluate spoken dialog systems. Schatzmann et al. [161] discuss the development of user simulation techniques with reinforcement learning. Two evaluation strategies can be deployed for user simulators: direct and indirect. Direct evaluation is performed on the basis of various metrics, such as precision and recall on the dialog acts, perplexity, etc. Indirect evaluation attempts to evaluate the trained dialog manager. It measures the utility of the user simulation. Kreyssig et al. [162] propose the neural user simulator (NUS), which is a neural network based evaluation approach.

3.
User Experience (UX): An important exploration was done by Holmes et al. [163] to test the applicability of conventional methods to assess conversational user interface.
There are important questions raised in the paper such as how the evaluation results will correlate when applied to conversational agent usability. The usability of the WeightMentor App conducted by Holmes et al. [128], is calculated by using three metrics. The System Usability Scale (SUS) developed by Brooke [164] is one of the most popular and commonly used means of assessing usability. There are a total 10 questions, 5 covering positive aspects and 5 covering negative aspects of the dialog system. Each question is scored out of five. Final scores are then calculated out of 100. The mean WeighMentor score places the system in the 96th ± 100th percentile. Schrepp et al. [165] applied another metric called the User Experience Questionnaire (UEQ), which thoroughly assesses the UX. UEQ tells us to what level does the system meets the user expectation and how it tests against other dialog systems. The system performed well in all UEQ scales. The final metric is the Chatbot Usability Questionnaire (CUQ). Participants were given 16 items relating to the positive and negative aspects of the system. The questions were ranked out of 5, a scale of "Strongly Agree" and "Strongly Disagree". The CUQ test provided a mean score of 76.20 with the highest score of 100. Sharma et al. [166] introduce Atreya Bot, which is developed to facilitate chemical students and researchers in performing drug related queries from the ChEMBL database. The study aims to simplify the process of performing a successful search, outlining the challenges present in fulfilling a query. User frustration and mental workload is relatively low while searching on a conversational agent as compared to a traditional search engine. 4.
PARADISE Framework: The Paradigm for Dialog System Evaluation (PARADISE), proposed by Walker et al. [29], is one of the best-known evaluation frameworks proposed for task-oriented systems. PARADISE framework is based on user ratings on the dialog level and also allows for evaluations of sub dialogues. Figure 11 explores the structure of objectives in the evaluation of a dialog system according to the PARADISE framework. The utterances are compared with a reference answer to perform automatic evaluation. This method has certain limitations, such as the evaluation process cannot discriminate between different strategies, the approach does not always generalize well, and the dialog performance cannot be attributed to system specific properties [167]. The user interacts with the dialog system and proceeds to complete a questionnaire [168]. Responses in the questionnaire are then used to compute a user satisfaction score. This score can be used as the target variable for a linear regression model. Linear regression models can then be trained with the logged conversations serving as input variables. The model can then be fitted to predict the user satisfaction for the given input variables. Variables such as tasksuccess can be extracted automatically, whereas variables such as inappropriate repair utterances need to be manually extracted by experts. The system also performs well with differing user populations and is good at performing predictions for new systems.

Evaluation of Conversational Dialogue Systems
It is trickier to evaluate conversational dialog systems due to the lack of dialogue structure and a clearly defined task to perform. The definition of high quality dialog isn't always clear and depends on the application of the conversational system. Even if there is a clear defined criteria for a high quality dialog, the measurement is not always valid. There is some discussion on whether the conversational agent should pass the Turing test. The consensus is that there should be an emphasis on developing the dialog system to appear and interact more human-like. Similar to the character development of a fictional character, people expect to suspend their disbelief while interacting with a conversational agent. The emphasis should be placed on establishing a rapport with the user during the interaction. Kaushik et al. [124] provide a comprehensive framework for evaluation of conversational agents. The paper identifies the following important criteria while performing evaluation on conversational agents: cognitive load, cognitive engagement, search as learning, knowledge gain, user experience, and software usability. The quality of conversational systems may be measured by the appropriateness of its responses and likeness to human conversations. There are several shortcomings to these approaches. There are generally two levels to evaluate conversational dialog systems: coarse-grained evaluations and fine-grained evaluations. Coarse-grained evaluations focus on the appropriateness of the responses provided by the dialog system. It includes two main concepts: appropriateness of the responses and human likeness. Fine-grained evaluations, on the other hand, focus on specific aspects of the dialog system's behavior. The system's ability to remain coherent and maintain a topic of conversation are measured. Quality of interaction is measured in terms of user satisfaction, that is whether the user gets the information they want, if they are comfortable with the system, and in which form they receive the information.

1.
Appropriateness: It is a coarse-grained concept of dialog evaluation. Many finegrained concepts, such as coherence, relevance, and correctness are also encapsulated within it. The main approach employed includes the word-overlap metrics, originally used in machine translation and summarization. Word-overlap metrics have been discussed in Section 7.1. Word-overlap based scores include BLEU score and ROUGE score, which can serve as an approximation for the appropriateness of an utterance. One drawback of these metrics is that they show no correlation to human judgements. Galley et al. [140] propose ∆BLEU, which incorporates human judgements into the BLEU score. The reference responses in the test set are rated by human judges in relevance to the context. In the model based evaluation approach, such as ADEM [129], user behavior is modeled. A broad array of behavioral aspects need to be considered for the model to prove effective. The impact of using various dialog strategies should be explained by the model. The different types of users and the typical errors made by them while using the system are encapsulated by the model.

2.
Human Likeness: The quality or a conversational agent can be measured by the Turing Test [76]. A conversational system is said to pass the Turing Test if it convincing to a human that it is human as well. The generative adversarial model as proposed by Xu et al. [169] can be used to evaluate dialog systems. The evaluation framework is made up of a generator for generating data and a discriminator to distinguish between real data and artificially generated data. The naturalness of a generated dialog response can be directly calculated by adversarial loss. As explained by Kannan and Vinyals [130], the encoder-decoder architecture employing recurrent neural networks has shown to be particularly helpful in dialog generation for dialog systems. The user query is input by the encoder. The decoder generates a response based on the final state of the first network. The method is based on Generative Adversarial Networks (GANs).

3.
Fine-grained Metrics: Topic based evaluation measures the ability of the conversational agent to coherently talk about different topics. Two dimensions for topic-based evaluation are considered: topic breadth and topic depth [170]. The ability of the system to learn about a large variety of topics is measured by topic breadth, whereas topic depth measures if the system can sustain a long and cohesive conversation about one topic. A deep averaging network (DAN) can be trained to perform topic classification and detection of topic-specific keywords. A large amount of conversational data and questions are used to train the DAN model. There could be multiple utterances that could be considered as acceptable [171]. If at least one annotator marks the response as appropriate, the response is considered appropriate by the weak agreement metric [172]. The weak agreement has certain limitations. It relies heavily on human annotations and is not applicable to large amounts of data. Voted appropriateness takes into account the number of votes an utterance received for a given context, thus overcoming the limitations of weak agreement [172]. Each utterance is weighted uniquely. Voted appropriateness also has a higher correlation to human judgement as compared to weak agreement.

Evaluation Datasets and Challenges
We have discussed the two broad categories of evaluation methods for dialog systems. It is crucial to understand the implementation of a metric as they enable us to explore the shortcomings of a system. However, conversational dialogue is usually unstructured, making it difficult to obtain a reliable evaluation. It is necessary to train the models on structured datasets for reliable evaluation and comparison of different systems. This will also increase the relevance of results to make them more comparable to real-world scenarios. The public availability of datasets has made it easier to evaluate dialog systems. Datasets could be used for several evaluation procedures and are not restricted to one metric. Serban et al. [173] have already carried out evaluation of dialog systems and provided a survey of publicly available datasets. This section covers some popularly used datasets and research challenges that focus on improving the current state of the art in dialogue systems.

Datasets for Task-Oriented Systems
Since task-oriented systems are build to complete a very specific task, it is difficult to find pre-existing datasets for these systems. The dataset for a system might be difficult to re-use. To re-use them, several changes need to be made which requires a lot of human effort. A POMDP-based dialog system mentioned by Gasic et al. [174] makes use of crowd-sources on the Amazon mechanical turk service. The paper also talks about the transferability of datasets to a new domain by creating and keeping the unknown slot as a constraint. Wen et al. [70] developed an end-to-end dialog system and also used Amazon mechanical turk service to find users for user evaluation. The Dialog State Tracking challenge (https://www.microsoft.com/en-us/research/publication/the-dialog-statetracking-challenge-series-a-review/) (accessed on 3 October 2021), which is focused on the development of the dialog state, has also led to the release of several datasets.

Data for Question Answering Dialog Systems
Data for QA dialog systems could be derived from chats on forums or online message boards. For instance, Lowe et al. [131] created the Ubuntu dialog Corpus, which contains close to a million multi-turn dialogues extracted from the chat logs that provided support for Ubuntu related problems. Qu et al. [175] present the Microsoft counterpart, known as the MSDialog. Recently there have been several datasets released publicly for evaluating QA dialog systems. Choi et al. [176] introduce QuAC, or Question Answering in Context. It consists of 14 K information-seeking QA dialog, or 100 K questions in total. Reddy et al. [177] present another such dataset called the CoQAwhich consists of 127 K questions and answers, 8 K of which were obtained from seven different domains.

Data for Conversational Dialog Systems
Micro-blogging or social media websites provide most of the data for training and evaluating conversational dialog systems. Fortunately, we do have publicly available datasets for conversational dialog systems. The Twitter corpus, which inspired the work of Ritter et al. [138], consists of 1.3 million conversations that were made publicly available. Another Twitter based dataset available (https://www.microsoft.com/en-us/download/ details.aspx?id=52375) (accessed on 3 October 2021) is a collection of 4232 three-step conversational snippets. Sordoni et al. [139] utilize the latter Twitter corpus to train an end-to-end system. The Twitter logs were labeled by crowd-sourcing to allow annotators to measure the quality of the response, given the context. Another corpus worth a mention is the Reddit Corpus which consists of 1.7 billion comments.

Evaluation Challenge
Conversational systems become more and more intelligent everyday. Evaluation challenges for dialog systems have become important for setting the current benchmark for state-of-the-art dialog systems. These evaluation challenges release the datasets used in the competition for further research. Kim et al. [178] present one of the popular challenges called Dialog state tracking challenge (DSTC). It has now reached its sixth edition and released DSTC (1-6). Pavlopoulos et al. [179] present another popular challenge which is the Conversational intelligence challenge (ConvAI). The initiative aims to unify the efforts towards building intelligent dialog systems. The tasks to performed be at the challenge include submitting a dialog system to carry on natural conversations with a human based on certain news articles. One of the most popular challenge is the Alexa Prize (https://developer.amazon.com/alexaprize) (accessed on 3 October 2021). Ram et al. [180] provide an overview of the plethora of techniques employed to build the dialog systems such as NLU, context modeling, dialog management, etc. The research and the work done is usually used to advance the current Alexa technology.

Discussion
The review study has emphasized the need to develop an evaluation method for dialog systems which can be used repeatedly without any human intervention. We observed that there are no generally accepted evaluation metrics for dialog systems. The two research questions posed at beginning of the study were: 1.
What are the evaluation methods available for Dialog systems based on the structure of the dialog?
To answer this question, we have performed a thorough search of evaluation methods and categorized them into empirical and user-based methods. We highlighted several evaluation methods for both task-oriented dialog systems and conversational agents. There was also a distinction made between automated quantitative metrics [23,24] and user evaluation methodologies [124,129]. It is easier to evaluate task-oriented systems as they need to complete a specific task, the efficiency of which could be measured. User satisfaction can also be modeled, as we have covered before, and can be use to automate the evaluation process. Conversational agents are a little trickier to evaluate, especially when reducing human effort. While evaluating conversational agents, we do not have a particular set task or a correct answer. The focus is on the 'quality' of the dialog. Human judges annotate the 'appropriateness' of a dialog. The current state-of-the-art for modelling appropriateness is by means of latent representations such as ADEM [129].

2.
What is the requirement of an automated evaluation method for testing the usability of dialog systems? To elaborate on this research question, it is necessary that empirical metrics show high correlation to human judgements, or the evaluation would be out of context. As these metrics do not correlate strongly to human judgements [25], the need for a more comprehensive evaluation metrics became apparent. Scoring high on an empirical evaluation method does not guarantee that the system performs well if users cannot navigate the system easily. On the other hand, user-based evaluation is very subjective. Thus, a gap exists in the evaluation methods available for dialogue systems. There is considerable amount of work in performing evaluation of dialog system with existing metrics. Several of the evaluation methods require human intervention to perform feature engineering, manually annotate data, and double-check the results provided by evaluation metrics. Human labor is too expensive to obtain in most cases. Certain other factors such as cognitive load on the users and human annotators also need to be considered while performing evaluation. An automated evaluation system can considerably reduce human effort, while also being able to provide insight on system quality to improve usability of the system. Existing evaluation methods depend on the type of dialog system they are evaluating, thereby reducing generalization. There is also a lack of well-defined metrics for evaluation of conversational search agents. There is a long way to go in developing evaluation methods for conversational search systems. Current evaluation methods are either very vague or very specific. The future scope of this study is to develop a comprehensive evaluation system by finding some middle ground between empirical and user-based methods. In this study, we attempt to understand the process of performing comprehensive dialog evaluation, which requires automating the process, making it repeatable, and increasing correlation to human judgements. A semi-automated evaluation approach is provided in the study done by Kaushik et al. [124], which presents the framework for an implicit evaluation method. The framework encompasses various information retrieval (IR) and userbased factors such as user satisfaction, knowledge gain, cognitive load, and so on. There is a possibility for such a standard framework accepted by different research paradigms by drafting it into an Application Programming Interface (API). The advantage of this approach is that all the data is collected together and comparative study can be performed. The data can also be stored and used in the future for further analysis. The limitations of this approach is that there is still some human effort required in the form of expert knowledge. Another approach is to develop an in-built evaluation tool with an interactive system. The system is evaluated with each user interaction and the results are sent directly to the analyser. This approach is time-saving, but it makes one-on-one comparison of dialog systems difficult. The system may be updated with each evaluation which may result in a bias, which could be subject to further investigation. Further exploration of factors which affect the conversation in multiple dimensions would be interesting. The issues and challenges discussed in this paper can be discussed by the future research community to develop better dialog systems that provide enhanced user experience.

Conclusions
There are a number of approaches for evaluating dialog systems. However, as stated in the research, no scalable and automated solution exists that does not necessitate considerable human participation. Human assessment may not always be a realistic alternative due to the difficulty in finding enough evaluators to complete the assignment. Each of the assessment methods mentioned can be improved upon and modified to generalize well to dialog systems. It is thus critical to establish an evaluation procedure that can be carried out without requiring too much human input at each phase. Furthermore, the differences in the applications of the dialog systems must be considered when constructing the assessment techniques, since a single measure cannot be applied to all systems. The future scope of this study is to develop a standard evaluation framework for dialog systems. It is crucial that dialog systems designed with different approaches (rule-based or machine learning) can be evaluated within a standard framework with similar parameters so that comparative studies can be performed in a controlled environment. The cognitive load on the user while using the system needs to be kept in mind during evaluation. The results provided by an evaluation should also be understandable to an analyzer or researcher without much expert knowledge, so as to not further increase their cognitive load. Analysis of evaluation results also requires a lot of time and it is prone to human errors. A standard framework will significantly ease the effort and make the evaluation less susceptible to human errors. It will allow comparative studies to be performed between different dialog systems. Automation will also save time by eliminating manual tasks, leaving more room for further experimentation. Thus, we can safely conclude that it is essential to develop a standard evaluation framework acceptable by the dialog research community and adaptable to an automated or semi-automated paradigm. Future prospective research of standard conversational evaluation metrics will set a benchmark for upcoming research in dialog systems. Moreover, the standard evaluation framework can further be designed and developed based on factors such as user experience, interactive experience, cognitive load, and more.