Quality Assessment Methods for Textual Conversational Interfaces: A Multivocal Literature Review

The evaluation and assessment of conversational interfaces is a complex task, since such software products are challenging to validate through traditional testing approaches. We conducted a systematic Multivocal Literature Review (MLR) over five different literature sources to provide a view of the quality attributes, evaluation frameworks, and evaluation datasets proposed to aid researchers and practitioners in the field. We obtained a final pool of 118 contributions, including grey (35) and white literature (83). We categorized 123 different quality attributes and metrics under ten different categories and four macro-categories: Relational, Conversational, User-Centered and Quantitative attributes. While Relational and Conversational attributes are the most commonly explored in the scientific literature, we observed a predominance of User-Centered attributes in industrial literature. We also identified five different academic frameworks/tools to automatically compute sets of metrics, and 28 datasets (subdivided into seven different categories based on the type of data they contain) that can provide conversations for the evaluation of conversational interfaces. Our analysis of the literature highlights that a high number of qualitative and quantitative attributes are available to evaluate the performance of conversational interfaces. Our categorization can serve as a valid entry point for researchers and practitioners to select the proper functional and non-functional aspects to be evaluated for their products.


Introduction
As defined by Radziwill and Benton, conversational interfaces are a class of intelligent, conversational software agents activated by natural language input (which can be in the form of text, voice, or both). Conversational interfaces provide conversational output in response and, if commanded, can sometimes also execute tasks [1]. Conversational interfaces are commonly referred to as chatbots when the interface is purely textual.
The research into the evaluation of chatbots dates back to the early 1970s, when a team of psychiatrists subjected the two earliest ones (Eliza [2] and Parry [3]) to the Turing test. Despite the initial interest from the scientific community, the chatbot topic in the broad sense has been explored in depth by academia only in the last two decades, mainly because of the previous lack of sufficiently sophisticated hardware and theoretical models [4,5]. In recent years, chatbots have also gained considerable commercial interest [6], which has in turn resulted in constant technological advancement.
Global forecasts have shown that the chatbot market is projected to grow from USD 2.6 billion in 2019 to USD 9.4 billion by 2024, at a compound annual growth rate (C.A.G.R.) of 29.7% [7], with healthcare, education, customer service, and marketing as the most affected application domains. The main driver of such commercial interest is the ability of chatbots to provide rapid responses and adequate support to customer requests [8]. The main shortcomings are instead found in the possibility of unhelpful responses and the lack of a human personality. These issues are still slowing down the widespread acceptance of chatbots [9,10]. Due to the increasing economic impact and the mentioned limitations, the need to set comprehensive and replicable approaches to thoroughly test and evaluate chatbots has been brought to light [11].
The software behind chatbots is, however, challenging to verify and validate with traditional testing approaches. Their evaluation is in fact strictly related to their ability to replicate human behavior and to the user's appreciation of their output [12,13]. The nondeterminism of user input also makes the coverage of all possible inputs impractical, and the semantic component of the responses to the users must be taken into account when verifying conversational interfaces. Therefore, academic research has so far failed to converge towards an established set of metrics and actionable approaches to validate conversational interfaces.
The purpose of this paper is to present a comprehensive review of quality properties and attributes related to the quantitative and qualitative verification and validation of chatbots.
To this end, we performed a Multivocal Literature Review (MLR) study that covers not only peer-reviewed works (i.e., white literature or WL) but also grey literature accessible through traditional search engines. By involving the latter, we aimed at capturing valuable information produced by practitioners from the industry, and to compare the practitioners' focus to that of academia. By analyzing the most widespread attributes and properties analyzed in both categories of literature, we discuss potential gaps in current research and practice and the related implications for industry and academia.
The remainder of this paper is organized as follows: • Section 2 presents the background about the evaluation processes of chatbots and compares this work to existing secondary studies in the field; • Section 3 describes the adopted research methods by specifying its goals, research questions, search criteria, and analysis methods; • Section 4 presents the results of the MLR; • Section 5 discusses the implications of the results and the threats to the validity of our study; • Finally, Section 6 concludes the study and presents possible future avenues for this work.

Background
In this section, we summarize the concepts about the chatbot evaluation processes defined in the academic literature. We also present background concepts regarding the application of the MLR research method in the field of Software Engineering, and we discuss the findings of existing secondary studies in this field.

Overview of Quality Assessment Methods for Conversational Interfaces
The literature on chatbots has highlighted a lack of precise guidelines for designing and evaluating the quality of this type of software. Amershi et al. propose a set of guidelines tailored to the peculiar human-AI nature of the interaction with chatbots [14].
The quality attributes for a chatbot can relate to many different aspects of its usage, e.g., the capability of providing the right answers or of inferring the right emotions from human users, or the end user's satisfaction [1]. However, the quality properties to evaluate depend on the purpose and application domain of the specific chatbot to be evaluated, making it difficult to find universal attributes.
Concerning the latest generation of chatbots, based on deep learning and structured information [15], most of the research in the field has emphasized the use of annotated datasets, defined as ground truth. The generation of annotated datasets may imply the presence of human labellers (i.e., a supervised approach) or can be performed automatically based on the characteristics of the data (i.e., an unsupervised approach). Once a ground truth is obtained, the chatbot model learns its behavior from this dataset, and its performance is evaluated over a subset of the dataset, the so-called test set.
Several studies in the literature address the application of automated testing methodologies to verify the quality of chatbot software. The many aspects to be considered in chatbot evaluation, however, make fully automated testing practices harder to adopt than in traditional software domains.
The hardest features to verify are those related to the perception of the chatbot by a human user, and the perceived value of the obtained information. For these reasons, manual testing of chatbots is rarely used alone by practitioners, and is typically conducted with the aid of questionnaires and interviews [16][17][18]. Manual testing is, however, inherently error-prone and labor-intensive; hence, it is typically paired with evaluations by domain experts and aided with platforms for crowdsourced testing [19,20]. Quantitative measurements of the performance of chatbots (e.g., inspection of abandoned dialogues) are instead most often completely automated.

Multivocal Literature Reviews
Ogawa et al. define the concept of the Multivocal Literature Review (MLR) [21] in the field of Education as a research methodology that applies the approach of a Systematic Literature Review to multiple literature sources, i.e., including evidence available on regular search engines. In that sense, MLRs differ from regular SLRs (Systematic Literature Reviews) because they include information obtained from non-academic sources, such as blog posts, web pages, and industry white papers. According to a definition provided by Lefebvre et al., these sources can all be classified as Grey Literature (GL), i.e., literature that is not formally published in sources such as books or journal articles [22]. Several classifications of the forms of GL have been provided in the literature. Adams et al. provide a three-tier classification of GL: 1st tier GL (high credibility), including books, magazines, government reports, and white papers; 2nd tier GL (moderate credibility), such as annual reports, news articles, presentations, videos, Question and Answer websites (such as StackOverflow), and Wiki articles; 3rd tier GL (low credibility), such as blogs, e-mails, and tweets [23].
Although many review studies in the field of Software Engineering (SE) have implicitly incorporated GL to derive their findings, a formalization of the MLR methodology for SE has been provided only recently by Garousi et al. [24]. The authors identified the principal benefit of including GL in literature reviews as the capability of providing useful industry viewpoints and evidence that cannot always be gathered from peer-reviewed literature [25]. Rigorous MLRs have recently been conducted in the field of SE to investigate, for instance, the need for automation in software testing [26], software test maturity assessment and test process improvement [27], security in DevOps [28], the benefits of the adoption of the Scaled Agile framework [29], requirements engineering in software startups [30], technical debt [31], and the financial aspects of its management [32].

Related Work
Several secondary studies examining quality attributes for chatbots are already available in the literature. In Table 1, we report the secondary studies available at the time this review was conducted. For each of them, we report the research methodology employed, the number of primary studies referenced, and the main contribution.
The grey literature work by Radziwill and Benton [1] analyzes data from 46 primary studies, from both academic and grey literature. In the manuscript, the authors define three different categories of quality attributes for conversational agents: effectiveness, efficiency, and satisfaction. These three categories were chosen following the definition of Software Usability, paying close attention not only to functionality but also to human-like aspects. In this work, the quality assessment approaches are reviewed and, ultimately, a synthesized approach is proposed. To date, this study is the one that proposes the most comprehensive list of quality attributes for chatbots. However, the study has some limitations, especially in terms of replicability, since there is no explicit adherence to a review protocol (e.g., Kitchenham's). Moreover, no explicit Inclusion or Exclusion criteria are provided for the selection of manuscripts. In addition to this lack of formality of the review process, the manuscript focuses only on quality attributes, and not on frameworks or datasets that can be utilized for chatbot evaluations. Finally, the manuscript also lacks a mapping section providing a categorization of the available academic research in the field.
Maroengsit et al. [12] performed a survey to assess the various architectures and application areas of chatbots, providing a categorization and analysis based on 30 different conversational interfaces. The authors provide a review of natural language processing techniques and evaluation methods for chatbots. The evaluation methods part is divided into content evaluation, user satisfaction, and functional evaluation. This work has a primary focus on natural language processing and low-level white-box metrics; however, it provides only a limited analysis of black-box metrics focused on the user perspective. It considers only a high-level subdivision for them (e.g., automatic evaluation and expert evaluation), which we deem not sufficient to cover the complexity of all quality attributes and metrics provided by the available research.
Finally, Ren et al. [33] provided a systematic mapping study about the usability attributes for chatbots. Several specific quality assessment metrics and methods are discussed in this work; however, an exhaustive discussion of quality attributes or frameworks is not provided. The authors also provided a classification of conversational interfaces into different categories: AIML (Artificial Intelligence Markup Language), NLP (Natural Language Processing), ORM (Object Relational Mapping), and ECA (Embodied Conversational Agents), but no explicit classification of the evaluation techniques used for each of the categories is provided in the manuscript. In addition to limiting the analysis to the evaluation of usability, the study does not include an analysis of grey literature sources.
The present work aims to review a broader range of recent sources and to integrate the contribution of works from grey literature and practitioners' reports, which are generally included only to a limited extent in the mentioned works. We also aim at providing an analysis of existing frameworks and datasets used for chatbot evaluations, and a mapping of the literature about the mentioned research facets, which has not been provided yet by related studies.

Research Method
This section provides an overview of the research method that we adopted when conducting this study.
We conducted an MLR by following the guidelines provided by Garousi et al. [24]. These guidelines are built upon Kitchenham's guidelines for conducting SLRs [34], with the addition of specific phases that tackle the procedure of selection and filtering of the grey literature.
According to these guidelines, the procedure for conducting an MLR is composed of three main phases:

1. Planning: in this phase, the need for conducting an MLR on a given topic is established, and the goals and research questions of the MLR are specified;
2. Conducting: in this phase, the MLR is conducted, entailing five different sub-steps: definition of the search process, source selection, assessment of the quality of the selected studies, data extraction, and data synthesis;
3. Reporting: in this phase, the review results are reported and tailored to the selected destination audience (e.g., researchers or practitioners from the industry).
In the following subsections, we report all the decisions taken during the Planning and Conducting phases of our study. The Results and Discussion sections of the paper will serve as the output of the Reporting phase.

Planning
This section describes the components of the planning phase according to the guidelines by Garousi et al.: motivation, goals, and RQs. This information is reported in the following.

Motivation Behind Conducting an MLR
To motivate the inclusion of Grey Literature in our literature review, and thus to conduct an MLR, we adopted the approach based on the decision table reported in Table 2, defined by Garousi et al. [24] and based on the guidelines by Benzies et al. [23,35]. One or more positive responses to the questions in the table suggest the inclusion of GL in the review process.
As is evident from our decision table shown in Table 2, we could provide a positive answer to all questions about the addressed subject. Hereby, we provide a brief motivation for each point in the decision table:
1. The subject is not addressable only with evidence from formal literature, since the real-world limitations of conversational interfaces are typically addressed by white literature only to a certain extent;
2. Many studies in the literature provide methods for the evaluation of conversational interfaces with small controlled experiments, which may not reflect the dynamics of the usage of such technologies by practitioners;
3. The context where they are applied is of crucial importance for conversational interfaces, and grey literature is expected to provide more information of this kind since it is more strictly tied to actual practice;
4. Practical experiences reported in grey literature can indicate whether the metrics or approaches proposed in the formal literature are feasible or beneficial in real-world scenarios;
5. Grey literature can reveal the existence of more evaluation methodologies and metrics than those that could be deduced from white literature only;
6. Observing the outcomes of measurements on commercial products can provide researchers with relevant insights regarding where to focus research efforts; conversely, practitioners can deduce new areas in which to invest from the white literature;
7. Conversational interfaces and their evaluation are prevalent in the software engineering area, which accounts for many sources of reliable grey literature.

Goals
This MLR aims to identify the best practices concerning specific procedures, technologies, methods, or tools by aggregating information from the literature. Specifically, the research is based on the following goals:
• Goal 1: Providing a mapping of the studies regarding the evaluation of the quality of conversational interfaces.
• Goal 2: Describing the methods, frameworks, and datasets that have been developed in the last few years for the evaluation of the quality of conversational interfaces.
• Goal 3: Quantifying the contribution of grey and practitioners' literature to the subject.

Review Questions
Based on the goals defined above, we formulate three sets of research questions.
Regarding the first goal, we identify two mapping research questions that can be considered common to all MLR studies:

Conducting the MLR
According to the MLR conduction guidelines formulated by Garousi et al., in this section, we report the source selection, search strings, the paper selection process, and how the data of interest were extracted from the selected sources.
The process that we conducted for this study is described in the following sections and outlined in Figure 1. In the diagram, we report the number of sources in our pool after executing each step of the review.

Search Approach
To conduct the review, we followed these steps:
• Application of the search strings: the specific strings were applied to the selected online libraries (for the white literature search) and to the Google search engine (for the grey literature search);
• Search bounding: to stop the search for grey literature and to limit the number of sources to a reasonable number, we applied the Effort Bounded strategy, i.e., we limited our effort to the first 100 Google search hits;
• Removal of duplicates: in our pool of sources, we kept a single instance of each source present in multiple repositories;
• Application of inclusion and exclusion criteria: we defined and applied the inclusion and exclusion criteria directly to the sources extracted from the online repositories, based on an examination of the titles, keywords, and abstracts of the papers;
• Quality assessment: every source in the pool was read in its entirety and evaluated in terms of the quality of its contribution;
• Backward Snowballing [36]: all the articles in the reference lists of all sources were added to the preliminary pool and evaluated through the application of the previous steps. We also added to the pool of grey literature the grey literature sources cited by white literature;
• Documentation and analysis: the information about the final pool of papers was collected in a form including fields for all the information needed to answer the formulated research questions.

Selected Digital Libraries
To find white literature sources regarding our research goal, we searched a set of academic online repositories. The repository held by the Association for Computational Linguistics (ACL Anthology) was excluded from this list, given that the results showed a complete overlap with those obtained from the Google Scholar engine.
To these sources, we added Google's regular search engine to find grey literature sources related to our research goal.

Search Strings
A pool of terms was defined through brainstorming to determine the most appropriate terms for the search strings. In Table 3, we report the search strings, based on the pool of terms and formulated for each digital library. The search strings include all the elicited synonyms of the terms chatbot, quality assessment, framework, and datasets.

The search strings per source were the following:
• IEEE Xplore: (((chatbot* OR conversational) AND (interface* OR agent*)) AND (metric* OR evaluat* OR "quality assessment" OR analysis OR measur*))
• Elsevier Science Direct: (((chatbot OR conversational) AND (interface OR agent)) AND (metric OR evaluation OR "quality assessment" OR analysis OR measurement))
• ACM Digital Library: (((chatbot* OR conversational) AND (interface* OR agent*)) AND (metric* OR evaluat* OR "quality assessment" OR analysis OR measur*))
• Springer Link: ((chatbot* OR conversational) AND (interface* OR agent*)) AND (metric* OR evaluat* OR "quality assessment" OR analysis OR measur*)
• Google Scholar: metric OR evaluation OR "quality assessment" OR analysis OR measurement "chatbot interface"
• Google: metric OR evaluation OR "quality assessment" OR analysis OR measurement "chatbot interface"

In the search on digital libraries, we filtered the results for publication dates between 2010 and September 2021. For the search on Google Scholar, we used the Publish or Perish (PoP) tool (https://harzing.com/resources/publish-or-perish, accessed on 18 October 2021); for the other sources, we used the official utilities and APIs exposed. Since the final objective was to extract and inspect all the related sources published in the 2010-2021 time frame, the search ordering was not taken into account. We excluded patents from Google Scholar results.
We used a Python script to detect duplicates by retrieving pairs of articles with more than 80% overlapping words in their titles. The stand-alone script analyzed the results provided by the PoP tool and by the APIs (in the form of .csv files) and cycled over all manuscript titles to signal potential overlaps. The correctness of the signalled overlaps was verified by a manual check on the resulting list. A single entry was maintained for each pair of identical articles published in more than one source. A total of 1376 unique white literature papers were gathered in this step.
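The duplicate-detection step described above could be sketched as follows. This is a minimal illustration, not the authors' actual script: the function names and the overlap measure (fraction of shared words relative to the shorter title) are our assumptions; only the 80% threshold comes from the text.

```python
from itertools import combinations

def title_overlap(title_a: str, title_b: str) -> float:
    """Fraction of words shared between two titles (assumed overlap measure)."""
    words_a = set(title_a.lower().split())
    words_b = set(title_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    # Normalize by the shorter title so near-subset titles score high.
    return len(words_a & words_b) / min(len(words_a), len(words_b))

def flag_potential_duplicates(titles, threshold=0.8):
    """Return index pairs whose titles share more than `threshold` of their words."""
    return [
        (i, j)
        for (i, t_i), (j, t_j) in combinations(enumerate(titles), 2)
        if title_overlap(t_i, t_j) > threshold
    ]

titles = [
    "Quality Assessment Methods for Chatbots",
    "Quality assessment methods for chatbots",  # near-duplicate, differs only in case
    "A Survey of Dialogue Systems",
]
print(flag_potential_duplicates(titles))  # [(0, 1)]
```

As in the study, pairs flagged this way would still require a manual check before one of the two entries is removed.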
Regarding grey literature, we collected 100 contributions using the Google search engine. Before performing the search, we cleared the browser's cookies and history to avoid influencing the search's replicability. We narrowed down the search results to web pages published before the end of September 2021 by applying the before: 30 September 2021 modifier at the end of the search string. We excluded academic sources that resulted from searches on the regular Google Search engine. The search hits were ordered by relevance, keeping the default Google Search behavior.

Inclusion/Exclusion Criteria
Inclusion Criteria (from now on, IC) and Exclusion Criteria (from now on, EC) were defined to ensure gathering only the sources relevant to our research goal.
• IC1 The source is directly related to the topic of chatbot evaluation. We include papers that explicitly propose, discuss or improve an approach regarding the evaluation of conversational interfaces.
• IC2 The source addresses the topics covered by the research questions. This means including papers using or proposing metrics, datasets, and instruments for the evaluation of conversational interfaces.
• IC3 The source is an item of white literature available for download and is published in a peer-reviewed journal or conference proceedings; or, the source is an item of 1st tier Grey Literature.
• IC5 The source is related (not exclusively) to text-based conversational interfaces.

Conversely, the exclusion criteria we applied were:
• EC1 The source does not perform any investigation nor report any result related to chatbots, the corresponding evaluation metrics, or datasets used to evaluate chatbots.
• EC2 The source is not in a language directly comprehensible by the authors.
• EC3 The source is not peer-reviewed; or, the paper is Grey Literature of the 2nd or 3rd tier.
• EC4 The source is related exclusively to a different typology of conversational interface.
Sources that did not meet the above Inclusion Criteria, or that met any of the Exclusion Criteria, were excluded from our analysis.
The first round of IC/EC application and theoretical saturation was conducted considering the title and the abstract: 115 papers passed the round. From the grey literature, another 28 documents were added to the pool: 24 from the Google search engine and 4 from white literature snowballing that led to artifacts of grey literature. Note that the order in which the sources are considered influences the final pool, due to the theoretical saturation criterion.

Quality Assessment of Sources
Each author evaluated the quality of the sources based on aspects advised by Garousi's guidelines for performing MLRs: authority of the source, methodology, objectivity, position, novelty, and impact. Each source was hence voted on using a Likert scale. We adopted a threshold of an average score of 2.5 to keep sources in the final pool.
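The filtering rule above amounts to averaging the per-author Likert votes for each source and keeping those at or above the threshold. A minimal sketch, with hypothetical source names and vote values (only the 2.5 threshold comes from the text):

```python
THRESHOLD = 2.5  # minimum average Likert score to keep a source (from the study)

def filter_pool(votes_per_source, threshold=THRESHOLD):
    """Keep sources whose average Likert score meets the threshold."""
    return sorted(
        source
        for source, votes in votes_per_source.items()
        if sum(votes) / len(votes) >= threshold
    )

# Hypothetical votes by three authors on three candidate sources.
votes = {
    "source_A": [3, 4, 2],  # average 3.00 -> kept
    "source_B": [2, 2, 3],  # average 2.33 -> discarded
    "source_C": [4, 3, 3],  # average 3.33 -> kept
}
print(filter_pool(votes))  # ['source_A', 'source_C']
```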

Data Extraction and Synthesis
Once we gathered our final pool of sources, we executed the step of data extraction and synthesis on all white and grey literature works. All the studies were inserted into an online repository that was shared among the authors to facilitate concurrent analysis of the sources.
The contributions were initially described in the Google Docs spreadsheet by comments, summary texts, and inferences drawn from the documents in a descriptive way.
We did not use pre-determined categories to categorize and map the papers and to extract quality attributes from them (and thus, respectively, to address RQ1 and RQ2). Instead, we applied the Grounded Theory approach. We adopted the Straussian definition of Grounded Theory [37], which allows the up-front definition of the Research Questions instead of letting them emerge from the data analysis.
The categories answering the research questions were generated through Open Coding [38]. We did not consider the inferred categories as mutually exclusive, either for the types of contribution and research or for the quality attributes to extract. For quality attributes and contribution types, we also applied the Axial Coding procedure [39] to remove redundancies from the categories and potentially merge the less populated ones.
The open and axial coding procedures were performed on each paper by all the authors of this literature review independently, and divergences were discussed to find a single (set of) categories for each manuscript and attribute.

Final Pool of Sources
After the application of all the stages described above, we obtained our final pool of 118 sources (83 white literature, 35 grey literature), as shown in Table 4. In Figure 2, we report the distribution of the contributions per year, discriminating between the sources gathered with direct search and those obtained through snowballing. Since we did not apply the EC regarding the publication year to the sources obtained through snowballing, we obtained papers published before 2010; we grouped them in the plot's first column. On the other hand, we deem it not meaningful to report the publication year of grey literature sources: by adopting the Effort Bounded strategy (i.e., taking into account only the first 100 hits on the Google search engine), the results are naturally biased towards the most recent sources, and older grey literature (mainly blog posts or similar sources) may no longer be available due to missing systematic archiving. By analyzing the publication years of WL sources, we can see that the number of sources has followed an increasing trend in recent years, while there was little interest in the central portion of the decade. This trend can be justified by the current higher availability of machine learning algorithms and repositories, which can be used to perform more agile assessments and evaluations of conversational agents.

Regarding the distribution of sources by country, the most represented country contributed 56 sources, followed by South Korea (26). Figure 5 shows the number of sources by type of contributors. We divided the sources into three different categories: (i) sources of which all authors were academic; (ii) sources of which all authors were working in industry; (iii) sources that were the output of a collaboration between authors from industry and academia. All-academic sources outnumbered all-industrial sources (58 vs. 42). Of the white literature sources, 52 were the output of academic studies, 18 were the output of collaborations, and 13 were industrial studies. On the other hand, most grey literature papers were industrial (29 sources vs. 6 academic studies).

Results
This section presents the results of our analysis of the gathered sources, and the answers to the Research Questions that guided the data extraction from the selected pool of sources. While thoroughly analyzing all papers in the final pool, we applied the Grounded Theory approach to categorize the papers' types of contribution. The categorization was based only on the main contributions of the papers (i.e., accessory content that is not deeply discussed, or that is not the primary finding of a source, was not considered for its categorization). The categorization was not considered mutually exclusive.
After the examination of all the sources in the final pool, we came up with the following categories of contributions:
• Chatbot description: sources whose primary focus is the description of the implementation of a novel conversational interface.
• Guidelines: sources (typically more descriptive and high-level) that list sets of best practices that should be adopted by chatbot developers and/or researchers to enhance chatbot quality. Sources discussing guidelines do not need to explicitly adopt or measure quality attributes or metrics for chatbot evaluation.
• Quality attributes: sources that discuss, explicitly or implicitly, one or more qualitative attributes for evaluating textual conversational interfaces.
• Metrics: sources that explicitly describe metrics (with mathematical formulas) for the quantitative evaluation of textual conversational interfaces.
• Model: sources whose main contribution is a presentation or discussion of machine learning modules aimed at enhancing one or more quality attributes of textual conversational interfaces. Many different models were mentioned in the analyzed studies. Some examples are: Cuayáhuitl et al., who adopt a 2-layer Gated Recurrent Unit (GRU) neural network in their experiments [40]; Nestorovic, who also adopts a two-layered model, with the intention of separating the two components contained in each task-oriented dialogue, i.e., the intentions of the users and the passive data of the dialogue [41]; and Campano et al., who use a binary decision tree, which allows for the representation of a conditional rule-based decision process [42].
• Framework: sources that explicitly describe an evaluation framework for textual conversational interfaces, or that select a set of parameters to be used for the evaluation of chatbots. In all cases, this typology of sources clearly defines the selected attributes deemed essential to evaluate chatbots. The difference between this category and Quality attributes and Metrics lies in the combination of multiple quality attributes or metrics into a single comprehensive formula for evaluating chatbots.

Figure 6 reports the distribution of all the studies of the final pool according to the type of contribution they provide. In the bar plot, we differentiated the number of white literature and grey literature sources providing each type of contribution. It is worth underlining that the total sum of the contributing sources is higher than the number of papers in the final pool, since the contribution type category did not classify the sources exclusively.
The most common contribution facet was Quality Attributes, with 104 different sources (around 88% of the total); this result was expected, since keywords related to quality evaluation and assessment were central in the search strings fed to the engines. A total of 41 sources (35%) presented metrics for the evaluation of conversational agents, and 29 sources (25%) presented guidelines for the evaluation of conversational agents. The least common category of contribution was chatbot description, with only 12 sources (10%). Figure 7 shows the total number of white literature studies for each year, grouped by the type of contribution provided. The graph shows that some contributions (especially models and published datasets) represented a significantly higher portion of the publications in recent years. On the other hand, a high percentage of sources providing quantitative metrics date back to 2010 or before. We also classified the sources by research type [43], adapting the categorization provided by Petersen et al. [44]: to avoid having too sparse a distribution of our sources among the categories, we adopted four high-level categories to describe the research type of the analyzed manuscripts.
The four research typologies that we considered are the following: • Descriptive and opinion studies: studies that discuss issues about conversational interface validation and measurement and that propose metrics and frameworks for their evaluation from a theoretical perspective. The studies in this category do not propose technical solutions to improve conversational interfaces or to compute quality attributes upon them, nor do they set up and describe experiments to measure and/or compare them. • Solution proposals: studies proposing technical solutions (e.g., new chatbot technologies, machine learning models, and metric frameworks) to solve issues in the field of conversational interfaces, and that explicitly mention quality attributes for chatbots. However, these studies only propose solutions without performing measurements, case studies, or empirical studies about them. The only sources among grey literature that provided empirical evaluations of conversational agents were four studies that we categorized as pointers to white literature documents or Master's theses. The lowest number of occurrences was for solution proposals (15 sources, 13%). We interpret this low number as due to the keywords used in the search strings, which led us to exclude papers proposing technological advancements without featuring explicit evaluations of the technologies in terms of quality attributes and metrics. We applied the Grounded Theory methodology to define a taxonomy of the quality attributes used to evaluate textual conversational interfaces. We refer to the guidelines by P. Ralph for the definition of taxonomies through Grounded Theory in Empirical Software Engineering [45]. We applied the Axial Coding technique to derive macro- and sub-categories in our taxonomy of quality attributes.
Our investigation of the quality attributes used in evaluating conversational agents showed that researchers' practice has moved quite far from the traditional distinction between functional and non-functional quality evaluation. More recent work in the field has started considering the tight connection between conversational agents' responses and their users' emotional sphere [46,47]. Thus, the separation between the concepts of usability and functionality is not as evident for chatbots as it is for traditional categories of software.
We found four main macro-categories: Relational, Conversational, User-Centered, and Quantitative attributes. Each category was divided into sub-clusters. We analyzed the leaves of the taxonomy in each cluster to group together synonyms and different definitions of equivalent non-functional attributes across sources. Figure 10 reports the taxonomy of categories of quality attributes obtained from our literature review. In Table 5, we report the full list of quality attributes found in the considered sources, along with the list and number of sources mentioning each of them.
Below we describe the taxonomy categories and report some examples of how the quality attributes are described in the primary sources: • Relational attributes: quality attributes that measure the chatbot's relationship with the user along human-related aspects, or that describe the human characteristics of a chatbot. Relational aspects do not directly affect the correctness of the communication but rather enrich it by creating emotional bonds between the user and the chatbot.
These attributes cannot always be clearly separated from functionality, since in various applications establishing a human connection with the user is the main functional goal for which the conversational agent is used (e.g., in the medical field). As an example, Looije et al. report that "Research on persuasive technology and affective computing is providing technological (partial) solutions for the development of this type of assistance, e.g., for the realization of social behavior, such as social talk and turn-taking, and empathic behavior, such as attentiveness and compliments" [48]. Among Relational attributes, we identify two sub-categories: -Personality: attributes related to the perceived humanness of the chatbot, generally reserved to describe essential and distinctive human traits. In this category, the most prominent attribute is the Social Capacity of the conversational agent. Chen et al. [46], for instance, identify several attributes that can all be considered expressions of the social capacity of an agent working in a smart home, and that they translate into several guidelines for the behavior of a chatbot (e.g., "Be Friendly", "Be Humorous", "Have an adorable persona"). Several manuscripts also refer to the Empathy of chatbots, i.e., the ability of the chatbot to correctly understand the emotional tone of the user and avoid being perceived as rude.
Another frequently mentioned attribute is the Common Sense of the conversational agent, also defined as the context sensitiveness of the agent [49], or the match between the system and the real world [50]. Many studies in the pool have also mentioned the Ethics of conversational agents: in a grey literature source about the faults of a commercial chatbot, the concept is defined as "the need to teach a system about what is not appropriate like we do with children" [51].
-Relationship with the user: quality attributes that directly affect the relationship between the chatbot and the user. Trust [52] and Self-Disclosure [53], for instance, are essential to triggering rich conversations. Memory (also defined as User History) is essential to keeping the relationship with the user alive over time. Customization (also defined as Personalization, User-Tailored Content [47], or Personalized Experience [54]) is an important parameter to improve the uniqueness of the conversation for each specific user. Finally, Engagement [46] and Stimulating Companionship [55] are relevant attributes to measure the positive impact on the user.
• Conversational attributes: quality attributes related to the content of the conversation happening between the chatbot and the user. We can identify two sub-categories of Conversational attributes: -Language Style: attributes related to the linguistic qualities of the conversational agent's language. The most mentioned language style attribute is Naturalness, defined by Cuayáhuitl et al. as the fact that the dialogue is "naturally articulated as written by a human" [40], and also referred to as Human-like Tone and Fluency. Relevance refers to the capability of the system to convey information in a way that is relevant to the specific context of application [56], to keep the answers simple and strictly close to the subjects [57], and to avoid information overload for the users [58]. Diversity refers to the capability of the chatbot to use a varied lexicon to provide information to the users, and to correctly manage homonymy and polysemy [59].
Conciseness is a quality attribute that accounts for the elimination of redundancy without removing any important information (a dual attribute is Repetitiveness).
-Goal Achievement: attributes related to the way the chatbot provides the right responses to the users' goals. They measure the correctness of the output given by the chatbot in response to specific inputs.
The most cited quality attribute in this category is the Informativeness of the chatbot, i.e., the chatbot's capability to provide the desired information to the user in a given task. Informativeness is also defined as Usefulness, measured in terms of the quantity of the content of the answers given to the users [60], or Helpfulness [61]. Correctness instead evaluates the quality of the output provided by the chatbot, measured in terms of the correct answers provided to the users [62] and the accuracy of the provided content. Proactiveness (in some sources referred to as Control of Topic Transfer [63], Initiate a New Topic Appropriately [46], Topic Switching [64], and Intent to Interact [65]) is the capability of the chatbot to switch or initiate new topics autonomously. Richness is defined as the capability of the chatbot to convey rich conversations with a high diversity of topics [66]. Goal achievement attributes also include the capability of the chatbot to understand and tailor the conversation to the context (i.e., Context Understanding, also defined as Context-Awareness [57], Context Sensitiveness [49], and Topic Assessment [67,68]) and to the user (i.e., User Understanding).
• User-Centered attributes: attributes related to the user's perception of a chatbot. These attributes are mostly compatible with traditional usability-related non-functional requirements of software.
The most frequently cited user-centered attributes are Aesthetic Appearance, User Intention to Use, and User Satisfaction. Aesthetic Appearance refers to the interface that is offered to the user for the conversation. Bosse and Provost mention the benefits of having photorealistic animations [69]; Pontier et al. performed a study in which the aesthetic perception of the participants was evaluated on a scale going from attractive to bad-looking [70]. User Satisfaction is defined as the capability of a conversational agent to convey competent and trustworthy information [71], or the capability of the chatbot to answer questions and solve customer issues [72]. User Intention to Use is defined as the intention of a user to interact with a specific conversational agent [65]. Jain et al. measured the user's intention to use the chatbot again in the future as an indicator of the quality of the interaction [73]. Ease of Use is defined as the capability of the chatbot to offer easy interaction to the user [57], i.e., to allow the users to write easy questions that are correctly understood [58] and to keep the conversation going with low effort [74]. The ease of use of a chatbot can be enhanced, for instance, by providing routine suggestions during the conversation [75]. Other important parameters for the usability of a conversational agent are the Mental Demand and Physical Demand required by an interaction with it [50,56,73]. • Quantitative metrics: metrics that can be objectively computed with mathematical formulas. Metrics are generally combined to provide measurements of the quality attributes described in the other categories. We can divide this category into the following sub-categories: -Low-Level Semantic: grey-box metrics that evaluate how correctly the conversational agent's models interpret and classify the input provided by the user.
Several papers, especially those based on machine learning approaches, report metrics of this kind, e.g., word embedding metrics [76] or confusion matrices [77]. The most cited metrics of the category are common word-overlap-based metrics (i.e., metrics that rely on the frequencies and positions of words with respect to a ground truth, e.g., a human-annotated dataset), such as BLEU, ROUGE, METEOR, and CIDEr, and word-embedding-based metrics like Skip-Thought, Embedding Average, Vector Extrema, and Greedy Matching [60]. Low-Level Semantic metrics are typically employed to evaluate the learning models adopted by the chatbots and to detect common issues like underfitting (i.e., the inability to model either the training data or new data), which translates to a very low BLEU score, or overfitting (i.e., poor performance on new data and hence low generalizability of the algorithm), which translates to a very high BLEU score on the training data.
In this category of metrics we also include traditional metrics used to evaluate semantic prediction [78]. -Time-related metrics: the time and frequency of various aspects of the interaction between the user and the chatbot. The most mentioned metric is the Response Time, i.e., the time taken by the chatbot to respond to a single request by the user [79]. Other manuscripts report the time to complete a task (i.e., a full conversation leading to a result) or the frequency of requests and conversations initiated by the users.
-Size-related metrics: quantitative details about the length of the conversation between the human user and the chatbot. The most common is the number of messages, also referred to as the number of interactions or utterances, with several averaged variations (e.g., the number of messages per user [41] or per customer [80]).
-Response-related metrics: measures of the number of successful answers provided by the chatbot to meet the users' requests. Examples of this category of metrics are the frequency of responses (among the possible responses given by the chatbot) [81], the task success rate (i.e., the proportion of entire conversations leading to a successful outcome from the user's point of view) [82], the number of correct (individual) responses, and the number of successful (or, vice versa, incomplete) sessions [83].
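Several of the quantitative metrics described above can be sketched in a few lines of Python. The sketch below is purely illustrative and is not taken from the reviewed sources: the log field names (`goal_reached`, the timestamp pairs) are hypothetical, and `simple_bleu` is a simplified variant of BLEU (clipped n-gram precision with a brevity penalty, no smoothing, single reference).

```python
import math
from collections import Counter

def distinct_n(responses, n=2):
    # Diversity: fraction of unique n-grams over all n-grams produced by
    # the chatbot; higher values indicate a more varied lexicon.
    ngrams = Counter()
    for response in responses:
        tokens = response.lower().split()
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

def simple_bleu(candidate, reference, max_n=2):
    # Word-overlap metric in the spirit of BLEU: geometric mean of clipped
    # n-gram precisions times a brevity penalty. Real implementations
    # (e.g., in NLTK or sacrebleu) add smoothing and multiple references.
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = sum(cand_ngrams.values())
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)

def task_success_rate(sessions):
    # Response-related: share of sessions whose goal was reached
    # ("goal_reached" is a hypothetical log field, not a standard format).
    return sum(1 for s in sessions if s["goal_reached"]) / len(sessions) if sessions else 0.0

def mean_response_time(turns):
    # Time-related: average seconds between request and reply,
    # given (request_timestamp, response_timestamp) pairs.
    return sum(resp - req for req, resp in turns) / len(turns) if turns else 0.0

score = simple_bleu("the cat sat on the mat", "the cat is on the mat")
```

A chatbot that always produces the same answer scores low on `distinct_n`, while `simple_bleu` rewards candidates that share clipped n-grams with the reference, illustrating why word-overlap metrics alone cannot capture relational or user-centered attributes.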

Proposed Frameworks for the Evaluation of Conversational Agents (RQ2.2)
In this section, we describe the set of explicitly mentioned frameworks proposed and/or implemented in the sources of the final pool. A summary of the frameworks is provided in Table 6. • ADEM: an automated dialogue evaluation model that learns to predict human-like scores for responses provided as input, based on a dataset of human responses collected using the crowdsourcing platform Amazon Mechanical Turk (AMT).
The ADEM framework computes versions of the Response Satisfaction Score and Task Satisfaction Score metrics. According to the empirical evidence provided by Lowe et al. [84], the framework outperforms available word-overlap-based metrics like BLEU. • Botest: a framework to test the quality of conversational agents using divergent input examples. These inputs are based on known utterances for which the right outputs are known. The principal metrics computed by the Botest framework are the size of the utterances and conversations and the quality of the responses. The quality of responses is evaluated at the syntactic level by identifying several possible errors, e.g., word order errors, incorrect verb tenses, and wrong synonym usage. • Bottester: a framework to evaluate conversational agents through their GUIs. The tool computes time and size metrics (mean answer size, answer frequency, word frequency, response time per question, mean response time) and response-based metrics, i.e., the number and percentage of correct answers given by the conversational interface. The tool receives files with all the questions to submit, the expected answers, and configuration parameters for the specific agent under test, and evaluates the proportion of correct answers. The evaluation provided by the tool is focused on the user perspective. • ParlAI: a unified framework for testing dialogue models. It is based on many popular datasets, allows seamless integration with Amazon Mechanical Turk for data collection and human evaluation, and is integrated with chat services like Facebook Messenger. The framework computes accuracy and efficiency metrics for the conversation with the chatbot. • LEGOEval: an open-source framework that enables researchers to evaluate dialogue systems by means of the online crowdsourcing platform Amazon Mechanical Turk.
The toolkit provides a Python API that allows the personalization of the chatbot evaluation procedures. Table 7 reports the list of datasets mentioned in the selected pool of sources. In our analysis of the pool, we did not find any dataset explicitly defined to evaluate chatbots. The exception is conversational agents based on machine and deep learning approaches, where datasets are divided into training, validation, and test sets to validate the learning approaches. In this section, we list all the specific data sources used to evaluate chatbots or mentioned in the papers that we evaluated, with a categorization based on the type of data they contain.
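For learning-based agents, the datasets listed in Table 7 are typically partitioned before evaluation, as mentioned above. A minimal, illustrative sketch of such a partition follows; the 80/10/10 proportions and the fixed seed are our own assumptions, not a procedure prescribed by the reviewed sources.

```python
import random

def train_val_test_split(dialogues, val_frac=0.1, test_frac=0.1, seed=42):
    # Shuffle a dialogue dataset reproducibly and split it into training,
    # validation, and test sets, as commonly done when a learning-based
    # chatbot is evaluated on held-out conversations.
    items = list(dialogues)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

# With 100 dialogues and the default fractions: 80 train, 10 val, 10 test.
train, val, test = train_val_test_split(range(100))
```

Fixing the seed makes the split reproducible across evaluation runs, which helps when comparing metric values for different chatbot versions on the same held-out data.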

Proposed Datasets for the Evaluation of Conversational Agents (RQ2.3)
• Forums and FAQs. Since forums and FAQ pages have a question-and-answer, conversational nature, they can be leveraged as sources of dialogue to train and test chatbots. The datasets of this category used in the examined sources are the following: -IMDB Movie Review Data: a dataset of 50K movie reviews, containing 25,000 reviews for training and 25,000 for testing, with a label (polarity) for sentiment classification. The dataset is used, for instance, by Kim et al. to address common sense and ethics through a neural network for text generation [90]. -Ubuntu Dialog Corpora: a dataset of 1 million non-labeled multi-turn dialogues (more than 7 million utterances and 100 million words). The dataset is oriented to machine learning and deep learning algorithms and models. -Yahoo! Answers: datasets hosted on GitHub repositories, containing scrapings of the questions and answers on the popular community-driven website. -2channel: a popular Japanese bulletin board that can be used as a source for natural, colloquial utterances, containing many boards related to diverse topics. -Slashdot: a social news website featuring news stories on science, technology, and politics, which are submitted and evaluated by site users and editors.
• Social Media. Social media are sources of continuous, updated information and dialogue. It is often possible to model the users and access their history and metadata (e.g., preferences, habits, geographical position, personality, and language style). It is also possible to reconstruct the social graphs, i.e., the social relations between users and how they are connected. Several social media platforms, to maximize the exploitation of their data and enable a higher number of analyses, make their data available through APIs, allowing developers to extract datasets.

To answer RQ3.1, we analyzed white literature to find mentions of industrial products used to conduct experiments or to compute quality attributes. Figure 13 reports the number of mentions in white literature for these technologies. The highest number of mentions (14) for a product from the industry was obtained by Alexa, the platform from Amazon that provides developer kits for the creation of skills, i.e., sets of voice-driven capabilities [80]. Amazon has defined several metrics to evaluate the performance of Alexa skills (e.g., in terms of time and size of the conversations) [83]. Alexa is closely followed in mentions by the Facebook platform (13), which offers developer tools allowing the creation of bots on the Messenger messaging platform. Other frequently mentioned products are Siri, the conversational agent developed by Apple; Cortana by Microsoft (which comes with the definition of many measurable quality attributes [120,149]); and Watson by IBM (for which a series of best practices have been defined [82]). To answer RQ3.2, we analyzed how much the selected white literature and grey literature sources contributed to the answers to the previous research questions of the paper.
In Figure 14, we report how many metrics of each category were exclusively present in white literature or grey literature, and how many are common to both types of sources. A total of 27 quality attributes (23% of the total) were only presented or discussed in grey literature sources. The categories where the contribution of grey literature was more significant were those closest to the user's point of view, i.e., the User-Centered, Standard Questionnaires, and Response-related quality attributes. This result can be justified by the fact that the grey literature in the field of conversational agents is typically more related to real-world measurements on commercial chatbots; hence, it is based on the measurements of responses provided to real users of such systems. Conversely, white literature on the topic is more related to the definition of models to drive the conversation between the human and the conversational agent; hence, it favours the discussion of Low-Level Semantic metrics. We also observe a predominance of white literature in the exploration of relational attributes of conversational agents, which involve human-science-related analyses. In Figure 15, we report the number of datasets exclusively mentioned, used, or presented in white literature or grey literature, and the number common to both types of sources. The contribution from grey literature to the knowledge gathered to answer RQ2.3 is not negligible, since 5 of the 28 datasets are mentioned only in grey literature sources (while 22 are mentioned only in white literature, and one, Twitter, is mentioned in both types of literature). More specifically, the medical databases and Schema.org (see

Summary of Findings
The main objective of our work was to classify quality attributes for conversational agents and describe the different categories under which they can be filed. To that end, we considered both white and grey literature, analyzing the quality attributes mentioned and utilized in peer-reviewed manuscripts as well as the procedures and evaluations used in practitioner and industrial literature.
The first goal was to produce a mapping of all the studies, subdividing them according to the type of contribution provided and the research methodology employed. We found that the white literature sources discussing conversational interface evaluation and assessment were homogeneously distributed between the different categories. In contrast, grey literature leaned towards presenting quality attributes, guidelines, and frameworks, with less formalized models, metrics, and presentation of chatbots and datasets. This result can be justified by the nature of most of the grey literature sources considered, e.g., blog posts that are less likely to feature quantitative studies than formally published academic works.
By analyzing the distribution per year of the typologies of contributions in white literature from 2010 to 2021, we could deduce two principal trends: in general, more peer-reviewed papers about conversational interface evaluation have been published in recent years than at the beginning of the decade; furthermore, we can observe an increasing trend in the number of papers systematically defining and/or using quality attributes and metrics for the evaluation of conversational agents. Identifying these trends can encourage researchers in the field of software metrics to adopt established frameworks and software metrics instead of defining new ones, given that a relevant corpus of quality attributes has been defined in recent years.
To answer RQ2.1, we built a taxonomy of quality attributes for conversational agents, obtaining ten different typologies of attributes grouped in four macro-categories. From a strictly numerical perspective, the most populated macro-category was the one including quantitative metrics (51 out of a total of 123 different metrics). We found that the most commonly adopted and mentioned quality attributes belonged to the qualitative categories related to the relationship with the user, the conversation with the relational agent, and the user's perception of this conversation. This result suggests that the research in conversational agent evaluation currently lacks a wide adoption of quantitative methods.
Our findings answering RQ2.2 suggest a lack of structured, comparable, and standardized evaluation procedures, since we could find only five different frameworks in the selected pool of sources. It is worth underlining that, in the studies that we analyzed, we did not find mentions of fully structured approaches to aid decision-making in the design of conversational interfaces based on measured qualitative or quantitative attributes. A recent and promising example in the literature is presented in a work by Radziwill et al., where the computation of quality attributes for chatbots is integrated into an Analytic Hierarchy Process (AHP). This process can be used to compare two or more versions of the same conversational system, which can be either the currently available one (as-is) or one or more future versions in development (to-be) [1].
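The AHP step mentioned above can be illustrated with a small sketch: given pairwise importance judgments among quality attributes, the principal eigenvector of the comparison matrix yields priority weights for the attributes. The attribute names and judgment values below are hypothetical and purely illustrative, not taken from Radziwill et al.

```python
def ahp_weights(pairwise, iterations=100):
    # Derive priority weights from a pairwise comparison matrix via power
    # iteration (approximating the principal eigenvector), as done in the
    # Analytic Hierarchy Process. Illustrative sketch only; a full AHP
    # implementation would also compute a consistency ratio.
    n = len(pairwise)
    w = [1.0 / n] * n
    for _ in range(iterations):
        w_new = [sum(pairwise[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(w_new)
        w = [x / total for x in w_new]
    return w

# Hypothetical judgments over three attributes (Correctness, Naturalness,
# Response Time): Correctness is 3x as important as Naturalness and 5x as
# important as Response Time; Naturalness is 2x Response Time.
matrix = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]
weights = ahp_weights(matrix)  # roughly [0.65, 0.23, 0.12]
```

The resulting weights could then be used to aggregate per-attribute scores of two chatbot versions (as-is vs. to-be) into a single comparable figure.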
Finally, to answer RQ3, we compared the contributions of white and grey literature to the facets analyzed in the previous research questions. Considering the contributions to RQ2.1 and RQ2.3, we found that 27 out of 123 metrics (23%) and 5 out of 28 datasets (18%) for the evaluation of conversational agents are mentioned only in grey literature sources from our pool. These results underline how taking into account manuscripts that are not peer-reviewed provides an important added value for researchers when the research objective is to assess and evaluate conversational agents.

Threats to Validity
Threats to Construct validity for a Literature Review concern possible failures in the coverage of all the studies related to the review topic. In this study, we mitigated the threat by selecting five essential sources of white literature studies and including grey literature to consider chatbots, datasets, and evaluation metrics that are not presented in peer-reviewed papers.
For both typologies of literature, we applied a reproducible methodology based on established guidelines. To broaden the search as much as possible, we included the most commonly used terms in the search strings, as well as various synonyms. However, it is still possible that some terms describing other relevant works in the literature were overlooked. The definition and identification of the right keywords for the search strings could also be influenced by the lack of unanimity, in this particular field, about how some concepts are defined (e.g., the same concepts are defined as quality attributes in some studies and as metrics in others).
Regarding grey literature, there is a possibility that some relevant chatbots, frameworks, and evaluation procedures were not included in the analysis due to the inability to access the documents where they are presented.
Threats to Internal validity are related to the data extraction and synthesis phases of the Literature Review. All the primary sources resulting from the application of the search strings were read and evaluated by all authors and collaborators of this study to assess their quality, apply inclusion and exclusion criteria, and extract the information to answer the research questions of the study. Hence, the validity of the study is threatened by possible errors in the authors' judgment when examining the sources, and by possible misinterpretations of the original content of the papers. This threat was mitigated by multiple readings of the same sources by different researchers and by discussions about disagreements during the review phase.
Threats to External validity concern the generalizability of the findings of the Literature Review. For this study, we limited our investigation to textual chatbots, or voice chatbots whose inputs can be converted to text. There is no guarantee that the quality attributes we found can be generalized to every category of chatbot available in the literature.
Due to the differing accessibility of grey literature sources, it is also possible that this review provides only partial geographical coverage of commercial chatbots.

Conclusions and Future Work
In this study, we defined, conducted, and documented the results of a Multivocal Literature Review, i.e., an SLR conducted by taking into account different typologies of sources, not only peer-reviewed white literature. We applied this research methodology to the field of assessment and evaluation of conversational interfaces.
The principal goal of our review was to identify quality attributes that are used by either researchers or practitioners to evaluate and test conversational interfaces. We came up with 123 different quality attributes, five tools and frameworks for systematic and automated evaluations of conversational interfaces, and 28 datasets commonly used for performing evaluations. The quantity of information coming from grey literature alone can be deemed a confirmation of the necessity of including grey literature for this very specific topic, since evaluation methods that can be useful to researchers are often disseminated in non-peer-reviewed sources.
The primary objective of this manuscript is to serve as a comprehensive reference for researchers and practitioners in the field of conversational agent development and testing, providing a categorization of, and a set of references to, the quality attributes already defined in the literature. As future extensions of the present study, we plan to explore the possibility of developing automated or semi-automated tools to collect the measurable quality attributes of existing chatbots. We also plan to find strategies to prioritize the different quality attributes and to implement a subset of them into a framework that can serve as a global means of evaluating any chatbot type. Finally, we aim to analyze multiple empirical measurements on a diverse set of chatbots, both commercial and academic, to seek correlations and dependencies between different metrics.
Funding: This work was partially funded by the H2020 EU-funded SIFIS-Home (GA #952652) project.

Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable, the study does not report any data.