Article

AI Testing for Intelligent Chatbots—A Case Study

1 Department of Computer Engineering, College of Engineering, San Jose State University, San Jose, CA 95192, USA
2 ALPSTouchStone, Inc., San Jose, CA 95134, USA
* Author to whom correspondence should be addressed.
Software 2025, 4(2), 12; https://doi.org/10.3390/software4020012
Submission received: 7 February 2025 / Revised: 3 May 2025 / Accepted: 6 May 2025 / Published: 15 May 2025

Abstract

The decision tree test method works as a flowchart structure for conversational flow. It has predetermined questions and answers that guide the user through specific tasks. Inspired by principles of the decision tree test method in software engineering, this paper discusses intelligent AI test modeling chat systems, including basic concepts, quality validation, test generation and augmentation, testing scopes, approaches, and needs. The paper’s novelty lies in an intelligent AI test modeling chatbot system built and implemented based on an innovative 3-dimensional AI test model for AI-powered functions in intelligent mobile apps, supporting model-based AI function testing, test data generation, and adequate test coverage result analysis. A case study is provided using Wysa, a mental health and emotional intelligence chatbot system that tracks and analyzes mood and supports sentiment analysis.

1. Introduction

An intelligent chat system is a computer-based system with built-in AI solutions that supports diverse system-and-user interactions with customers: answering questions, chatting on different subjects, and acting as an intelligent agent (such as a finance agent, customer service agent, real estate agent, and so on). An intelligent chat system is usually developed with AI techniques, natural language processing, and machine learning models to accept, process, and understand diverse questions as inputs from users; generate appropriate responses to answer their questions; facilitate the completion of transactions; or walk users through a customer support process and resolve their issues, typically represented as a virtual avatar.
Given the advantages of intelligent chat systems (such as chatbots) over other customer support options, using chatbots is an effective method of customer engagement. They are becoming popular in business operations for the following reasons: (a) easy connectivity with diverse social media channels and networks, such as websites, email, SMS, or messaging applications; (b) cost reduction in call centers; (c) easy collection of consumer data from support interactions; and (d) smart interactions with customers.
According to a recent market analysis report [1], the chatbot market is expected to grow by USD 1.11 billion, progressing at a CAGR of almost 29% during the forecast period. Furthermore, organizations across various industry verticals are increasingly adopting AI to make more informed decisions. This provides enhanced customer service, opening several new opportunities for market vendors. In the academic field, comprehensive reviews of the many state-of-the-art research outcomes in dialogue systems [2] have been carried out, and dialogue systems have become a heated research topic.
This growing trend provides good business and research opportunities and brings technical challenges and needs in chatbot system testing and automation. As a special type of intelligent mobile app, smart chat systems present similar testing challenges and issues [3]. For instance, smart mobile apps and modern intelligent chat systems have the following features:
  • Development based on NLP and machine learning models trained on big data.
  • Rich media inputs and outputs: text, audio, and images.
  • Text-to-speech and speech-to-text functions.
  • Text generation, synthesis, and analysis capabilities.
  • Uncertainty and non-determinism in response generation.
  • Understanding of selected languages and diverse questions, and generation of responses with diverse linguistic skills.
These special features bring many issues and challenges in the quality testing and evaluation of intelligent chat systems (like chatbots).
Certain issues [4] have been faced in testing intelligent chatbot systems:
  • Issue 1—Lack of quality validation criteria and quality assurance standards leads to difficulty in establishing well-defined, clear, and measurable quality testing requirements.
  • Issue 2—Lack of well-defined systematic quality testing methods and solutions.
  • Issue 3—Lack of automatic test tools with well-defined, adequate quality test coverage.
  • Issue 4—Lack of automatic adequate quality test coverage analysis techniques and solutions.
The current chatbot testing methods do not consider some important features of intelligent chat systems, such as chatting patterns and steps, domain knowledge, memory, etc., which are discussed in this work. The focus here is on various testing approaches used to validate intelligent chatbot systems and several non-functional quality testing services required for the system. Major causes of quality validation challenges include uncertainty, NLP/ML model diversity, continuous learning, platform diversity, big data diversity, and rich media inputs/outputs. To overcome these challenges and address this diversity, quality testing adequacy was previously pursued using conventional testing methods. However, these methods were developed without considering AI-powered features and do not support rich media input. This led to the study of AI test modeling for intelligent chatbot systems.
The novelty of this paper lies in an intelligent chatbot-based 3-dimensional (input, context, and output) AI test modeling of the mobile app, Wysa, which is an emotional and mental coaching chatbot system. The main focus of the study comprises the following:
  • Application of AI-based 3-dimensional test modeling, providing a reference classification test model for chatbot systems.
  • Discussion of various test generation approaches for smart chatbot systems.
  • AI to support test augmentation (positive and negative) for chatbots.
  • Validation and adequacy of the test result using an AI-based approach for smart chatbot systems.
  • Results and analysis of the AI test, specifically for the Wysa mobile app.
The 3D approach classifies contexts, inputs, and outputs separately, providing much broader coverage than the 2D approach. The chatbot can be analyzed across multiple dimensions rather than just linear text-based input. This means that 3D AI chatbots can integrate and understand spatial relationships, emotions, speech, text, gestures, and object recognition, whereas 2D chatbots are limited to text/voice-based interactions.
This paper discusses the testing of intelligent chat systems, including basic concepts, testing scopes, approaches, and adequate validation criteria. In addition, this paper discusses 3D test modeling and provides a reference classification test model. This paper is structured as follows: Section 2 reviews the related research work on current smart chat systems. Section 3 discusses the testing of the smart chat system, including basic concepts, the validation process, the testing focus points, the quality parameters, the quality validation methods, and the approaches. Section 4 discusses 3D intelligent test modeling and analysis with classified test modeling. Section 5 presents test generation and data enhancement for intelligent chatbot systems. Section 6 reports test results and validation, with a statistical analysis of the results in Section 7. The concluding remarks of this work are presented in Section 8.

2. Literature Review

Many tools are available for chatbot development, but testing support for chatbots is very limited. Many existing chatbot development platforms, such as Dialogflow, Watson, or Lex, provide a web chat console that allows for the manual and informal testing of chatbots. Only a few platforms, like Dialogflow, provide debugging facilities and check the quality of chatbots by detecting intents with similar training phrases. Some companies have developed chatbot testing tools. For example, haptik.ai provides a testing tool that allows for automatic interaction with chatbots through simple scripts and can be integrated with automation servers such as Jenkins. Botium can also be incorporated into test flows.
Table 1 represents a detailed survey based on AI test models of the existing literature and their differences in approach, test modeling, generation, and augmentation that have been discussed in this study.
For academic proposals, the authors introduced an approach for the functional testing of a hotel booking chatbot and applied AI planning techniques to generate test cases traversing a conversation flow [10]. Divergent inputs (word order errors, incorrect verb tense, synonyms) were created from an initial set of utterances [11], and divergent examples were generated by lexical substitutions [12] that retain the same meaning based on earlier studies. In recent years, there has been a surge of work on evaluating chatbots based on test models. Some researchers evaluated chatbots using approaches based on natural language processing (NLP). Besides NLP-based test models, other recent test models can also be leveraged to test dialogue systems. A weakness of existing chatbot frameworks [13] was that different goals in diverse domain-oriented intelligent chat systems were not considered. To address this, the authors adapted the Goal-Question-Metric model [14], a top-down hierarchical model: the top level starts with a goal, which is refined into several questions, and each question is associated with a metric, which may be objective or subjective. Natural language conversation flow diagram (NLCFD)-based test execution [15] was then proposed: a set of specification traces is defined, each of which is used to generate a set of test cases, and each test case is repeatedly executed until all associated possible paths are covered.
Recently, the state of the art of reinforcement learning (RL) methods has been studied for their possible use for different problems of NLP, focusing primarily on conversational systems [16]. The applications of RL in NLP focus on its effectiveness in acquiring optimal strategies, particularly in healthcare settings [17]. Some authors have worked on the review of quality assurance and test automation of the chatbot systems [3,4,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33]. Although these existing models have found various chatbot issues, a comprehensive test model is lacking in addressing the special test focus points and needs in domain knowledge, subjects, memory, diverse questions, and answers in the case of a mobile app chatbot system. This paper focuses on test modeling challenges and needs and provides a reference classification test model for the intelligent chatbot system, Wysa. The work provides several testing approaches that validate the intelligent chatbot system, followed by a non-functional quality testing service that helps improve its quality. The main aim of this study is to provide a 3D approach for AI test modeling, where the tree model includes input, context, and output classification.

3. Understanding Testing Intelligent Chatbot Systems

Testing intelligent chat systems refers to system quality validation activities using well-defined systematic methods and solutions to achieve the following three objectives:
  • Objective 1—Validating the system chat intelligence and functions. It focuses on checking how well an intelligent chat system can interact with users to accept and correctly process incoming inputs in text/image/audio and generate the appropriate responses/answers. These interactive chats must be validated in a predefined domain-specific knowledge scope, including content subjects, topics, Q&A patterns, and interaction flows based on the predefined chat intelligence and functions at the system level.
  • Objective 2—Measuring and assuring system non-function quality by evaluating its QoS parameters from different perspectives: (a) language-oriented perception, diversity, and similarity; (b) system-oriented parameters, such as security, reliability, scalability, availability, and performance; and (c) chatting-related parameters, such as user satisfaction, response accuracy/relevance, content correctness, and consistency.
  • Objective 3—Evaluating the system to see if it is trustworthy for users and customers based on user-oriented quality validation and evaluation.

3.1. Testing Approach

There are several testing approaches for validating intelligent chatbot systems. Table 2 compares AI testing, AI-based software testing, and conventional software testing methods. There is a present and future need for AI testing due to several conditions: (a) Much current and future software is built with AI-based features and functions. (b) Existing techniques and tools are inadequate for testing AI-based features and functions. (c) There is a lack of well-defined, experience-approved quality validation models and assessment criteria. (d) There is a lack of AI-based testing methods and solutions for AI software.
In addition, some common facts and features of current AI mobile software apps were found after numerous ones were validated:
  • Limited data training and validation: Most of our tested mobile apps with AI features are built based on machine learning models and techniques, and trained and validated with limited input datasets under ad hoc contexts.
  • Data-driven learning features: Many mobile apps with learning features provide static and/or dynamic learning capabilities that affect the under-test software outcomes, results, and actions.
  • Uncertainty in system outputs, responses, or actions: Since existing AI-based models depend on statistical algorithms, there is uncertainty in the outcomes of AI software. We encountered many mobile apps with AI functions that generated inconsistent outcomes for the same input test data when context conditions changed.
Different testing approaches are given below.
  • Conventional Testing Methods—Using conventional software testing methods to validate any given smart chatbots and applications online or on mobile devices/emulators. These include scenario-based testing, boundary value testing, decision table testing, category partition testing, and equivalence partition testing.
  • Crowd-Sourced Testing—Using crowd-sourced testers (freelanced testers) to perform user-oriented testing for given smart chatbots and systems. They usually use ad hoc approaches to validate the given systems as a user.
  • Smart Chat Model-Based Testing—Using model-based approaches to establish smart chat models to enable test case and data generation, and support test automation and coverage analysis.
  • AI-Based Testing—Using AI techniques and machine learning models to develop AI-based solutions to optimize the smart chatbot system quality test process and automation.
  • NLP/ML Model-Based Testing—Using white-box or gray-box approaches to discover, derive, and generate test cases and data focusing on diverse ML models, related structures, and coverage.
  • Rule-Based Testing—This methodology employs rule-based testing to generate tests and data for handling intelligent chat systems. While effective for traditional rule-based chat systems, it faces numerous challenges in addressing modern intelligent chat systems due to their unique characteristics and the complexities introduced by NLP-based machine learning models and big data-driven training.
  • System Testing—In this strategy, quality of service (QoS) parameters at the system level will be chosen, assessed, and tested using clearly defined evaluation metrics. This aims to quantitatively validate both system-level functions and AI-powered features. Common system QoS factors encompass reliability, availability, performance, security, scalability, speed, correctness, efficiency, and user satisfaction. AI-specific QoS parameters typically involve accuracy, consistency, relevance, as well as loss and failure rates.
  • Learning-Based Testing—In this method, the activities and tests conducted by human testers are monitored, and their test cases, data, and operational patterns are collected, analyzed, and learned from to understand how they design tests and data for a specific smart chat system. Additionally, any bugs they uncover can be learned from and utilized as future test cases.
  • Metamorphic Testing—Metamorphic testing (MT) is a property-based software testing technique, which can be an effective approach for addressing the test oracle problem and test case generation problem.
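As a minimal sketch of how a metamorphic relation might be checked for a chatbot, consider the relation "replacing a word with a synonym should not change the intent of the response". The chatbot stub, the token-overlap similarity, and the threshold below are illustrative assumptions, not the paper's implementation:

```python
# Sketch of a metamorphic test for a chatbot (illustrative assumptions only).
# Metamorphic relation: a synonym-substituted input should draw a response
# consistent with the source input's response. `chatbot_reply` is a
# hypothetical stand-in for the system under test.

def chatbot_reply(utterance: str) -> str:
    # Toy rule-based stand-in for an intelligent chat system.
    if "sad" in utterance or "unhappy" in utterance:
        return "I'm sorry you feel low. Would you like a breathing exercise?"
    return "Tell me more."

def jaccard_similarity(a: str, b: str) -> float:
    """Token-overlap similarity used as a crude response-consistency oracle."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def metamorphic_check(source: str, follow_up: str, threshold: float = 0.8) -> bool:
    """The relation holds if both inputs draw near-identical responses."""
    return jaccard_similarity(chatbot_reply(source), chatbot_reply(follow_up)) >= threshold

# Source test case and its synonym-substituted follow-up case:
assert metamorphic_check("I feel sad today", "I feel unhappy today")
```

The value of this pattern is that no expected output needs to be specified for either input, which sidesteps the test oracle problem for non-deterministic chat responses.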

3.2. Testing Scope

The testing scope for chatbot systems provides an understanding of what testing must be performed to ensure that the chatbot responds and works correctly, is developed to standard, and offers a user-friendly experience. It can be used to make decisions about test planning, the resources to be allocated for tests, and the risks that are likely to be encountered in testing. The testing scope for smart chat systems, shown in Figure 1a, includes the following:
  • Domain Knowledge—Validate a smart chatbot system to see how well the current chatbot system has been trained for selected domain knowledge and demonstrate its knowledge at different levels during its chats and communication with clients.
  • Chatflow Patterns—Validate a smart chatbot system to see how well it is capable of carrying out chats in diverse chatting flow patterns.
  • Chat Memory—Validate the chatbot’s memory capability regarding clients’ profiles, cases, chat history, and so on.
  • Q&A—Validate the chatbot’s Q&A intelligence to see how well an under-test smart chat system is capable of understanding the questions and responses from clients and providing responses like a human agent.
  • Chat Subject—Validate the chatbot’s intelligence in subject chatting to see how well an under-test chatbot system is capable of chatting with clients on selected subjects.
  • Language—Validate a smart chatbot’s language skills in communicating with clients in a selected natural language.
  • Text-to-speech—Check how well the under-test chatbot system can convert text to speech correctly.
  • Speech-to-text—Check how well the under-test chatbot system can convert speech audio to text correctly.
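The text-to-speech and speech-to-text scopes above can be checked together as a round trip: converting text to speech and back should recover the original text up to normalization. The `tts` and `stt` functions below are hypothetical stand-ins for the real conversion components, so this is only a sketch of the check's structure:

```python
# Illustrative round-trip check for text-to-speech / speech-to-text testing.
# `tts` and `stt` are toy stand-ins: a real harness would call the chat
# system's actual conversion functions.

def tts(text: str) -> bytes:
    # Toy "synthesizer": encodes text as bytes in place of real audio.
    return text.encode("utf-8")

def stt(audio: bytes) -> str:
    # Toy "recognizer": decodes the bytes back to text.
    return audio.decode("utf-8")

def normalize(text: str) -> str:
    """Case- and punctuation-insensitive comparison form."""
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).strip()

def round_trip_ok(text: str) -> bool:
    """Pass if the recognized text matches the original after normalization."""
    return normalize(stt(tts(text))) == normalize(text)

assert round_trip_ok("Hello, how are you today?")
```

Normalizing before comparison avoids flagging harmless differences in casing and punctuation, which speech recognizers rarely reproduce exactly.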
Figure 1. (a) Testing scope for intelligent chat functions [4]. (b) Non-function testing scope for intelligent chatting [4].
The non-function quality testing service required for the system shown in Figure 1b includes twelve quality validation focuses as follows:
  • Accuracy—To make sure that the system can accurately process and understand NLP and rich media inputs; generate accurate chat responses in a variety of subject contents, languages, and linguistics using a range of tests with a variety of chat patterns and flows; and assess system accuracy based on well-defined test models and accuracy metrics.
  • Consistency—To ensure that the system can consistently process and understand a wide range of NLP, rich media inputs, and client attention, as well as generate consistent chat responses in a variety of subject contents, languages, and linguistics using a variety of tests with chat patterns and flows, and evaluate system consistency based on well-defined test models and consistency metrics.
  • Relevance—Use established test models and relevance metrics [34] to assess system relevance. This ensures the system can comprehend and process a wide array of NLP and multimedia inputs, identify pertinent subjects and concerns, and produce appropriate chat responses across diverse domains, subjects, languages, and linguistic variations through varied testing methodologies.
  • Correctness—Assess the accuracy of a chat system’s processing and comprehension of NLP and/or rich media inputs, its ability to provide accurate replies related to domain subjects, contents, and language linguistics, using well-defined test models and metrics.
  • Availability—Ensure system availability based on clearly defined parameters at various levels, such as the underlying cloud infrastructure, the enabling platform environment, the targeted chat application SaaS, and user-oriented chat SaaS.
  • Security—Assess the security of the system by utilizing specific security metrics to examine the chat system’s security from various angles. This includes scrutinizing its cloud infrastructure, platform environment, client application SaaS, user authentication methods, and end-to-end chat session security.
  • Reliability—Assess the reliability of the system by employing established reliability metrics at various tiers. This encompasses evaluating the reliability of the underlying cloud infrastructure, the deployed and hosted platform environment, and the chat application SaaS.
  • Scalability—Evaluate system scalability based on well-defined scalability metrics in different perspectives, including deployed cloud-based infrastructure, hosted platform, intelligent chat application, large-scale chatting data volume, and user-oriented large-scale accesses.
  • User satisfaction—Assess user satisfaction with the system by employing well-defined metrics from various angles. This includes analyzing user reviews, rankings, chat interactions, session success rates, and goal completion rates.
  • Linguistics diversity—Assess the linguistic diversity of the intelligent chat system in its ability to support and process various linguistic inputs, including diverse vocabularies, idioms, phrases, and different types of client questions and responses. Additionally, evaluate the system’s linguistic diversity in the responses it generates, considering domain content, subject matter, language syntax, semantics, and expression patterns.
  • Performance—Assess the performance of the system using clearly defined metrics related to system and user response times, processing times for NLP-based and/or rich media inputs, and the time taken for generating chat responses.
  • Perception and understanding—Assess the system’s perception and comprehension of diverse inputs and rich media content from its clients using established test models and perception metrics.
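To make one of the quality validation focuses above concrete, the consistency focus can be quantified with a simple score: send the same input repeatedly under a fixed context and measure what share of the trials produced the most common response. This metric definition is an illustrative assumption, not the paper's own metric:

```python
# Illustrative consistency metric (an assumption, not the paper's metric):
# the share of trials producing the modal response for a repeated input.
from collections import Counter

def consistency_score(responses: list[str]) -> float:
    """Fraction of trials that produced the most common response."""
    if not responses:
        return 0.0
    most_common_count = Counter(responses).most_common(1)[0][1]
    return most_common_count / len(responses)

# Five trials of the same chat input; four identical responses -> score 0.8.
trials = ["Take a deep breath."] * 4 + ["Tell me more."]
assert consistency_score(trials) == 0.8
```

A score of 1.0 means the system answered identically in every trial; lower values expose the non-determinism that conventional pass/fail oracles cannot capture.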
Figure 2 shows the intelligent chat quality metrics, which contribute to a better understanding.

4. AI Test Modeling for Intelligent Chatbot Systems

In software testing, the major objective of a test model is to provide a fundamental base for building systematic test methods and defining adequate test coverage assessment criteria. There is a strong demand for well-defined and practical test models for black-box validation of AI-powered functions in modern intelligent mobile apps and smart systems.

4.1. AI Test Modeling and Analysis

There are three ways to derive 3D AI test models for AI-powered functions: manual derivation, tool-based interaction, and learning-based model derivation. The current version of the AI test tool supports interactive test modeling for test engineers. Figure 3 shows the 3D AI function test model T(F_i) for an AI function F_i.
The 3D intelligent chat test modeling processes are discussed as follows:
  • The 3D Intelligent Chat Test Modeling Process I—In this process, for the selected intelligent chat function feature, set up a 3D classification tree model in three steps. (1) Define an intelligent chat input context classification tree model to represent diverse context classifications. (2) Define an intelligent chat input classification tree model to represent diverse chat input classifications. (3) Define an intelligent chat output classification tree model to represent diverse chat output classifications. Figure 4a shows the schematic diagram for a single-feature 3D classification tree model.
  • The 3D Intelligent Chat Test Modeling Process II—For the selected intelligent chat function features, set up a 3D classification forest tree model in three steps. (1) Define an intelligent chat input context classification tree model to represent diverse context classifications. (2) For each selected intelligent chat AI feature, define one intelligent chat input classification tree model to represent diverse chat input classifications. One input classification decision table is generated based on the defined input tree model. As a result, a chat input forest is created, and a set of input decision tables are generated. (3) For each selected intelligent chat AI feature, define one intelligent chat output classification tree model to represent diverse chat output classifications. One output classification decision table is generated based on the defined output tree model. As a result, a chat output forest is created, and a set of output decision tables is generated. Figure 4b shows the schematic diagram for the 3D classification forest tree model for selected features of intelligent chat functions.
  • The 3D Intelligent Chat Test Modeling Process III—For each selected intelligent chat function feature, set up a 3D classification tree model in three steps similar to that of process I. A corresponding 3D decision table will be generated. After a modeling process, a set of 3D tree models will be derived, and related 3D decision tables will be generated.
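The modeling processes above can be sketched in miniature: once the leaves of the three classification trees are enumerated, each row of the 3D decision table is one (context, input, output) combination. The leaf labels below are invented examples for illustration, not taken from the paper's models:

```python
# Sketch of deriving a 3D decision table from classification trees
# (minimal illustration of Process I; leaf labels are invented examples).
from itertools import product

# Leaves of the three classification trees: context, input, and output.
context_classes = ["new user", "returning user"]
input_classes = ["text question", "audio question", "short reply"]
output_classes = ["suggestion", "follow-up question"]

# Each row of the decision table is one (context, input, output) combination,
# i.e., one classified test condition to cover.
decision_table = list(product(context_classes, input_classes, output_classes))

assert len(decision_table) == 2 * 3 * 2  # 12 classified test conditions
```

This also shows why the 3D model supports adequacy analysis: coverage can be reported as the fraction of decision-table rows exercised by at least one executed test.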

4.2. AI-Based Test Automation for Mobile Chatbot—Wysa (Version 890 v3.7.8)

Wysa is an AI-powered coaching chatbot designed to provide mental health and emotional well-being support. It acts as a virtual companion and coach, offering a safe and confidential space for users to express their thoughts, feelings, and concerns. As an AI Coach, Wysa employs evidence-based techniques from cognitive behavioral therapy (CBT), dialectical behavior therapy (DBT), mindfulness, and other therapeutic modalities to support users in managing stress, anxiety, depression, and other mental health challenges. Wysa offers a wide range of features and functionalities, including mood tracking, conversation analysis, personalized self-care exercises, goal setting, and coping strategies. It can help users identify and challenge negative thoughts, practice relaxation techniques, develop healthy habits, and set and track progress toward their well-being goals. The Wysa AI feature scope is included in Table 3. Figure 5 shows the input, context, and output classification tree sample for the AI-based test automation for the mobile chatbot, Wysa.

4.3. Classified Test Modeling for Intelligent Chat System

Figure 6 shows the sample chat of the mobile chatbot system, Wysa, that works as a mental coach replying with different options and thoughts that a person can work on for self-care. The chat shows the way Wysa tries to help a sad person uplift their mood by suggesting some links to mindful compassion. It also shows that Wysa remembers the person chatting and refers to accessing the journey tab for the previous chat. One can see the interactive pattern of Wysa. It tends to reply to short questions like “tell me more”, to W-questions like “when, where, why, what”, and knowledge questions based on analysis, application, or comprehension. It shows how Wysa tries to keep an interactive chat by suggesting a game.
This subsection gives a classification test model for intelligent chat systems. It could support test design for the following types of intelligent chat testing:
  • Knowledge-based test modeling—This is useful to establish test models focusing on selected domain-specific knowledge of a given intelligent chat system to see how well it understands the received domain-specific questions and provides proper domain-related responses. For service-oriented domain intelligent chat systems, one could consider diverse knowledge of products, services, programs, and customers. Figure 7a shows a knowledge-based test model, while Figure 7b shows a specific example where American history is considered the knowledge domain, with topics and subtopics as shown in the figure.
  • Subject-oriented test modeling—This is useful to establish test models focusing on a selected conversation subject for a given intelligent chat system to see how well it understands the questions and provides proper responses on diverse selected subjects in intelligent chat systems. Typical conversational subjects include travel, driving directions, sports, and so on. Figure 7c shows an elaborate subject-based test modeling for travel and certain queries that a person generally has about it.
  • Memory-oriented test modeling—This is useful to establish test models for validating the memory capability of a given intelligent chat system to see how well it remembers users’ profiles, questions, and related chats, as well as interactions. Figure 8 shows the text-based input tree for short and long-term memory.
  • Q&A pattern test modeling—This is useful to establish test models for validating the diverse question and answer capability of a given intelligent chat system to see how well it handles diverse questions from clients and generates different types of responses. Figure 9 shows the text-based input tree for the question and answer pattern.
  • Chat pattern test modeling—This is useful to establish test models for validating diverse chat patterns for a given intelligent chat system to see how well it handles diverse chat patterns and interactive flows.
  • Linguistics test modeling—This is useful to establish test models to validate language skills and linguistic diversity for a given intelligent chat system. The four aspects of test modeling for linguistics are diverse lexical items, different types of sentences, semantics, and syntax.
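Among the test model types above, memory-oriented testing is the most naturally expressed as a multi-turn script: information given in an early turn must still be recalled in a later one. The `MemoryChatBot` below is a hypothetical stand-in for the system under test, sketched only to show the shape of such a test case:

```python
# Illustrative memory-oriented test sketch. `MemoryChatBot` is a hypothetical
# stand-in for an intelligent chat system with session (short-term) memory.

class MemoryChatBot:
    """Toy chatbot that remembers the user's name within a session."""
    def __init__(self):
        self.user_name = None

    def reply(self, utterance: str) -> str:
        lowered = utterance.lower()
        if lowered.startswith("my name is "):
            self.user_name = utterance[11:].strip()
            return f"Nice to meet you, {self.user_name}!"
        if "my name" in lowered:
            return f"Your name is {self.user_name}." if self.user_name else "I don't know yet."
        return "Tell me more."

# Memory-oriented test case: information from turn 1 must survive to turn 3.
bot = MemoryChatBot()
bot.reply("My name is Ana")
bot.reply("I feel stressed")
assert "Ana" in bot.reply("Do you remember my name?")
```

A real memory-oriented test suite would vary the number and content of intervening turns to probe both short-term and long-term recall, mirroring the input tree in Figure 8.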

5. Test Generation and Data Augmentation for Intelligent Chatbot Systems

This section aims to discuss different ways to generate and augment the text and data for intelligent chatbot systems.

5.1. Test Generation

In this section, different test generation approaches for smart chatbot systems are discussed, namely conventional, random, test model-based, AI-based, and NLP/ML model-based generation.
  • Conventional Test Generation—The conventional testing criteria, including branch coverage, have been discussed in detail earlier [35]. It is simple and easy to use conventional software testing methods to generate tests for a selected smart chatbot system, but there are certain limitations to this method, i.e., (1) it is difficult to perform test result validation using a systematic way, (2) it is hard to evaluate adequate test coverage, and (3) costs to generate chat input tests manually are high.
  • Random Test Generation—Using a random generator to select random chatbot tests as inputs from a given chatbot test DB to validate the under-test intelligent chatbot system. Random text generation can be used to produce unique and engaging text for various applications, from creative writing to marketing content. An improved random test generation approach, guided by feedback from executing test inputs, has also been studied [36].
  • Test Model-Based Test Generation—It is a model-based approach to enable automatic test generation for each targeted AI-powered intelligent chat function/feature [37]. It supports model-based test coverage analysis and complexity assessment. It is important as AI-based test data augmentation solutions and result validation are needed. This paper uses model-based test generation, which is shown in Figure 10, and explains how it works: a 3D classified tree model is used to obtain a decision table from the context, input, and output classifications, which are further segmented to validate the AI-based test results.
  • AI-Based Test Generation—AI techniques and machine learning models are used to generate augmented tests and test data for each selected test case. Two types of data are generated: (1) synthetic data, which are created artificially without using real-world samples; and (2) augmented data, which are derived from original samples (such as images) with minor transformations (flipping, translation, rotation, or the addition of noise) to increase the diversity of the training set. This method comes with its own challenges, including (1) the cost of quality assurance for the augmented datasets; (2) the research and development needed to build synthetic data for advanced applications; and (3) the inherent bias of the original data persisting in the augmented data.
  • NLP/ML Model Test Generation—This approach uses white-box or gray-box techniques to discover, derive, and generate test cases and data based on each underlying NLP/ML model, in order to achieve model-based structural coverage. NLP/ML models have also been used to enhance the generation of computer programming exercises, automatically producing text and source code for the benefit of both students and teachers [38].
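As an illustration of the random approach above, test inputs can be drawn from a chatbot test DB with a seeded random generator. This is a minimal sketch; the test DB entries and category names are hypothetical, not taken from the paper's tooling:

```python
import random

# Hypothetical chatbot test DB: each entry pairs a chat input with the
# response category it is meant to exercise (illustrative values only).
TEST_DB = [
    {"input": "I feel anxious today", "category": "emotive"},
    {"input": "What did we talk about yesterday?", "category": "memory"},
    {"input": "Tell me a breathing exercise", "category": "qa"},
    {"input": "Hi there!", "category": "general"},
]

def random_test_selection(test_db, n, seed=None):
    """Select n chat inputs at random (without replacement) from the test DB."""
    rng = random.Random(seed)
    return rng.sample(test_db, min(n, len(test_db)))

selected = random_test_selection(TEST_DB, 2, seed=42)
```

Seeding the generator keeps a random test run reproducible, which eases debugging when a randomly selected input exposes a failure.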

5.2. Data Augmentation

Text augmentation is a technique used in natural language processing (NLP) that generates new text examples by applying transformations to the original text data. Its main purpose is to increase the training dataset size and introduce variation in the text. Common transformations include synonym replacement, random insertion, random deletion, random swap, text masking, and character-level augmentation. Text augmentation increases dataset diversity and enhances model performance and generalization in NLP tasks such as text classification, sentiment analysis, named entity recognition, and machine translation. There are two types of augmentation techniques, namely positive and negative text augmentation.
The positive augmentation comprises synonym and keyboard augmentation. (a) Synonym augmentation replaces words in a sentence with their synonyms. The purpose is to introduce lexical variation and expand the vocabulary used in the training data. By replacing words with their synonyms, the augmented data expose the model to different word choices that convey similar meanings. This technique helps improve the model’s ability to handle diverse vocabulary and increases its flexibility in generating or understanding sentences with alternative word usage. (b) Keyboard augmentation simulates errors that can occur during manual typing on a keyboard. These errors can include accidental character swaps, deletions, or insertions, often caused by typographical mistakes or the proximity of keys on a keyboard. By introducing such errors into the text data, keyboard augmentation helps the model learn to handle and correct these types of errors. It improves the model’s ability to recognize and interpret text data with typographical variations, enhancing its robustness in real-world scenarios.
The negative augmentation comprises random word swap augmentation, random word delete augmentation, Optical Character Recognition (OCR) augmentation, random char swap augmentation, and random char insert augmentation. Table 4 shows the example of different negative augmentations.
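The positive and negative transformations above can be sketched in Python. This is a minimal, self-contained sketch: the synonym dictionary, keyboard adjacency map, and sample sentence are illustrative assumptions, not Wysa's actual augmentation pipeline:

```python
import random

def synonym_augment(sentence, synonyms, rng):
    """Positive augmentation: replace words with synonyms where available."""
    words = sentence.split()
    return " ".join(rng.choice(synonyms[w]) if w in synonyms else w for w in words)

def keyboard_augment(word, rng):
    """Positive augmentation: simulate a typo using a toy key-adjacency map."""
    adjacent = {"a": "s", "s": "a", "e": "r", "r": "e", "o": "p"}
    chars = list(word)
    candidates = [i for i, c in enumerate(chars) if c in adjacent]
    if candidates:
        i = rng.choice(candidates)
        chars[i] = adjacent[chars[i]]
    return "".join(chars)

def random_word_delete(sentence, rng):
    """Negative augmentation: drop one word at random."""
    words = sentence.split()
    if len(words) > 1:
        del words[rng.randrange(len(words))]
    return " ".join(words)

def random_word_swap(sentence, rng):
    """Negative augmentation: swap two adjacent words."""
    words = sentence.split()
    if len(words) > 1:
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

rng = random.Random(7)
base = "I feel sad today"
variants = [
    synonym_augment(base, {"sad": ["unhappy", "down"]}, rng),
    random_word_delete(base, rng),
    random_word_swap(base, rng),
]
```

Each variant can then be fed to the under-test chatbot alongside the original sentence to check whether the response remains acceptable.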

6. AI-Based Test Result Validation and Adequacy Approaches for Smart Chatbot Systems

AI-based test result validation is required to develop well-defined, adequate validation models and criteria to address and present the special features and needs in testing AI-based functional features, such as object detection and classification, recommendation, and prediction features. Here are some of the adequacy approaches for smart chatbot systems:
  • Conventional Testing Oracle—A test oracle is a mechanism that can be used to check the correctness of a program's output for given test cases. Conceptually, the same test cases are given both to the oracle and to the program under test, and the two outputs are compared to determine whether the program behaves correctly. Ideally, an automated oracle that always gives the correct answer is desired. In practice, however, oracles are often human beings who compute the expected output manually. Consequently, when there is a discrepancy between the program's result and the oracle's, the oracle's result must itself be verified before declaring a defect.
  • Text-Based Result Similarity Checking—For text-based chat outputs, AI-based approaches can be used to perform text similarity analysis to determine if the intelligent chat test results are acceptable. Figure 11a,b show the language-based similarity evaluation. Lexical similarity is a measure of two texts' similarity based on the intersection of word sets from the same or different languages. Lexical similarity scores range from 0, indicating no common terms between the two texts, to 1, indicating total overlap between their vocabularies. Figure 11c shows the second approach, keyword-based weighted text similarity evaluation, with the similarity formula.
  • Image-Based Similarity Checking—For image-based outputs, AI-based approaches can be used to perform image similarity analysis to determine if the intelligent chat test results are acceptable. Diverse machine learning algorithms and deep learning models can be used to compute the similarity at the different levels and perspectives between an expected output image and a real output image from the under-test chatbot system, including objects/types, features, positions, scales, and colors.
  • Audio-Based Result Similarity Checking—For audio-based outputs, AI-based approaches can be used to perform audio similarity analysis to determine if the intelligent chat test results are acceptable. Diverse machine learning algorithms and deep learning models can be used to compute the similarity (at different levels and perspectives) between an expected output audio and the real output audio from the under-test chatbot system, including audio sources/types, audio features, timing, frequencies, noises, and so on.
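For the text-based case, a minimal sketch of lexical similarity (Jaccard over word sets) and a keyword-weighted variant is shown below. The weighting scheme here is an assumption for illustration, not the exact formula of Figure 11c:

```python
def lexical_similarity(text_a, text_b):
    """Jaccard similarity over word sets: 0 = no shared terms,
    1 = identical vocabularies."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def weighted_keyword_similarity(response, keyword_weights):
    """Keyword-based weighted similarity (assumed scheme): the fraction
    of total keyword weight covered by the chatbot's response."""
    words = set(response.lower().split())
    total = sum(keyword_weights.values())
    covered = sum(w for kw, w in keyword_weights.items() if kw in words)
    return covered / total if total else 0.0
```

A test harness would compare the actual chatbot response against the expected one and accept the result when the similarity score exceeds a chosen threshold.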

Model-Based Test Coverage for Smart Chatbot Systems

One can establish a 3D AI test classification decision table for each targeted under-test AI-powered chat function based on the established 3D AI tree model. Using this test classification decision table as a test scheme, a set of 3D AI test classification test cases (known as a 3DT-Set) can be generated. The four test coverage criteria are defined below:
  • Three-dimensional AI test classification decision table test coverage—To achieve this coverage, the test set (3DT-Set) must include at least one test case for every 3D element T (CT-x, IT-y, OT-z) in the 3D AI test classification decision table.
  • Context classification decision table test coverage—To achieve this coverage, the test set (3DT-Set) must include at least one test case for every rule in a context classification decision table.
  • Input classification decision table test coverage—To achieve this coverage, the test set (3DT-Set) must include at least one test case for every rule in an input classification decision table.
  • Output classification decision table test coverage—To achieve this coverage, the test set (3DT-Set) must include at least one test case for every rule in an output classification decision table.
Here, the intent is only to introduce the various coverage criteria. Since this paper focuses on test modeling and data generation, a detailed explanation and examples of the coverage criteria will be presented in future studies.
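As a sketch of how such coverage could be computed, the following measures 3D decision-table coverage of a 3DT-Set, assuming (hypothetically) that the decision table is the cross-product of the three classification rule sets; the rule names are illustrative, not the paper's actual classifications:

```python
from itertools import product

# Hypothetical classification rules for one AI-powered chat function;
# real rule sets would come from the 3D classification tree model.
CONTEXT_RULES = ["quiet-environment", "noisy-environment"]
INPUT_RULES = ["text", "voice"]
OUTPUT_RULES = ["advice", "follow-up-question"]

# Treat the 3D decision table as the cross-product of the three rule sets.
DECISION_TABLE = set(product(CONTEXT_RULES, INPUT_RULES, OUTPUT_RULES))

def table_coverage(test_set):
    """Fraction of 3D cells T(CT-x, IT-y, OT-z) hit by at least one test case."""
    hit = {(t["context"], t["input"], t["output"]) for t in test_set}
    return len(hit & DECISION_TABLE) / len(DECISION_TABLE)

tests = [
    {"context": "quiet-environment", "input": "text", "output": "advice"},
    {"context": "noisy-environment", "input": "voice", "output": "follow-up-question"},
]
```

The per-dimension criteria (context, input, and output coverage) would be computed the same way by projecting each test case onto a single classification.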

7. AI Test Result and Analysis

Wysa uses natural language processing (NLP) to understand and respond to messages in real time. It provides access to a library of mental health resources, including articles, videos, and guided meditations. Wysa provides insights into the user’s mental health patterns while tracking the user’s mood over time. It offers exercises and techniques supported by the facts provided to assist the user in managing various mental health issues and provides a variety of chat modes, such as introspective, coaching, and conversational, to meet the user’s needs and preferences. A built-in gratitude journal and mood tracker are included to assist users in developing a positive mindset and monitoring their growth.
This section first provides the data collection and preprocessing information for a better understanding of the results, and then presents four testing criteria performed on Wysa, namely general responses, memory-based responses, emotive reflexes, and Q&A-based interaction, along with its performance during manual and automated testing. The 3D diagrams provide a structured, multi-dimensional representation of the AI's responses and text complexity, helping in evaluation and debugging.

7.1. Data Collection and Preprocessing

The chatbot testing framework was evaluated using a structured dataset comprising the following:
  • User interactions from the Wysa chatbot, collected over 1 month.
  • Simulated test cases generated based on predefined user intents and real-world chatbot interactions.
  • Augmented data using NLP-based text transformations, including synonym replacement, word swapping, and OCR-induced variations to simulate noisy inputs.
  • Multimodal inputs (text, voice, and emotive indicators) processed using AI-based test scripts collected via the Wysa mobile app.
The preprocessing steps included the following:
  • Text Normalization: Removing stopwords, punctuation, and special characters.
  • Sentiment Analysis: Labeling chatbot responses based on emotional tone (positive, neutral, negative).
  • Context Embedding: Mapping conversation sequences into vectorized representations for contextual accuracy measurement.
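The normalization and sentiment-labeling steps above can be sketched as follows. The stopword list and sentiment lexicons are toy assumptions for illustration; Wysa's actual preprocessing pipeline is not public:

```python
import re
import string

STOPWORDS = {"i", "am", "the", "a", "an", "is", "to", "and"}  # toy list
POSITIVE = {"happy", "calm", "grateful"}   # toy sentiment lexicons
NEGATIVE = {"sad", "anxious", "stressed"}

def normalize(text):
    """Lowercase, strip punctuation/special characters, drop stopwords."""
    text = text.lower()
    text = re.sub(rf"[{re.escape(string.punctuation)}]", " ", text)
    return [w for w in text.split() if w not in STOPWORDS]

def sentiment_label(text):
    """Label a response positive/negative/neutral from keyword counts."""
    tokens = normalize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

A production pipeline would replace the keyword lookup with a trained sentiment model, but the labeling interface stays the same.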

7.2. General Responses

By creating domain-driven tests and test models, this test aims to determine how well a chat system can handle chat questions and answers related to a specific field based on domain-driven validation. This would assist in identifying and evaluating the functionality of an application. Figure 12 shows the 3D diagram for the general responses.
These tests strengthen confidence in the app and its intended behaviors. Figure 13 shows the test case results of manual and automation testing for the general chat responses, with their respective pie-chart and bar-chart representations for multiple scenarios. The figures clearly show that the number of passed test results is higher than the number of failed ones, so the responses can be included for further evaluation.

7.3. Memory-Based Responses

The chatbot must remember the context of the conversation and user preferences. Previously discussed chats were used to answer a current question, and the chatbot was tested on predicting the outcome given that information and those questions. Figure 14 shows the 3D diagram for the memory-based responses. A variety of scenarios were considered to assess both the chatbot's past and present memory. For a chatbot like Wysa, memory testing is crucial, as it must remember user preferences and conversations to respond properly to what the user has been experiencing. Figure 15 shows the test case results of manual and automation testing for the memory-based responses.

7.4. Emotive Reflexes

The Wysa application uses AI technology to test and analyze these emotive reflexes, enabling users to better comprehend and manage their emotions. One of the key features of Wysa is its Emotional Reflexes Test area, which is designed to help users better understand their emotional reactions to different situations. This mental health chatbot makes use of a combination of cognitive behavioral therapy (CBT), meditation, and mindfulness techniques to help users manage stress, anxiety, and depression. Figure 16 shows the test case results of manual and automation testing for the emotional responses of the chatbot with their respective pie-chart representations and bar-charts for multiple scenarios.
A sample bug report is shown in Figure 17, where the AI chatbot could not read an emoji and respond with an appropriate message. This suggests that such chatbots are still learning and need larger and more diverse inputs.

7.5. Q&A Interactions

In addition to responding to domain-relevant requests, the chatbot should also be able to answer domain-specific queries. It is possible to measure the intelligence of a chatbot by its ability to respond to complex, comprehensive questions using its memory and problem-solving skills. In addition to comprehensive questions, there are also short questions, loaded questions, and analytical questions. Figure 18 shows the test case results of manual and automation testing concerning the Q&A-based chat responses with their respective pie chart representations and bar chart for multiple scenarios.

8. Conclusions and Future Scope

Since the use of AI in chatbots is growing in customer service, mental health support, and many other applications, effective testing procedures must be in place. In this paper, a 3D AI test model covering the input, context, and output of a chatbot was proposed to offer methods for testing chatbots. The Wysa case study illustrated how this framework improves chatbot testing, especially in generating test cases, augmenting the data used for testing, and validating the results that arise from the testing. The findings highlight that while 3D AI test modeling improves test adequacy and coverage, challenges such as handling emotive responses, contextual memory, and unpredictable user inputs remain open for further enhancement.
To further advance chatbot testing methodologies, the following actions are recommended:
  • Automation of 3D AI Testing Pipelines—Developing automated tools that integrate test generation, augmentation, and validation within a continuous testing framework.
  • Incorporation of Reinforcement Learning—Enhancing chatbot adaptability by integrating reinforcement learning-based self-improving mechanisms for refining responses.
  • Contextual and Long-Term Memory Testing—Improving test models to evaluate how well chatbots retain and apply past interactions over extended periods.
  • Multimodal AI Testing—Expanding the 3D model to include voice, gesture, and visual input validation, making it applicable to voice assistants and AR/VR interfaces.
Future studies can build upon this 3D AI test model and generalize it to other types of AI applications beyond chatbots, such as intelligent virtual assistants, AI-powered customer service agents, and emotionally aware conversational AI. These improvements will contribute to chatbot systems that are more flexible, support multiple input/output modalities, and are socially and ethically sustainable, giving end users greater confidence and enabling wider adoption across fields. Tools can also be built to support chatbot test automation and test data augmentation, simulating crime-based chatbots and call-based chat systems to support test modeling.

Author Contributions

J.G.: Conceptualization, formal analysis, resources, supervision, review, and administration; R.A.: original draft preparation, validation, case study, writing—review, and editing; P.G.: data curation, methodology, software, validation, formal analysis. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to express their sincere gratitude to the anonymous reviewers for their insightful comments and constructive suggestions, which have greatly contributed to improving the quality and clarity of this manuscript.

Conflicts of Interest

The authors declare no competing interests. The author Radhika Agarwal is employed by the company ALPSTouchStone, Inc. There is no conflict of interest between any of the authors and the company.

References

  1. Businesswire. Global Chatbot Market Value to Increase by $1.11 Billion during 2020–2024|Business Continuity Plan and Forecast for the New Normal|Technavio. Available online: https://www.businesswire.com/news/home/20201207005691/en/Global-Chatbot-Market-Value-to-Increase-by-1.11-Billion-during-2020-2024-Business-Continuity-Plan-and-Forecast-for-the-New-Normal-Technavio (accessed on 15 December 2024).
  2. Ni, J.; Young, T.; Pandelea, V.; Xue, F.; Adiga, V.; Cambria, E. Recent Advances in Deep Learning-based Dialogue Systems. arXiv 2021, arXiv:2105.04387. [Google Scholar]
  3. Tao, C.; Gao, J.; Wang, T. Testing and Quality Validation for AI Software—Perspectives, Issues, and Practices. IEEE Access 2019, 7, 120164–120175. [Google Scholar] [CrossRef]
  4. Gao, J.; Garsole, P.; Agarwal, R.; Liu, S. AI Test Modeling and Analysis for Intelligent Chatbot Mobile App— A Case Study on Wysa. In Proceedings of the 2024 IEEE International Conference on Artificial Intelligence Testing (AITest), Shanghai, China, 15–18 July 2024; pp. 132–141. [Google Scholar] [CrossRef]
  5. Vasconcelos, M.; Candello, H.; Pinhanez, C.; Santos, T.D. Bottester: Testing Conversational Systems with Simulated Users. In Proceedings of the XVI Brazilian Symposium on Human Factors in Computing Systems, Joinville, Brazil, 23–27 October 2017; pp. 1–4. [Google Scholar] [CrossRef]
  6. Xing, Y.; Fernández, R. Automatic Evaluation of Neural Personality-based Chatbots. In Proceedings of the 11th International Conference on Natural Language Generation, Tilburg, The Netherlands, 5–8 November 2018; pp. 189–194. [Google Scholar] [CrossRef]
  7. Bozic, J.; Wotawa, F. Testing Chatbots Using Metamorphic Relations. In Testing Software and Systems; ICTSS 2019; Lecture Notes in Computer Science; Gaston, C., Kosmatov, N., Le Gall, P., Eds.; Springer: Cham, Switzerland, 2019; Volume 11812. [Google Scholar] [CrossRef]
  8. Bravo-Santos, S.; Guerra, E.; de Lara, J. Testing Chatbots with Charm. In Quality of Information and Communications Technology; QUATIC 2020; Shepperd, M., Brito e Abreu, F., Rodrigues da Silva, A., Pérez-Castillo, R., Eds.; Communications in Computer and Information Science; Springer: Cham, Switzerland, 2020; Volume 1266. [Google Scholar] [CrossRef]
  9. Mei, Q.; Xie, Y.; Yuan, W.; Jackson, M.O. A Turing test of whether AI chatbots are behaviorally similar to humans. Econ. Sci. 2024, 121, e2313925121. [Google Scholar] [CrossRef] [PubMed]
  10. Bozic, J.; Tazl, O.A.; Wotawa, F. Chatbot Testing Using AI Planning. In Proceedings of the 2019 IEEE International Conference on Artificial Intelligence Testing (AITest), Newark, CA, USA, 4–9 April 2019; pp. 37–44. [Google Scholar] [CrossRef]
  11. Ruane, E.; Faure, T.; Smith, R.; Bean, D.; Carson-Berndsen, J.; Ventresque, A. BoTest: A Framework to Test the Quality of Conversational Agents Using Divergent Input Examples. In Proceedings of the 23rd International Conference on Intelligent User Interfaces Companion, Tokyo, Japan, 7–11 March 2018; pp. 1–2. [Google Scholar] [CrossRef]
  12. Guichard, J.; Ruane, E.; Smith, R.; Bean, D.; Ventresque, A. Assessing the Robustness of Conversational Agents using Paraphrases. In Proceedings of the 2019 IEEE International Conference on Artificial Intelligence Testing (AITest), Newark, CA, USA, 4–9 April 2019; pp. 55–62. [Google Scholar] [CrossRef]
  13. Kaleem, M.; Alobadi, O.; O’Shea, J.; Crockett, K. Framework for the formulation of metrics for conversational agent evaluation. In Proceedings of the RE-WOCHAT: Workshop on Collecting and Generating Resources for Chatbots and Conversational Agents-Development and Evaluation Workshop Programme, Portorož, Slovenia, 28 May 2016; Volume 20, pp. 20–23. [Google Scholar]
  14. Nick, M.; Tautz, C. Practical evaluation of an organizational memory using the goal-question-metric technique. In Proceedings of the Biannual German Conference on Knowledge-Based Systems, Würzburg, Germany, 3–5 March 1999; pp. 138–147. [Google Scholar]
  15. Padmanabhan, M. Test Path Identification for Virtual Assistants Based on a Chatbot Flow Specifications. In Soft Computing for Problem Solving. Advances in Intelligent Systems and Computing; Das, K.N., Bansal, J.C., Deep, K., Nagar, A.K., Pathipooranam, P., Naidu, R.C., Eds.; Springer: Singapore, 2020; Volume 1057. [Google Scholar] [CrossRef]
  16. Uc-Cetina, V.; Navarro-Guerrero, N.; Martin-Gonzalez, A. Survey on reinforcement learning for language processing. Artif. Intell. Rev. 2023, 56, 1543–1575. [Google Scholar] [CrossRef]
  17. Liu, Y.; Wang, H.; Zhou, H.; Li, M.; Hou, Y.; Zhou, S.; Wang, F.; Hoetzlein, R.; Zhang, R. A review of reinforcement learning for natural language processing and applications in healthcare. J. Am. Med. Inform. Assoc. 2024, 31, 2379–2393. [Google Scholar] [CrossRef] [PubMed]
  18. Aslam, F. The impact of artificial intelligence on chatbot technology: A study on the current advancements and leading innovations. Eur. J. Technol. 2023, 7, 62–72. [Google Scholar] [CrossRef]
  19. Ayanouz, S.; Abdelhakim, B.A.; Benhmed, M. A smart chatbot architecture based NLP and machine learning for health care assistance. In Proceedings of the 3rd International Conference on Networking, Information Systems, & Security, Marrakech, Morocco, 31 March–2 April 2020; pp. 1–6. [Google Scholar]
  20. Bialkova, S. Chatbot Efficiency—Model Testing. In The Rise of AI User Applications; Springer: Cham, Switzerland, 2024. [Google Scholar] [CrossRef]
  21. Bilquise, G.; Ibrahim, S.; Shaalan, K. Emotionally intelligent chatbots: A systematic literature review. Hum. Behav. Emerg. Technol. 2022, 2022, 9601630. [Google Scholar] [CrossRef]
  22. Caldarini, G.; Jaf, S.; McGarry, K. A Literature Survey of Recent Advances in Chatbots. Information 2022, 13, 41. [Google Scholar] [CrossRef]
  23. Gao, J.; Tao, C.; Jie, D.; Lu, S. Invited Paper: What is AI Software Testing? and Why. In Proceedings of the IEEE International Conference on Service-Oriented System Engineering (SOSE), San Francisco, CA, USA, 4–9 April 2019. [Google Scholar] [CrossRef]
  24. Gao, J.; Patil, P.H.; Lu, S.; Cao, D.; Tao, C. Model-Based Test Modeling and Automation Tool for Intelligent Mobile Apps. In Proceedings of the IEEE International Conference on Service-Oriented System Engineering (SOSE), Oxford, UK, 23–26 August 2021; pp. 1–10. [Google Scholar] [CrossRef]
  25. Gao, J.; Li, S.; Tao, C.; He, Y.; Anumalasetty, A.P.; Joseph, E.W.; Nayani, H. An approach to GUI test scenario generation using machine learning. In Proceedings of the 2022 IEEE International Conference on Artificial Intelligence Testing (AITest), Newark, CA, USA, 15–18 August 2022; pp. 79–86. [Google Scholar]
  26. Kurniawan, M.H.; Handiyani, H.; Nuraini, T.; Hariyati, R.T.S.; Sutrisno, S. A systematic review of artificial intelligence-powered (AI-powered) chatbot intervention for managing chronic illness. Ann. Med. 2024, 56, 2302980. [Google Scholar] [CrossRef] [PubMed]
  27. Li, X.; Tao, C.; Gao, J.; Guo, H. A Review of Quality Assurance Research of Dialogue Systems. In Proceedings of the 2022 IEEE International Conference on Artificial Intelligence Testing (AITest), Newark, CA, USA, 15–18 August 2022; pp. 87–94. [Google Scholar] [CrossRef]
  28. Lin, C.-C.; Huang, A.Y.Q.; Yang, S.J.H. A Review of AI-Driven Conversational Chatbots Implementation Methodologies and Challenges (1999–2022). Sustainability 2023, 15, 4012. [Google Scholar] [CrossRef]
  29. Ngai, E.W.; Lee, M.C.; Luo, M.; Chan, P.S.; Liang, T. An intelligent knowledge-based chatbot for customer service. Electron. Commer. Res. Appl. 2021, 50, 101098. [Google Scholar] [CrossRef]
  30. Park, A.; Lee, S.B.; Song, J. Application of AI based Chatbot Technology in the Industry. J. Korea Soc. Comput. Inf. 2020, 25, 17–25. [Google Scholar]
  31. Park, D.M.; Jeong, S.S.; Seo, Y.S. Systematic review on chatbot techniques and applications. J. Inf. Process. Syst. 2022, 18, 26–47. [Google Scholar]
  32. Tran, C.T.P.; Valmiki, M.G.; Xu, G.; Gao, J.Z. An intelligent mobile application testing experience report. J. Phys. Conf. Ser. 2021, 1828, 012080. [Google Scholar] [CrossRef]
  33. Xu, L.; Sanders, L.; Li, K.; Chow, J.C. Chatbot for health care and oncology applications using artificial intelligence and machine learning: Systematic review. J. Med. Internet Res. (JMIR) Cancer 2021, 7, e27850. [Google Scholar] [CrossRef] [PubMed]
  34. Wiech, B.A.; Kourouklis, A.; Johnston, J. Understanding the components of profitability and productivity change at the micro level. Int. J. Product. Perform. Manag. 2020, 69, 1061–1079. [Google Scholar] [CrossRef]
  35. Chung, I.S.; Malcolm, M.; Lee, W.K.; Kwon, Y.R. Applying conventional testing techniques for class testing. In Proceedings of the 20th International Computer Software and Applications Conference: COMPSAC’96, Seoul, Republic of Korea, 19–23 August 1996; pp. 447–454. [Google Scholar]
  36. Pacheco, C.; Lahiri, S.K.; Ernst, M.D.; Ball, T. Feedback-Directed Random Test Generation. In Proceedings of the 29th International Conference on Software Engineering (ICSE’07), Minneapolis, MN, USA, 20–26 May 2007; pp. 75–84. [Google Scholar] [CrossRef]
  37. Schieferdecker, I. Model-Based Testing. IEEE Softw. 2012, 29, 14–18. [Google Scholar] [CrossRef]
  38. Freitas, T.C.; Neto, A.C.; Pereira, M.J.V.; Henriques, P.R. NLP/AI Based Techniques for Programming Exercises Generation. In Proceedings of the 4th International Computer Programming Education Conference (ICPEC 2023), Vila do Conde, Portugal, 26–28 June 2023. [Google Scholar] [CrossRef]
Figure 2. Intelligent chatbot system evaluation metrics.
Figure 3. The 3D AI function test model.
Figure 4. The 3D classification tree model. (a) A single-feature; (b) for selected intelligent chat function features.
Figure 5. A sample of a classification tree for an AI-powered function in the mobile app, Wysa (version 890 v3.7.8). (a) An input classification tree. (b) A context classification tree. (c) An output classification tree.
Figure 6. Chat sample for test modeling. (a) A knowledge and Q&A-based test model. (b) A memory- and subject-oriented test model.
Figure 7. Knowledge and subject-based test modeling. (a) A general domain knowledge model. (b) Knowledge-based model for American history. (c) Subject-based modeling for travel.
Figure 8. Memory-oriented test modeling.
Figure 9. Q&A-based test modeling.
Figure 10. Test model-based test generation.
Figure 11. Similarity evaluation. (a) Language-based similarity evaluation. (b) Integrated language-based similarity evaluation. (c) Keyword-based weighted text similarity evaluation.
Figure 12. A 3D diagram for the general responses.
Figure 13. General response-based testing and analysis. (a) General conversation-based testing. (b) Data visualization of general conversation.
Figure 14. A 3D diagram for the memory-based responses.
Figure 15. Memory-based testing and analysis. (a) Memory-based testing. (b) Data visualization of memory testing.
Figure 16. Emotive reflex-based testing and analysis. (a) Emotive response-based testing. (b) Data visualization of emotion-based conversation.
Figure 17. Sample of a bug report for emotive reflexes.
Figure 18. Q&A-based testing and analysis. (a) Q&A-based testing. (b) Data visualization of Q&A-based conversation.
Table 1. Literature survey based on AI test models.
| Ref | Objective | Automated Test Validation | Test Modeling | Test Generation | Augmentation |
|---|---|---|---|---|---|
| [5] | Testing conversational systems with simulated users (chatbot specialized in financial advice) | No | Simulated user testing | Scenario-based and behavior-driven | Automated testing augmentation |
| [6] | Sequence-to-sequence models with long short-term memory (LSTM) for open-domain dialogue response generation | No | Dialogue generation model and personality model using OCEAN personality traits | Context–response pairs from two TV series (Friends and The Big Bang Theory) | Automatically introduces variations in the input |
| [7] | A metamorphic testing approach for chatbots that obtains sequences of interactions with a chatbot | No | Metamorphic testing | Metamorphic relations to guide the generation of test cases and an initial set of inputs | Constructing grammars in Backus–Naur form (BNF), mutations, and functional correctness |
| [8] | Testing chatbots with Charm | No | Used Botium to test coherence, sturdiness, and precision | Conversation generation, utterance mutation | Fuzzing/mutation, iteration, and improvement |
| [9] | Turing test applied to AI chatbots to examine how chatbots behave in a suite of classic behavioral games using ChatGPT-4 | No | Behavioral and personality testing | Classic behavioral games | Personality assessments using the Big-5 personality survey |
| Our purpose | Intelligent AI 3D test modeling, test generation, data augmentation, and test result validation for chat systems in Wysa, a mental coach chatbot | Yes (AI-based test validation) | 3-dimensional AI test modeling | AI-based model and NLP/ML model | AI model-based positive and negative augmentation |
Table 2. A comparison between AI testing, AI-based software testing, and the conventional software testing methods.
| Items | AI Testing | AI-Based Software Testing | Conventional Software Testing |
|---|---|---|---|
| Objectives | Validate and assure the quality of AI software and systems by focusing on system AI functions and features | Leverage AI techniques and solutions to optimize a software testing process and its quality | Assure the system function quality for conventional software and its features |
| Primary AI Testing Focuses | AI feature quality factors: accuracy, consistency, completeness, and performance | Optimize a test process for product quality increase, testing efficiency, and cost reduction | Automate test operations for a conventional software process |
| System Function Testing | AI system function testing: object detection and classification, recommendation and prediction, language translation | System functions, behavior, user interfaces | System functions, behavior, user interfaces |
| Test Selection | AI test model-based test selection, classification, and recommendation | Test selection, classification, and recommendation using AI techniques | Rule-based and/or experience-based test selection |
| Test Data Generation | AI test model-based data discovery, collection, generation, and validation | AI-based data collection, classification, and generation | Model-based and/or pattern-based test generation |
| Bug Detection and Analysis | AI model-based bug detection, analysis, and reporting | Data-driven analysis for bug classification, detection, and prediction | Digital and systematic bug/problem management |
Table 3. Wysa AI feature scope.
Concept | Description
Mood Tracking and Analysis | Tracking and analyzing an individual's mood or emotional state over time. It involves recording and assessing mood patterns, fluctuations, and trends to gain insights into emotional well-being.
Self-care Analysis | Analyzing and evaluating an individual's self-care practices and habits. It involves assessing activities related to physical, mental, and emotional well-being, such as exercise, sleep, relaxation techniques, and mindfulness practices.
Conversational Support Analysis | Analyzing the effectiveness and impact of conversational support provided by chatbots or AI systems. It involves evaluating the quality, empathy, and appropriateness of responses to users' emotional or support-related queries or needs.
Goal Setting | Setting specific, measurable, attainable, relevant, and time-bound (SMART) goals to promote personal growth and well-being. It involves identifying areas of improvement, defining objectives, and establishing action plans to achieve desired outcomes.
Sentiment Analysis | Analyzing and determining the sentiment or emotional tone expressed in text or conversations. It involves classifying text as positive, negative, or neutral to understand the overall sentiment or attitude conveyed.
Well-being Resources and Personalized Intervention | Providing personalized resources, recommendations, or interventions to support individual well-being. It involves offering tailored suggestions, activities, or resources based on an individual's needs, preferences, or identified areas of improvement.
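To make the sentiment-analysis feature concrete — classifying text as positive, negative, or neutral — the following is a minimal lexicon-based sketch. The word lists and scoring rule are illustrative assumptions for exposition only; Wysa's actual feature relies on trained NLP/ML models rather than a hand-built lexicon.

```python
# Minimal lexicon-based sentiment classifier (illustrative only).
POSITIVE = {"happy", "great", "calm", "glad", "better"}
NEGATIVE = {"sad", "anxious", "angry", "worse", "stressed"}

def classify_sentiment(text):
    # Score = count of positive words minus count of negative words.
    words = set(text.lower().replace(".", " ").split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_sentiment("I feel happy and calm today"))  # positive
print(classify_sentiment("I am sad and anxious"))         # negative
print(classify_sentiment("I went for a walk"))            # neutral
```

Even this toy version exposes the testing question the paper addresses: each input utterance needs an expected sentiment label, and mutated inputs must still map to a defensible label.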
Table 4. Negative data augmentation.
Text Input | Augmentation | Text Output
I am Sad | random character insert | I am aSad
I am Sad | random character swap | I am Sda
I am Sad | random character delete | I am ad
I am Sad | random word swap | Sad am I
I am Sad | random word delete | am Sad
I am Sad | OCR augmentation | 1 am Sad
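The six negative augmentations in Table 4 can be sketched with plain string operations, as below. This is a minimal self-contained sketch, not the toolchain used in the study (libraries such as nlpaug provide equivalent character-, word-, and OCR-level augmenters); the OCR confusion table is an illustrative assumption.

```python
import random

def char_insert(text, rng):
    # Insert one random lowercase character at a random position.
    i = rng.randrange(len(text) + 1)
    return text[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + text[i:]

def char_swap(text, rng):
    # Swap two adjacent characters at a random position.
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def char_delete(text, rng):
    # Delete one character at a random position.
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]

def word_swap(text, rng):
    # Swap two randomly chosen words.
    words = text.split()
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)

def word_delete(text, rng):
    # Delete one randomly chosen word.
    words = text.split()
    words.pop(rng.randrange(len(words)))
    return " ".join(words)

# Illustrative OCR confusion pairs (e.g., "I am Sad" -> "1 am Sad").
OCR_CONFUSIONS = {"I": "1", "l": "1", "O": "0", "o": "0", "S": "5"}

def ocr_augment(text):
    # Replace the first matching character with a common OCR misread.
    for src, dst in OCR_CONFUSIONS.items():
        if src in text:
            return text.replace(src, dst, 1)
    return text

rng = random.Random(42)  # fixed seed for reproducible test data
seed_utterance = "I am Sad"
augmented = [
    char_insert(seed_utterance, rng),
    char_swap(seed_utterance, rng),
    char_delete(seed_utterance, rng),
    word_swap(seed_utterance, rng),
    word_delete(seed_utterance, rng),
    ocr_augment(seed_utterance),
]
for variant in augmented:
    print(variant)
```

Feeding such perturbed utterances to the chatbot checks that mood detection degrades gracefully under noisy, typo-laden input rather than only on clean text.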
Gao, J.; Agarwal, R.; Garsole, P. AI Testing for Intelligent Chatbots—A Case Study. Software 2025, 4, 12. https://doi.org/10.3390/software4020012
