AI Testing for Intelligent Chatbots—A Case Study
Abstract
1. Introduction
- Development based on NLP and machine learning models trained on big data.
- Rich media inputs and outputs: text, audio, and images.
- Text-to-speech and speech-to-text functions.
- Text generation, synthesis, and analysis capabilities.
- Uncertainty and non-determinism in response generation.
- Understanding selected languages and diverse questions and generating responses with diverse linguistic skills.
These special features bring many issues and challenges in the quality testing and evaluation of intelligent chat systems (like chatbots).
- Issue 1—Lack of quality validation criteria and quality assurance standards leads to difficulty in establishing well-defined, clear, and measurable quality testing requirements.
- Issue 2—Lack of well-defined systematic quality testing methods and solutions.
- Issue 3—Lack of automatic test tools with well-defined, adequate quality test coverage.
- Issue 4—Lack of automatic adequate quality test coverage analysis techniques and solutions.
- Application of AI-based 3-dimensional test modeling, providing a reference classification test model for chatbot systems.
- Discussion of various test generation approaches for smart chatbot systems.
- AI to support test augmentation (positive and negative) for chatbots.
- Validation and adequacy of the test result using an AI-based approach for smart chatbot systems.
- Results and analysis of the AI test, specifically for the Wysa mobile app.
2. Literature Review
3. Understanding Testing Intelligent Chatbot Systems
- Objective 1—Validating the system chat intelligence and functions. It focuses on checking how well an intelligent chat system can interact with users to accept and correctly process incoming inputs in text/image/audio and generate the appropriate responses/answers. These interactive chats must be validated in a predefined domain-specific knowledge scope, including content subjects, topics, Q&A patterns, and interaction flows based on the predefined chat intelligence and functions at the system level.
- Objective 2—Measuring and assuring system non-functional quality by evaluating its QoS parameters from different perspectives: (a) language-oriented perception, diversity, and similarity; (b) system-oriented parameters, such as security, reliability, scalability, availability, and performance; and (c) chatting-related parameters, such as user satisfaction, response accuracy/relevance, content correctness, and consistency.
- Objective 3—Evaluating the system to see if it is trustworthy for users and customers based on user-oriented quality validation and evaluation.
3.1. Testing Approach
- Limited data training and validation: Most of our tested mobile apps with AI features are built based on machine learning models and techniques, and trained and validated with limited input datasets under ad hoc contexts.
- Data-driven learning features: Many mobile apps with learning features provide static and/or dynamic learning capabilities that affect the under-test software outcomes, results, and actions.
- Uncertainty in system outputs, responses, or actions: Since existing AI-based models depend on statistical algorithms, the outcomes of AI software are uncertain. We have observed many mobile apps with AI functions generate inconsistent outcomes for the same input test data when context conditions change.
- Conventional Testing Methods—Using conventional software testing methods to validate any given smart chatbots and applications online or on mobile devices/emulators. These include scenario-based testing, boundary value testing, decision table testing, category partition testing, and equivalence partition testing.
- Crowd-Sourced Testing—Using crowd-sourced testers (freelance testers) to perform user-oriented testing for given smart chatbots and systems. They usually use ad hoc approaches to validate the given systems as users.
- Smart Chat Model-Based Testing—Using model-based approaches to establish smart chat models to enable test case and data generation, and support test automation and coverage analysis.
- AI-Based Testing—Using AI techniques and machine learning models to develop AI-based solutions to optimize the smart chatbot system quality test process and automation.
- NLP/ML Model-Based Testing—Using white-box or gray-box approaches to discover, derive, and generate test cases and data focusing on diverse ML models, related structures, and coverage.
- Rule-based testing—This methodology employs predefined rules to generate tests and data for intelligent chat systems. While effective for traditional rule-based chat systems, it faces numerous challenges in addressing modern intelligent chat systems due to their unique characteristics and the complexities introduced by NLP-based machine learning models and big data-driven training.
- System Testing—In this strategy, quality of service (QoS) parameters at the system level will be chosen, assessed, and tested using clearly defined evaluation metrics. This aims to quantitatively validate both system-level functions and AI-powered features. Common system QoS factors encompass reliability, availability, performance, security, scalability, speed, correctness, efficiency, and user satisfaction. AI-specific QoS parameters typically involve accuracy, consistency, relevance, as well as loss and failure rates.
- Learning-based testing—In this method, the activities and tests conducted by human testers are monitored, and their test cases, data, and operational patterns are collected, analyzed, and learned from to understand how they design tests and data for a specific smart chat system. Additionally, any bugs they uncover can be learned from and utilized as future test cases.
- Metamorphic Testing—Metamorphic testing (MT) is a property-based software testing technique that can effectively address both the test oracle problem and the test case generation problem. A minimal sketch follows this list.
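To make the MT idea concrete, the following minimal Python sketch checks one paraphrase metamorphic relation: semantically equivalent inputs should yield similar chatbot responses, so no per-test expected output is needed. The `chatbot_reply` stub, the lexical similarity measure, and the threshold are illustrative assumptions, not the tooling used in this study.

```python
# Minimal metamorphic-testing sketch for a chatbot (illustrative only).
from difflib import SequenceMatcher

def chatbot_reply(utterance: str) -> str:
    # Placeholder: in practice this calls the chatbot under test.
    return "I'm sorry you feel that way. Would you like a breathing exercise?"

def similarity(a: str, b: str) -> float:
    """Cheap lexical similarity in [0, 1]; a semantic model could be swapped in."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def check_paraphrase_relation(source: str, paraphrase: str, threshold: float = 0.6) -> bool:
    """Metamorphic relation: paraphrased inputs should yield similar responses.
    Only the relation between the two outputs is checked, not an exact oracle."""
    r1, r2 = chatbot_reply(source), chatbot_reply(paraphrase)
    return similarity(r1, r2) >= threshold

assert check_paraphrase_relation("I am sad", "I'm feeling sad")
```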
3.2. Testing Scope
- Domain Knowledge—Validate a smart chatbot system to see how well it has been trained on the selected domain knowledge and how well it demonstrates that knowledge at different levels during chats and communication with clients.
- Chatflow Patterns—Validate a smart chatbot system to see how well it is capable of carrying out chats in diverse chatting flow patterns.
- Chat Memory—Validate the chatbot’s memory capability regarding clients’ profiles, cases, chat history, and so on.
- Q&A—Validate the chatbot’s Q&A intelligence to see how well an under-test smart chat system is capable of understanding the questions and responses from clients and providing the responses like a human agent.
- Chat Subject—Validate the chatbot’s intelligence in subject chatting to see how well an under-test chatbot system is capable of chatting with clients on selected subjects.
- Language—Validate a smart chatbot’s language skills in communicating with clients in a selected natural language.
- Text-to-speech—Check how well the under-test chatbot system can convert text to speech correctly.
- Speech-to-text—Check how well the under-test chatbot system can convert speech audio to text correctly. A round-trip check covering this and the previous item is sketched after this list.
- Accuracy—To make sure that the system can accurately process and understand NLP and rich media inputs; generate accurate chat responses in a variety of subject contents, languages, and linguistics using a range of tests with a variety of chat patterns and flows; and assess system accuracy based on well-defined test models and accuracy metrics.
- Consistency—To ensure that the system can consistently process and understand a wide range of NLP, rich media inputs, and client attention, as well as generate consistent chat responses in a variety of subject contents, languages, and linguistics using a variety of tests with chat patterns and flows, and evaluate system consistency based on well-defined test models and consistency metrics.
- Relevance—Use established testing models and profitability metrics [34] to assess system relevance. This ensures the system can comprehend and process a wide array of NLP and multimedia inputs, identify pertinent subjects and concerns, and produce appropriate chat responses across diverse domains, subjects, languages, and linguistic variations through varied testing methodologies.
- Correctness—Assess the accuracy of a chat system’s processing and comprehension of NLP and/or rich media inputs, its ability to provide accurate replies related to domain subjects, contents, and language linguistics, using well-defined test models and metrics.
- Availability—Ensure system availability based on clearly defined parameters at various levels, such as the underlying cloud infrastructure, the enabling platform environment, the targeted chat application SaaS, and user-oriented chat SaaS.
- Security—Assess the security of the system by utilizing specific security metrics to examine the chat system’s security from various angles. This includes scrutinizing its cloud infrastructure, platform environment, client application SaaS, user authentication methods, and end-to-end chat session security.
- Reliability—Assess the reliability of the system by employing established reliability metrics at various tiers. This encompasses evaluating the reliability of the underlying cloud infrastructure, the deployed and hosted platform environment, and the chat application SaaS.
- Scalability—Evaluate system scalability based on well-defined scalability metrics in different perspectives, including deployed cloud-based infrastructure, hosted platform, intelligent chat application, large-scale chatting data volume, and user-oriented large-scale accesses.
- User satisfaction—Assess user satisfaction with the system by employing well-defined metrics from various angles. This includes analyzing user reviews, rankings, chat interactions, session success rates, and goal completion rates.
- Linguistics diversity—Assess the linguistic diversity of the intelligent chat system in its ability to support and process various linguistic inputs, including diverse vocabularies, idioms, phrases, and different types of client questions and responses. Additionally, evaluate the system’s linguistic diversity in the responses it generates, considering domain content, subject matter, language syntax, semantics, and expression patterns.
- Performance—Assess the performance of the system using clearly defined metrics related to system and user response times, processing times for NLP-based and/or rich media inputs, and the time taken for generating chat responses.
- Perception and understanding—Assess the system’s perception and comprehension of diverse inputs and rich media content from its clients using established test models and perception metrics.
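For the text-to-speech and speech-to-text scopes above, one simple automated check is a round trip: synthesize speech from a sentence, transcribe it back, and compare the transcript with the original text. The sketch below assumes hypothetical `text_to_speech` and `speech_to_text` adapter functions wrapped around whichever engines the under-test system exposes; it is not the paper's tooling.

```python
# Hypothetical TTS/STT round-trip check; both adapters are stubs to be
# wired to the engines of the chatbot system under test.
from difflib import SequenceMatcher

def text_to_speech(text: str) -> bytes:
    raise NotImplementedError  # wrap the under-test TTS engine here

def speech_to_text(audio: bytes) -> str:
    raise NotImplementedError  # wrap the under-test STT engine here

def round_trip_score(sentence: str) -> float:
    """TTS followed by STT should approximately reproduce the input text."""
    audio = text_to_speech(sentence)
    transcript = speech_to_text(audio)
    return SequenceMatcher(None, sentence.lower(), transcript.lower()).ratio()

# Example usage: flag sentences whose round-trip similarity drops below 0.9.
# failures = [s for s in test_sentences if round_trip_score(s) < 0.9]
```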
4. AI Test Modeling for Intelligent Chatbot Systems
4.1. AI Test Modeling and Analysis
- The 3D Intelligent Chat Test Modeling Process I—In this process, for the selected intelligent chat function feature, set up a 3D classification tree model in three steps. (1) Define an intelligent chat input context classification tree model to represent diverse context classifications. (2) Define an intelligent chat input classification tree model to represent diverse chat input classifications. (3) Define an intelligent chat output classification tree model to represent diverse chat output classifications. Figure 4a shows the schematic diagram for a single-feature 3D classification tree model. A code sketch of this process follows the list.
- The 3D Intelligent Chat Test Modeling Process II—For the selected intelligent chat function features, set up a 3D classification forest tree model in three steps. (1) Define an intelligent chat input context classification tree model to represent diverse context classifications. (2) For each selected intelligent chat AI feature, define one intelligent chat input classification tree model to represent diverse chat input classifications. One input classification decision table is generated based on the defined input tree model. As a result, a chat input forest is created, and a set of input decision tables are generated. (3) For each selected intelligent chat AI feature, define one intelligent chat output classification tree model to represent diverse chat output classifications. One output classification decision table is generated based on the defined output tree model. As a result, a chat output forest is created, and a set of output decision tables is generated. Figure 4b shows the schematic diagram for the 3D classification forest tree model for selected features of intelligent chat functions.
- The 3D Intelligent Chat Test Modeling Process III—For each selected intelligent chat function feature, set up a 3D classification tree model in three steps similar to that of process I. A corresponding 3D decision table will be generated. After a modeling process, a set of 3D tree models will be derived, and related 3D decision tables will be generated.
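As an illustration of Process I (referenced above), the sketch below flattens three toy classification trees into leaf classifications and derives the 3D decision table as their cross-product. All classification labels here are invented examples; the actual models for the case study appear in Figure 4.

```python
# Illustrative sketch of Process I: derive a 3D decision table as the
# cross-product of leaf classifications from the three trees.
from itertools import product

context_tree = {"time": ["morning", "night"], "platform": ["iOS", "Android"]}
input_tree   = {"modality": ["text", "audio"], "intent": ["greeting", "help"]}
output_tree  = {"form": ["text reply", "exercise card"]}

def leaves(tree: dict) -> list[tuple]:
    """Flatten a classification tree into its leaf combinations (one per branch)."""
    return list(product(*tree.values()))

# Each rule T(CT-x, IT-y, OT-z) pairs one context, input, and output class.
decision_table = [
    {"context": c, "input": i, "output": o}
    for c, i, o in product(leaves(context_tree), leaves(input_tree), leaves(output_tree))
]
print(len(decision_table))  # 4 contexts x 4 inputs x 2 outputs = 32 rules
```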
4.2. AI-Based Test Automation for Mobile Chatbot—WYSA (version 890 v3.7.8)
4.3. Classified Test Modeling for Intelligent Chat System
- Knowledge-based test modeling—This is useful to establish test models focusing on selected domain-specific knowledge of a given intelligent chat system to see how well it understands the received domain-specific questions and provides proper domain-related responses. For service-oriented domain intelligent chat systems, one could consider diverse knowledge of products, services, programs, and customers. Figure 7a shows a knowledge-based test model, while Figure 7b shows a specific example where American history is considered the knowledge domain and has topics and subtopics as shown in the figure.
- Subject-oriented test modeling—This is useful to establish test models focusing on a selected conversation subject for a given intelligent chat system to see how well it understands the questions and provides proper responses on diverse selected subjects in intelligent chat systems. Typical conversational subjects include travel, driving directions, sports, and so on. Figure 7c shows an elaborate subject-based test modeling for travel and certain queries that a person generally has about it.
- Memory-oriented test modeling—This is useful to establish test models for validating the memory capability of a given intelligent chat system to see how well it remembers users’ profiles, questions, and related chats, as well as interactions. Figure 8 shows the text-based input tree for short- and long-term memory.
- Q&A pattern test modeling—This is useful to establish test models for validating the diverse question and answer capability of a given intelligent chat system to see how well it handles diverse questions from clients and generates different types of responses. Figure 9 shows the text-based input tree for the question and answer pattern.
- Chat pattern test modeling—This is useful to establish test models for validating diverse chat patterns for a given intelligent chat system to see how well it handles diverse chat patterns and interactive flows.
- Linguistics test modeling—This is useful to establish test models to validate language skills and linguistic diversity for a given intelligent chat system. Four aspects of test modeling for linguistics include diverse lexical items, different types of sentences, semantics, and syntax.
5. Test Generation and Data Augmentation for Intelligent Chatbot Systems
5.1. Test Generation
- Conventional Test Generation—The conventional testing criteria, including branch coverage, have been discussed in detail earlier [35]. It is simple and easy to use conventional software testing methods to generate tests for a selected smart chatbot system, but there are certain limitations to this method, i.e., (1) it is difficult to perform test result validation using a systematic way, (2) it is hard to evaluate adequate test coverage, and (3) costs to generate chat input tests manually are high.
- Random Test Generation—Using a random generator to select random chatbot tests as inputs from a given chatbot test DB to validate the under-test intelligent chatbot system. Random text generation can produce unique and engaging text for various applications, from creative writing to marketing content. Feedback-directed random test generation, which refines test selection based on the results of executing test inputs, has also been studied [36]. A selection sketch follows this list.
- Test Model-Based Test Generation—This is a model-based approach to enable automatic test generation for each targeted AI-powered intelligent chat function/feature [37]. It supports model-based test coverage analysis and complexity assessment, which is important because AI-based test data augmentation solutions and result validation are needed. This paper uses model-based test generation, shown in Figure 10, which illustrates how a 3D classified tree model is used to obtain a decision table from context, input, and output classifications; these are further segmented to validate the AI-based test results.
- AI-Based Test Generation—Using AI techniques and machine learning models to generate augmented tests and test data for each selected test case. There are two types of data generated: (1) synthetic data, which are generated artificially without using real-world images; and (2) augmented data, which are derived from original images with some sort of minor geometric transformations (such as flipping, translation, rotation, or the addition of noise) to increase the diversity of the training set. This method also comes with its challenges, including (1) the cost of quality assurance of the augmented datasets; (2) research and development to build synthetic data with advanced applications; and (3) the inherent bias of original data persisting in augmented data.
- NLP/ML Model Test Generation—It uses white-box or gray-box approaches to discover, derive, and generate test cases and data based on each underlying NLP/ML model to achieve model-based structures and coverage. NLP/ML models have been used in enhancing the computer programming exercise generation in the automatic generation of text and source code for the benefit of both students and teachers [38].
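The following sketch (referenced under Random Test Generation above) randomly samples tests from a small chatbot test DB and keeps failing cases as feedback for a later round, loosely in the spirit of feedback-directed generation [36]. The test records, the keyword oracle, and the `run_chatbot` placeholder are all assumptions for illustration.

```python
import random

# Illustrative chatbot test DB; records and expected keywords are assumptions.
test_db = [
    {"input": "I am sad", "expected_keywords": ["sorry", "support"]},
    {"input": "Book me a flight", "expected_keywords": ["wellness", "cannot"]},
    {"input": "I feel anxious", "expected_keywords": ["breath", "calm"]},
]

def run_chatbot(utterance: str) -> str:
    # Placeholder: in practice this calls the chatbot under test.
    return "I'm sorry to hear that. I'm here to support you."

def random_test_round(n: int, seed: int = 42) -> list[dict]:
    """Randomly sample n tests; return failing cases as feedback for the next round."""
    rng = random.Random(seed)
    failures = []
    for case in rng.sample(test_db, min(n, len(test_db))):
        reply = run_chatbot(case["input"]).lower()
        if not any(kw in reply for kw in case["expected_keywords"]):
            failures.append(case)
    return failures

print(random_test_round(3))  # the flight and anxiety cases fail this keyword oracle
```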
5.2. Data Augmentation
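The augmentation examples tabulated later in this document apply operations such as random character insert/swap/delete and random word swap/delete to the input “I am Sad”. A minimal, seeded Python sketch of those operations follows; real augmenters (including the OCR-noise variant) would be more elaborate.

```python
# Minimal versions of the text augmentations listed in the augmentation table.
import random

rng = random.Random(0)  # seeded for reproducibility

def char_insert(s: str) -> str:
    i = rng.randrange(len(s) + 1)
    return s[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + s[i:]

def char_swap(s: str) -> str:
    i = rng.randrange(len(s) - 1)
    return s[:i] + s[i + 1] + s[i] + s[i + 2:]

def word_swap(s: str) -> str:
    w = s.split()
    i, j = rng.sample(range(len(w)), 2)
    w[i], w[j] = w[j], w[i]
    return " ".join(w)

def word_delete(s: str) -> str:
    w = s.split()
    del w[rng.randrange(len(w))]
    return " ".join(w)

for aug in (char_insert, char_swap, word_swap, word_delete):
    print(aug.__name__, "->", aug("I am Sad"))
```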
6. AI-Based Test Result Validation and Adequacy Approaches for Smart Chatbot Systems
- Conventional Testing Oracle—A test oracle is a mechanism that can be used to check the correctness of a program’s output for test cases. Conceptually, consider a testing process in which test cases are supplied both to the oracle and to the program under test. The outputs of the two are then compared to determine whether the program behaves correctly for the test cases. Ideally, an automated oracle is required, which always gives the correct answer. However, oracles are often human beings who mostly calculate the program’s expected output manually. Consequently, when there is a discrepancy between the program’s output and the oracle’s result, we must verify the result produced by the oracle before declaring that there is a defect in the program.
- Text-Based Result Similarity Checking—For text-based chat outputs, AI-based approaches can be used to perform text similarity analysis to determine if the intelligent chat test results are acceptable. Figure 11a,b show the language-based similarity evaluation. Lexical similarity measures the similarity of two texts based on the intersection of their word sets from the same or different languages. Lexical similarity scores range from 0, indicating no common terms between the two texts, to 1, indicating total overlap between the vocabularies. Figure 11c shows the second approach, keyword-based weighted text similarity evaluation with its similarity formula. A sketch of both measures follows this list.
- Image-Based Similarity Checking—For image-based outputs, AI-based approaches can be used to perform image similarity analysis to determine if the intelligent chat test results are acceptable. Diverse machine learning algorithms and deep learning models can be used to compute the similarity at the different levels and perspectives between an expected output image and a real output image from the under-test chatbot system, including objects/types, features, positions, scales, and colors.
- Audio-Based Result Similarity Checking—For audio-based outputs, AI-based approaches can be used to perform audio similarity analysis to determine if the intelligent chat test results are acceptable. Diverse machine learning algorithms and deep learning models can be used to compute the similarity (at different levels and perspectives) between an expected output audio and the real output audio from the under-test chatbot system, including audio sources/types, audio features, timing, frequencies, noises, and so on.
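The sketch below implements the lexical similarity described above (word-set intersection over union), together with one common keyword-weighted variant. The exact weighted formula of Figure 11c is not reproduced here, so the weighting scheme is an assumption.

```python
# Lexical (Jaccard) similarity plus an assumed keyword-weighted variant.
def lexical_similarity(a: str, b: str) -> float:
    """|intersection| / |union| of word sets: 0 = no shared terms, 1 = identical vocab."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def weighted_similarity(a: str, b: str, weights: dict[str, float]) -> float:
    """Overlap score where domain keywords count more than ordinary words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    total = lambda ws: sum(weights.get(t, 1.0) for t in ws)
    return total(wa & wb) / total(wa | wb) if wa | wb else 1.0

expected = "Try a short breathing exercise to calm anxiety"
actual   = "A brief breathing exercise may help with anxiety"
print(lexical_similarity(expected, actual))                                   # ~0.33
print(weighted_similarity(expected, actual, {"breathing": 3.0, "anxiety": 3.0}))  # 0.5
```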
Model-Based Test Coverage for Smart Chatbot System
- Three-dimensional AI test classification decision table test coverage—To achieve this coverage, the test set (3DT-Set) must include at least one test case for each 3D element T (CT-x, IT-y, OT-z) in the 3D AI test classification decision table.
- Context classification decision table test coverage—To achieve this coverage, the test set (3DT-Set) must include at least one test case for each rule in a context classification decision table.
- Input classification decision table test coverage—To achieve this coverage, the test set (3DT-Set) must include at least one test case for each rule in an input classification decision table.
- Output classification decision table test coverage—To achieve this coverage, the test set (3DT-Set) must include at least one test case for each rule in an output classification decision table. A coverage-computation sketch follows this list.
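As a small sketch of these criteria: given the set of rules in a decision table and the rules exercised by a test set, the ratio below reports coverage, and projecting 3D rules onto one axis yields the per-table criteria. Rule identifiers follow the T (CT-x, IT-y, OT-z) naming used above; the values are illustrative.

```python
# Decision-table coverage for a 3DT-Set; rule identifiers are illustrative.
def coverage(all_rules: set, covered_rules: set) -> float:
    return len(covered_rules & all_rules) / len(all_rules)

# 3D rules T(CT-x, IT-y, OT-z).
all_rules = {("CT-1", "IT-1", "OT-1"), ("CT-1", "IT-2", "OT-1"),
             ("CT-2", "IT-1", "OT-2"), ("CT-2", "IT-2", "OT-2")}
executed  = {("CT-1", "IT-1", "OT-1"), ("CT-2", "IT-2", "OT-2")}

print(f"3D decision-table coverage: {coverage(all_rules, executed):.0%}")  # 50%

# Projecting rules onto one axis gives the context/input/output table coverage.
context_cov = coverage({r[0] for r in all_rules}, {r[0] for r in executed})
print(f"Context decision-table coverage: {context_cov:.0%}")  # 100%
```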
7. AI Test Result and Analysis
7.1. Data Collection and Preprocessing
- User interactions from the Wysa chatbot, collected over 1 month.
- Simulated test cases generated based on predefined user intents and real-world chatbot interactions.
- Augmented data using NLP-based text transformations, including synonym replacement, word swapping, and OCR-induced variations to simulate noisy inputs.
- Multimodal inputs (text, voice, and emotive indicators) processed using AI-based test scripts collected via the Wysa mobile app.
- Text Normalization: Removing stopwords, punctuation, and special characters.
- Sentiment Analysis: Labeling chatbot responses based on emotional tone (positive, neutral, negative).
- Context Embedding: Mapping conversation sequences into vectorized representations for contextual accuracy measurement. A minimal sketch of these three preprocessing steps follows this list.
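Below is a hedged sketch of the three preprocessing steps above. The stopword list, sentiment lexicon, and bag-of-words "embedding" are toy stand-ins for the NLP components actually used in the study.

```python
# Toy preprocessing pipeline: normalization, sentiment labeling, embedding.
import re
from collections import Counter

STOPWORDS = {"i", "am", "a", "the", "to", "and", "is"}          # assumed list
POSITIVE, NEGATIVE = {"happy", "calm", "better"}, {"sad", "anxious", "worse"}

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation/special characters, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def sentiment(tokens: list[str]) -> str:
    """Label by emotional tone: positive, negative, or neutral."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def embed(tokens: list[str]) -> Counter:
    """Bag-of-words vector standing in for a contextual embedding."""
    return Counter(tokens)

tokens = normalize("I am feeling sad and anxious today.")
print(tokens, sentiment(tokens), embed(tokens))
```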
7.2. General Responses
7.3. Memory-Based Responses
7.4. Emotive Reflexes
7.5. Q&A Interactions
8. Conclusions and Future Scope
- Automation of 3D AI Testing Pipelines—Developing automated tools that integrate test generation, augmentation, and validation within a continuous testing framework.
- Incorporation of Reinforcement Learning—Enhancing chatbot adaptability by integrating reinforcement learning-based self-improving mechanisms for refining responses.
- Contextual and Long-Term Memory Testing—Improving test models to evaluate how well chatbots retain and apply past interactions over extended periods.
- Multimodal AI Testing—Expanding the 3D model to include voice, gesture, and visual input validation, making it applicable to voice assistants and AR/VR interfaces.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Businesswire. Global Chatbot Market Value to Increase by $1.11 Billion during 2020–2024|Business Continuity Plan and Forecast for the New Normal|Technavio. Available online: https://www.businesswire.com/news/home/20201207005691/en/Global-Chatbot-Market-Value-to-Increase-by-1.11-Billion-during-2020-2024-Business-Continuity-Plan-and-Forecast-for-the-New-Normal-Technavio (accessed on 15 December 2024).
- Ni, J.; Young, T.; Pandelea, V.; Xue, F.; Adiga, V.; Cambria, E. Recent Advances in Deep Learning-based Dialogue Systems. arXiv 2021, arXiv:2105.04387. [Google Scholar]
- Tao, C.; Gao, J.; Wang, T. Testing and Quality Validation for AI Software—Perspectives, Issues, and Practices. IEEE Access 2019, 7, 120164–120175. [Google Scholar] [CrossRef]
- Gao, J.; Garsole, P.; Agarwal, R.; Liu, S. AI Test Modeling and Analysis for Intelligent Chatbot Mobile App—A Case Study on Wysa. In Proceedings of the 2024 IEEE International Conference on Artificial Intelligence Testing (AITest), Shanghai, China, 15–18 July 2024; pp. 132–141. [Google Scholar] [CrossRef]
- Vasconcelos, M.; Candello, H.; Pinhanez, C.; Santos, T.D. Bottester: Testing Conversational Systems with Simulated Users. In Proceedings of the XVI Brazilian Symposium on Human Factors in Computing Systems, Joinville, Brazil, 23–27 October 2017; pp. 1–4. [Google Scholar] [CrossRef]
- Xing, Y.; Fernández, R. Automatic Evaluation of Neural Personality-based Chatbots. In Proceedings of the 11th International Conference on Natural Language Generation, Tilburg, The Netherlands, 5–8 November 2018; pp. 189–194. [Google Scholar] [CrossRef]
- Bozic, J.; Wotawa, F. Testing Chatbots Using Metamorphic Relations. In Testing Software and Systems; ICTSS 2019; Lecture Notes in Computer Science; Gaston, C., Kosmatov, N., Le Gall, P., Eds.; Springer: Cham, Switzerland, 2019; Volume 11812. [Google Scholar] [CrossRef]
- Bravo-Santos, S.; Guerra, E.; de Lara, J. Testing Chatbots with Charm. In Quality of Information and Communications Technology; QUATIC 2020; Communications in Computer and Information Science; Shepperd, M., Brito e Abreu, F., Rodrigues da Silva, A., Pérez-Castillo, R., Eds.; Springer: Cham, Switzerland, 2020; Volume 1266. [Google Scholar] [CrossRef]
- Mei, Q.; Xie, Y.; Yuan, W.; Jackson, M.O. A Turing test of whether AI chatbots are behaviorally similar to humans. Proc. Natl. Acad. Sci. USA 2024, 121, e2313925121. [Google Scholar] [CrossRef] [PubMed]
- Bozic, J.; Tazl, O.A.; Wotawa, F. Chatbot Testing Using AI Planning. In Proceedings of the 2019 IEEE International Conference on Artificial Intelligence Testing (AITest), Newark, CA, USA, 4–9 April 2019; pp. 37–44. [Google Scholar] [CrossRef]
- Ruane, E.; Faure, T.; Smith, R.; Bean, D.; Carson-Berndsen, J.; Ventresque, A. BoTest: A Framework to Test the Quality of Conversational Agents Using Divergent Input Examples. In Proceedings of the 23rd International Conference on Intelligent User Interfaces Companion, Tokyo, Japan, 7–11 March 2018; pp. 1–2. [Google Scholar] [CrossRef]
- Guichard, J.; Ruane, E.; Smith, R.; Bean, D.; Ventresque, A. Assessing the Robustness of Conversational Agents using Paraphrases. In Proceedings of the 2019 IEEE International Conference on Artificial Intelligence Testing (AITest), Newark, CA, USA, 4–9 April 2019; pp. 55–62. [Google Scholar] [CrossRef]
- Kaleem, M.; Alobadi, O.; O’Shea, J.; Crockett, K. Framework for the formulation of metrics for conversational agent evaluation. In Proceedings of the RE-WOCHAT: Workshop on Collecting and Generating Resources for Chatbots and Conversational Agents-Development and Evaluation Workshop Programme, Portorož, Slovenia, 28 May 2016; Volume 20, pp. 20–23. [Google Scholar]
- Nick, M.; Tautz, C. Practical evaluation of an organizational memory using the goal-question-metric technique. In Proceedings of the Biannual German Conference on Knowledge-Based Systems, Würzburg, Germany, 3–5 March 1999; pp. 138–147. [Google Scholar]
- Padmanabhan, M. Test Path Identification for Virtual Assistants Based on a Chatbot Flow Specifications. In Soft Computing for Problem Solving. Advances in Intelligent Systems and Computing; Das, K.N., Bansal, J.C., Deep, K., Nagar, A.K., Pathipooranam, P., Naidu, R.C., Eds.; Springer: Singapore, 2020; Volume 1057. [Google Scholar] [CrossRef]
- Uc-Cetina, V.; Navarro-Guerrero, N.; Martin-Gonzalez, A. Survey on reinforcement learning for language processing. Artif. Intell. Rev. 2023, 56, 1543–1575. [Google Scholar] [CrossRef]
- Liu, Y.; Wang, H.; Zhou, H.; Li, M.; Hou, Y.; Zhou, S.; Wang, F.; Hoetzlein, R.; Zhang, R. A review of reinforcement learning for natural language processing and applications in healthcare. J. Am. Med. Inform. Assoc. 2024, 31, 2379–2393. [Google Scholar] [CrossRef] [PubMed]
- Aslam, F. The impact of artificial intelligence on chatbot technology: A study on the current advancements and leading innovations. Eur. J. Technol. 2023, 7, 62–72. [Google Scholar] [CrossRef]
- Ayanouz, S.; Abdelhakim, B.A.; Benhmed, M. A smart chatbot architecture based NLP and machine learning for health care assistance. In Proceedings of the 3rd International Conference on Networking, Information Systems, & Security, Marrakech, Morocco, 31 March–2 April 2020; pp. 1–6. [Google Scholar]
- Bialkova, S. Chatbot Efficiency—Model Testing. In The Rise of AI User Applications; Springer: Cham, Switzerland, 2024. [Google Scholar] [CrossRef]
- Bilquise, G.; Ibrahim, S.; Shaalan, K. Emotionally intelligent chatbots: A systematic literature review. Hum. Behav. Emerg. Technol. 2022, 2022, 9601630. [Google Scholar] [CrossRef]
- Caldarini, G.; Jaf, S.; McGarry, K. A Literature Survey of Recent Advances in Chatbots. Information 2022, 13, 41. [Google Scholar] [CrossRef]
- Gao, J.; Tao, C.; Jie, D.; Lu, S. Invited Paper: What is AI Software Testing? and Why. In Proceedings of the IEEE International Conference on Service-Oriented System Engineering (SOSE), San Francisco, CA, USA, 4–9 April 2019. [Google Scholar] [CrossRef]
- Gao, J.; Patil, P.H.; Lu, S.; Cao, D.; Tao, C. Model-Based Test Modeling and Automation Tool for Intelligent Mobile Apps. In Proceedings of the IEEE International Conference on Service-Oriented System Engineering (SOSE), Oxford, UK, 23–26 August 2021; pp. 1–10. [Google Scholar] [CrossRef]
- Gao, J.; Li, S.; Tao, C.; He, Y.; Anumalasetty, A.P.; Joseph, E.W.; Nayani, H. An approach to GUI test scenario generation using machine learning. In Proceedings of the 2022 IEEE International Conference on Artificial Intelligence Testing (AITest), Newark, CA, USA, 15–18 August 2022; pp. 79–86. [Google Scholar]
- Kurniawan, M.H.; Handiyani, H.; Nuraini, T.; Hariyati, R.T.S.; Sutrisno, S. A systematic review of artificial intelligence-powered (AI-powered) chatbot intervention for managing chronic illness. Ann. Med. 2024, 56, 2302980. [Google Scholar] [CrossRef] [PubMed]
- Li, X.; Tao, C.; Gao, J.; Guo, H. A Review of Quality Assurance Research of Dialogue Systems. In Proceedings of the 2022 IEEE International Conference on Artificial Intelligence Testing (AITest), Newark, CA, USA, 15–18 August 2022; pp. 87–94. [Google Scholar] [CrossRef]
- Lin, C.-C.; Huang, A.Y.Q.; Yang, S.J.H. A Review of AI-Driven Conversational Chatbots Implementation Methodologies and Challenges (1999–2022). Sustainability 2023, 15, 4012. [Google Scholar] [CrossRef]
- Ngai, E.W.; Lee, M.C.; Luo, M.; Chan, P.S.; Liang, T. An intelligent knowledge-based chatbot for customer service. Electron. Commer. Res. Appl. 2021, 50, 101098. [Google Scholar] [CrossRef]
- Park, A.; Lee, S.B.; Song, J. Application of AI based Chatbot Technology in the Industry. J. Korea Soc. Comput. Inf. 2020, 25, 17–25. [Google Scholar]
- Park, D.M.; Jeong, S.S.; Seo, Y.S. Systematic review on chatbot techniques and applications. J. Inf. Process. Syst. 2022, 18, 26–47. [Google Scholar]
- Tran, C.T.P.; Valmiki, M.G.; Xu, G.; Gao, J.Z. An intelligent mobile application testing experience report. J. Phys. Conf. Ser. 2021, 1828, 012080. [Google Scholar] [CrossRef]
- Xu, L.; Sanders, L.; Li, K.; Chow, J.C. Chatbot for health care and oncology applications using artificial intelligence and machine learning: Systematic review. JMIR Cancer 2021, 7, e27850. [Google Scholar] [CrossRef] [PubMed]
- Wiech, B.A.; Kourouklis, A.; Johnston, J. Understanding the components of profitability and productivity change at the micro level. Int. J. Product. Perform. Manag. 2020, 69, 1061–1079. [Google Scholar] [CrossRef]
- Chung, I.S.; Malcolm, M.; Lee, W.K.; Kwon, Y.R. Applying conventional testing techniques for class testing. In Proceedings of the 20th International Computer Software and Applications Conference: COMPSAC’96, Seoul, Republic of Korea, 19–23 August 1996; pp. 447–454. [Google Scholar]
- Pacheco, C.; Lahiri, S.K.; Ernst, M.D.; Ball, T. Feedback-Directed Random Test Generation. In Proceedings of the 29th International Conference on Software Engineering (ICSE’07), Minneapolis, MN, USA, 20–26 May 2007; pp. 75–84. [Google Scholar] [CrossRef]
- Schieferdecker, I. Model-Based Testing. IEEE Softw. 2012, 29, 14–18. [Google Scholar] [CrossRef]
- Freitas, T.C.; Neto, A.C.; Pereira, M.J.V.; Henriques, P.R. NLP/AI Based Techniques for Programming Exercises Generation. In Proceedings of the 4th International Computer Programming Education Conference (ICPEC 2023), Vila do Conde, Portugal, 26–28 June 2023. [Google Scholar] [CrossRef]
Ref | Objective | Automated Test Validation | Test Modeling | Test Generation | Augmentation |
---|---|---|---|---|---|
[5] | Testing conversational systems with simulated users (chatbot specialized in financial advice) | No | Simulated user testing | Scenario-based and behavior-driven | Automated testing augmentation |
[6] | Sequence-to-sequence models with long short-term memory (LSTM) for open-domain dialogue response generation | No | Dialogue generation model and personality model using OCEAN personality traits | Context–response pairs from two TV series (Friends and The Big Bang Theory) | Automatically introduces variations in the input |
[7] | A metamorphic testing approach for chatbots and obtain sequences of interactions with a chatbot | No | Metamorphic testing | Metamorphic relations to guide the generation of test cases and an initial set of inputs | Constructing grammars in Backus–Naur form (BNF), mutations, and functional correctness |
[8] | Testing chatbots with Charm | No | Used Botium to test coherence, sturdiness, and precision testing | Conversation generation, utterance mutation | Fuzzing/mutation, iteration, and improvement |
[9] | Turing test applied to AI chatbots to examine how chatbots behave in a suite of classic behavioral games using ChatGPT-4 | No | Behavioral and personality testing | Classic behavioral games | Personality assessments using the Big-5 personality survey |
Our purpose | Intelligent AI 3D test modeling, test generation, data augmentation, and test result validation for chat systems in Wysa, a mental coach chatbot | Yes (Used AI-based test validation) | 3-dimensional AI test modeling | AI-based model, and NLP/ML model | AI model-based positive and negative augmentation |
Items | AI Testing | AI-Based Software Testing | Conventional Software Testing |
---|---|---|---|
Objectives | Validate and assure the quality of AI software and system by focusing on system AI functions and features | Leverage AI techniques and solutions to optimize a software testing process and its quality | Assure the system function quality for conventional software and its features |
Primary AI Testing Focuses | AI feature quality factors: accuracy, consistency, completeness, and performance | Optimize a test process in product quality increase, testing efficiency, and cost reduction | Automate test operations for a conventional software process |
System Function Testing | AI system function testing: Object detection and classification, recommendation and prediction, language translation | System functions, behavior, user interfaces | System functions, behavior, user interfaces |
Test Selection | AI test model-based test selection, classification and recommendation | Test selection, classification, and recommendation using AI techniques | Rule-based and/or experience-based test selection |
Test Data Generation | AI test model-based data discovery, collection, generation, and validation | AI-based data collection, classification, and generation | Model-based and/or pattern-based test generation |
Bug Detection and Analysis | AI model-based bug detection, analysis, and report | Data-driven analysis for bug classification, detection, and prediction | Digital and systematic bug/problem management. |
Concept | Description |
---|---|
Mood Tracking and Analysis | Tracking and analyzing an individual’s mood or emotional state over time. It involves recording and assessing mood patterns, fluctuations, and trends to gain insights into emotional well-being. |
Self-care Analysis | Analyzing and evaluating an individual’s self-care practices and habits. It involves assessing activities related to physical, mental, and emotional well-being, such as exercise, sleep, relaxation techniques, and mindfulness practices. |
Conversational Support Analysis | Analyzing the effectiveness and impact of conversational support provided by chatbots or AI systems. It involves evaluating the quality, empathy, and appropriateness of responses to users’ emotional or support-related queries or needs. |
Goal Setting | Setting specific, measurable, attainable, relevant, and time-bound (SMART) goals to promote personal growth and well-being. It involves identifying areas of improvement, defining objectives, and establishing action plans to achieve desired outcomes. |
Sentiment Analysis | Analyzing and determining the sentiment or emotional tone expressed in text or conversations. It involves classifying text as positive, negative, or neutral to understand the overall sentiment or attitude conveyed. |
Well-being Resources and Personalised Intervention | Providing personalized resources, recommendations, or interventions to support individual well-being. It involves offering tailored suggestions, activities, or resources based on an individual’s needs, preferences, or identified areas of improvement. |
Text Input | Augmentation | Text Output |
---|---|---|
I am Sad | random char insert | I am aSad |
I am Sad | random char swap | I am Sda |
I am Sad | random char delete | I am ad |
I am Sad | random word swap | Sad am I |
I am Sad | random word delete | am Sad |
I am Sad | ocr augmentation | 1 am Sad |