To ensure the success of an autonomous system, an efficient method of validation is required. A sufficient number of test cases must be developed to show that the system can perform acceptably in the real-world environment in which it was designed to operate. There is a significant body of research related to validation and test case generation techniques for artificial intelligence systems. Existing work in four areas (testing artificial intelligence systems using test cases, artificial intelligence-based test case generation, testing as a search problem, and software and artificial intelligence failures) is now reviewed.
2.1. Testing Artificial Intelligence (AI) with Test Cases
One of the most basic forms of testing an autonomous system is with manually generated test cases. This involves a human tester creating scenarios that will be presented to the AI, or under which the performance of the AI will be evaluated. Feigenbaum [4], for example, reviewed artificial intelligence systems designed for diagnosis based on medical case studies and concluded that the modularity of the “Situation → Action” technique allowed rules to be changed or added easily as the expert’s knowledge of the domain grew. This allowed more advanced cases to be used for validation.
Chandrasekaran [5] suggests that the evaluation of an AI must not be based only on the final result. In [5], an approach to the validation of Artificial Intelligence Medical (AIM) systems for medical decision-making is presented. The paper also examines some of the problems encountered during AIM evaluations. During performance analysis of AI systems, evaluating success or failure based upon the final result may not show the entire picture: intermediate execution could show acceptable results even though the final result is unsatisfactory. Evaluating important steps in reasoning can help alleviate this issue.
Another example of AI testing with test cases is presented by Cholewinski et al. [6], who discuss the Default Reasoning System (DeReS) and its validation through test cases derived from TheoryBase, a benchmarking system “designed to support experimental investigations of nonmonotonic reasoning systems based on the language of default logic or logic programming”. Through the use of TheoryBase-generated default theories, DeReS was shown to be a success. Cholewinski et al. also proffer that TheoryBase can be used as a standalone system and that any non-monotonic reasoning system can use it as a benchmarking tool.
Brooks [7] comments on the use of simulation testing. In [7], the possibility of controlling mobile robots with programs that evolve using artificial life techniques is explored. Brooks has not implemented or tested the ideas presented; however, some intriguing notions regarding simulation and testing of physical robots are discussed. Using simulated robots for testing, before running the programs on physical robots, has generally been avoided for two reasons [8,9,10]. First, for real-world autonomous systems, there is a risk that the time spent resolving issues identified in a simulated environment will be wasted because the events that occur in simulation differ from those in the real operating space. Second, emulating real-world dynamics in a simulated environment is difficult, particularly because real-world sensing differs from its simulated counterpart; this increases the chance of the program behaving differently in the real world. The use of simulated robots for testing may uncover basic issues impairing a control program; however, this approach tends not to uncover some of the problems encountered in a real-world environment.
The previous studies have related to validation methods and test cases used to assess AI systems designed for practical or complex tasks in everyday life. Billings et al. [11], alternately, explore the use of an AI designed for competition rather than the performance of particular jobs. Poki is an AI-driven program built as an autonomous substitute for human players in world-class poker tournaments (specifically, Texas Hold’em tournaments). In poker, players constantly adapt their play over the course of many hands. Two methods are discussed to validate the program: self-play and live-play.
Self-play tests are a simple method of validation in which an older version of the tested program is pitted against the current version. This allows a great variety of hands to be played in a short amount of time; however, it does not expose the program to the range of styles and adaptive behavior exhibited by human opponents. Live-play tests seek to alleviate this problem and are, thus, considered by Billings et al. to be essential for accurate evaluation. Implementing the poker AI as part of an online game is one of the more effective ways to test performance, as thousands of players are able to play at any given time.
In testing Poki, Billings et al. tested each version of the program for 20,000 hands, using the average number of small bets won per hand as the performance measure.
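The self-play protocol can be summarized in a short sketch. The following is a minimal illustration (not Billings et al.’s implementation) of pitting a current agent against an older version and reporting average small bets won per hand; the `Agent` class and the toy payoff function are hypothetical stand-ins for a real poker engine.

```python
import random

class Agent:
    """Hypothetical stand-in for a poker agent; `skill` abstracts away actual strategy."""
    def __init__(self, name, skill):
        self.name = name
        self.skill = skill

def play_hand(agent_a, agent_b, rng):
    """Toy payoff in small bets for agent_a: positive if it comes out ahead on the hand.
    A real harness would deal cards and query each agent for betting decisions."""
    edge = agent_a.skill - agent_b.skill
    return rng.gauss(mu=edge, sigma=4.0)  # poker outcomes are high-variance

def self_play_evaluation(current, previous, num_hands=20000, seed=0):
    """Average small bets won per hand by `current` against `previous`."""
    rng = random.Random(seed)
    total = sum(play_hand(current, previous, rng) for _ in range(num_hands))
    return total / num_hands

if __name__ == "__main__":
    current = Agent("poki-v2", skill=0.05)
    previous = Agent("poki-v1", skill=0.00)
    sb_per_hand = self_play_evaluation(current, previous)
    print(f"Average small bets won per hand: {sb_per_hand:+.3f}")
```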
The work done on the Poki poker system shows that validating an AI by testing it against another AI (itself, in this case) is a helpful tool for evaluating system performance. Real-world test cases are also shown to be critical in validating the utility of the AI.
2.2. AI Test Case Generation
While manual test case generation may be suitable for a system where the scope of performance is limited, systems that have to operate in a real-world environment must function under a large variety of conditions. Given this, a more efficient approach to test case generation is desirable.
Dai, Mausam, and Weld [12] deal with a similar problem, except in the context of evaluating human performance on a large scale. They look at using an AI adaptive workflow, based on their TurKontrol software, to increase the performance of a decision-making application. Their workflow controller is trained with real-world cases from Amazon’s Mechanical Turk, which utilizes humans to perform repetitive tasks such as image description tagging. A model of performance and iterative assessment is utilized to ensure appropriate quality. By autonomously determining whether additional human review and revision was required, TurKontrol was able to increase quality performance by 11%. Dai, Mausam, and Weld note that the cost of this increased performance is not linear and that an additional 28.7% increase in cost would be required to achieve a comparable level of performance.
The work performed by Dai, Mausam, and Weld provides an implementation framework for autonomously revising AI performance, based upon their work in assessing and refining human performance. The approach can be extended to incorporate AI workers and evaluators, for applications where these tasks can be suitably performed autonomously.
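The kind of decision this workflow automates, whether a further round of review and revision is worth its cost, can be sketched as follows. This is an illustrative approximation rather than Dai, Mausam, and Weld’s decision-theoretic model; the quality estimator, cost figures, and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Artifact:
    text: str
    estimated_quality: float  # in [0, 1], e.g., from a learned quality model

def estimate_improvement(quality: float) -> float:
    """Hypothetical model: expected quality gain from one more improvement round,
    with diminishing returns as quality approaches 1."""
    return 0.5 * (1.0 - quality)

def should_request_revision(artifact: Artifact, cost_per_round: float,
                            value_per_quality_point: float, budget: float) -> bool:
    """Request another review/revision round only when the expected value of the
    quality gain exceeds its cost and the remaining budget allows it."""
    expected_gain = estimate_improvement(artifact.estimated_quality)
    return (budget >= cost_per_round and
            expected_gain * value_per_quality_point > cost_per_round)

if __name__ == "__main__":
    draft = Artifact(text="A cat sitting on a red sofa.", estimated_quality=0.62)
    budget = 1.00
    while should_request_revision(draft, cost_per_round=0.10,
                                  value_per_quality_point=1.0, budget=budget):
        budget -= 0.10
        # A real workflow would post an improvement task and re-estimate quality;
        # here we simply apply the hypothetical expected gain.
        draft.estimated_quality += estimate_improvement(draft.estimated_quality)
    print(f"Final estimated quality: {draft.estimated_quality:.2f}, budget left: {budget:.2f}")
```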
Pitchforth and Mengersen [13] deal with the problem of testing an AI system. Specifically, they look at the process of validating a Bayesian network that is based on data from a subject matter expert. They note that previous approaches to validation either involved comparing the output of the created network to pre-existing data or relied upon an expert to review and provide feedback on the proposed network. Pitchforth and Mengersen proffer, however, that these approaches fail to fully test the networks’ validity.
While Pitchforth and Mengersen do not provide a specific method for the development of use and test cases, their analysis of the validation process required for a Bayesian network informs the process of creating them. It appears that use cases are relevant throughout their validation framework and test cases are specifically relevant to the analysis of concurrent and predictive validity. Moreover, the convergent and divergent analysis processes may inform the types of data that are required and well suited for test case production.
The use of AI in software development and debugging is also considered by Wotawa, Nica, and Nica [14], who discuss the process of debugging via localizing faults. Their proposed approach, based on model-based diagnosis, is designed to repeatedly test a program or area of code to determine whether it functions properly. To this end, they propose an approach that involves creating base test cases and applying a mutation algorithm to adapt them.
While Wotawa, Nica, and Nica’s work is quite limited (as they note) in the context of line-by-line review of program code, the fundamental concept is exceedingly powerful. Input parameters can be mutated extensively without having to create a mechanism to generate an associated success condition.
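The core idea built upon here, taking a base test case and mutating its inputs to exercise additional behaviors without hand-writing a success condition for each variant, can be sketched as below. This is a generic illustration rather than Wotawa, Nica, and Nica’s model-based-diagnosis tooling; the function under test and the mutation operator are invented for the example.

```python
import random

def function_under_test(x: int, y: int) -> int:
    """Hypothetical function being debugged: intended to return x // y."""
    return x // y

# A base test case supplied by a human tester.
BASE_CASE = {"x": 10, "y": 2}

def mutate(case: dict, rng: random.Random) -> dict:
    """Apply a simple mutation operator to one input parameter."""
    mutated = dict(case)
    key = rng.choice(list(mutated))
    mutated[key] += rng.choice([-1, 1]) * rng.randint(0, 10)
    return mutated

def generate_mutated_cases(base: dict, count: int = 20, seed: int = 1):
    """Derive new test cases from the base case; crashes are flagged without
    needing an explicit oracle for every mutated input."""
    rng = random.Random(seed)
    failures = []
    for _ in range(count):
        case = mutate(base, rng)
        try:
            function_under_test(**case)
        except Exception as exc:  # e.g., ZeroDivisionError when y mutates to 0
            failures.append((case, exc))
    return failures

if __name__ == "__main__":
    for case, exc in generate_mutated_cases(BASE_CASE):
        print(f"Input {case} raised {type(exc).__name__}")
```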
AdiSrikanth et al. [15], on the other hand, deal with a more generalizable approach. They propose a method for test case creation based upon an artificial bee colony algorithm. This algorithm is a swarm intelligence approach in which three classes of virtual bees (employed bees, onlookers, and scouts) are utilized to find an optimal solution. Bees seek to identify “food sources” with the maximum amounts of nectar.
In the implementation for optimizing test cases, a piece of code is provided to AdiSrikanth et al.’s tool. This software creates a control flow graph based on the input, identifies all independent paths, and creates test cases that cause the traversal of these paths. Optimization is achieved via a fitness value metric.
This work demonstrates the utility of swarm intelligence techniques for test case generation and refinement. AdiSrikanth et al., regrettably, fail to consider the time-cost of their proposed approach. While an optimal solution for a small program can be generated fairly quickly, the iterative approach that they utilize may be overly burdensome for a larger codebase.
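A compressed sketch of this kind of fitness-driven test case search is given below. It keeps the bee-colony vocabulary (food sources as candidate test suites, scouts replacing abandoned sources) but greatly simplifies the algorithm; the program under test and its path-based fitness function are hypothetical and are not taken from AdiSrikanth et al.

```python
import random

def program_under_test(x: int) -> str:
    """Hypothetical program whose independent paths the test cases should traverse."""
    if x < 0:
        return "negative"
    if x % 2 == 0:
        return "even"
    return "odd"

def fitness(inputs) -> int:
    """Fitness of a candidate test suite: number of distinct paths exercised."""
    return len({program_under_test(x) for x in inputs})

def abc_style_search(colony_size=10, suite_size=3, generations=50, seed=2):
    rng = random.Random(seed)
    # Each "food source" is a candidate test suite (a tuple of inputs).
    sources = [tuple(rng.randint(-50, 50) for _ in range(suite_size))
               for _ in range(colony_size)]
    best = max(sources, key=fitness)
    for _ in range(generations):
        for i, source in enumerate(sources):
            # Employed/onlooker bees: perturb one input and keep the better suite.
            neighbour = list(source)
            neighbour[rng.randrange(suite_size)] = rng.randint(-50, 50)
            neighbour = tuple(neighbour)
            if fitness(neighbour) >= fitness(source):
                sources[i] = neighbour
            elif rng.random() < 0.1:
                # Scout bee: abandon a stagnant source for a random new one.
                sources[i] = tuple(rng.randint(-50, 50) for _ in range(suite_size))
        best = max(sources + [best], key=fitness)
    return best

if __name__ == "__main__":
    suite = abc_style_search()
    print("Test inputs:", suite, "paths covered:", fitness(suite))
```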
Similar to the bee colony work performed by AdiSrikanth et al. is the ant colony optimization (ACO) work performed by Suri and Singhal [2]. Suri and Singhal look at using ACO for regression test selection. Specifically, they look at how regression tests should be prioritized to maximize the value of regression testing, given a specific amount of time within which to perform the testing.
The time requirements for ACO-selection-based execution ranged between 50% and 90% of the time required to run the full test suite. It appears that the average is around the 80% mark.
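The flavor of this approach, ants probabilistically assembling test orderings under a time budget, with pheromone reinforcing selections that detect many faults, is sketched below. This is not Suri and Singhal’s algorithm; the test suite, execution times, fault data, and parameter values are hypothetical.

```python
import random

# Hypothetical regression suite: per-test execution time and faults historically detected.
TESTS = {
    "t1": {"time": 4, "faults": {"f1", "f2"}},
    "t2": {"time": 2, "faults": {"f2"}},
    "t3": {"time": 5, "faults": {"f3", "f4"}},
    "t4": {"time": 1, "faults": {"f1"}},
    "t5": {"time": 3, "faults": {"f5"}},
}
TIME_BUDGET = 8

def coverage(selection):
    return set().union(*(TESTS[t]["faults"] for t in selection))

def aco_prioritize(ants=20, iterations=30, evaporation=0.2, seed=3):
    rng = random.Random(seed)
    pheromone = {t: 1.0 for t in TESTS}
    best, best_cov = [], set()
    for _ in range(iterations):
        for _ in range(ants):
            remaining, selection, elapsed = set(TESTS), [], 0
            while True:
                feasible = [t for t in remaining if elapsed + TESTS[t]["time"] <= TIME_BUDGET]
                if not feasible:
                    break
                # Selection probability proportional to pheromone times a faults-per-time heuristic.
                weights = [pheromone[t] * len(TESTS[t]["faults"]) / TESTS[t]["time"]
                           for t in feasible]
                choice = rng.choices(feasible, weights=weights, k=1)[0]
                selection.append(choice)
                elapsed += TESTS[choice]["time"]
                remaining.remove(choice)
            if len(coverage(selection)) > len(best_cov):
                best, best_cov = selection, coverage(selection)
        # Evaporate pheromone, then reinforce the tests in the best ordering found so far.
        for t in pheromone:
            pheromone[t] *= (1 - evaporation)
        for t in best:
            pheromone[t] += len(best_cov) / max(1, len(best))
    return best, best_cov

if __name__ == "__main__":
    order, faults = aco_prioritize()
    print("Prioritized tests within budget:", order, "faults covered:", sorted(faults))
```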
A more general view is presented by Harman [16], who reviews how artificial intelligence techniques have been used in software engineering. He proffers that three categories of techniques have received significant use: optimization and search, fuzzy reasoning, and learning. The first, optimization and search, underpins the field of “Search Based Software Engineering” (SBSE), which converts software engineering challenges into optimization tasks and thereby brings the wealth of solution-search knowledge in the AI optimization domain to bear on software engineering problems. Fuzzy reasoning is used by software engineers to consider real-world problems of a probabilistic nature. Harman proffers that the continued integration of AI techniques into software engineering is all but inevitable, given the growing complexity of modern programs.
2.3. Testing as a Search Problem
Validation can also be conceived of as a search problem, in which the search’s ‘solution’ is a problem in the system being tested. Several search approaches relevant to this are now reviewed. The use of most of these approaches for testing remains to be explored, and future work in this area may include comparing these approaches and evaluating their performance across various testing applications.
Pop et al. [17], for example, present an enhancement of the Firefly search algorithm that is designed to elicit optimal, or near-optimal, solutions to a semantic web service composition problem. Their approach combines the signaling mechanism utilized by fireflies in nature with a random modification approach.
Pop et al. compare the firefly solution with a bee-style solution. The bee-style solution took 44% longer to run, processing 33% more prospective solutions during this time. The firefly approach had a higher standard deviation (0.007 versus 0.002). Pop et al. assert that their work has demonstrated the feasibility of this type of approach.
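For readers unfamiliar with the underlying metaheuristic, a minimal firefly search over a toy continuous objective is sketched below. It shows only the generic mechanism (dimmer fireflies move toward brighter ones, with a random perturbation); Pop et al.’s service composition encoding and their enhancements are not reproduced, and the objective function and parameter values are assumptions.

```python
import math
import random

def objective(x):
    """Toy minimization target standing in for a composition quality score."""
    return sum(v * v for v in x)

def firefly_search(dim=2, n_fireflies=15, generations=50,
                   beta0=1.0, gamma=1.0, alpha=0.2, seed=4):
    rng = random.Random(seed)
    flies = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_fireflies)]
    for _ in range(generations):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                # A firefly moves toward any firefly whose "signal" (fitness) is better;
                # attractiveness decays with the squared distance between them.
                if objective(flies[j]) < objective(flies[i]):
                    r2 = sum((a - b) ** 2 for a, b in zip(flies[i], flies[j]))
                    beta = beta0 * math.exp(-gamma * r2)
                    flies[i] = [a + beta * (b - a) + alpha * (rng.random() - 0.5)
                                for a, b in zip(flies[i], flies[j])]
    best = min(flies, key=objective)
    return best, objective(best)

if __name__ == "__main__":
    best, score = firefly_search()
    print("Best solution:", [round(v, 3) for v in best], "score:", round(score, 5))
```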
Shah-Hosseini [18] presents an alternate approach, called the Intelligent Water Drop (IWD) approach, to problem solving that utilizes artificial water drops with properties mirroring those of water drops in nature. Two properties of water drops are important. The first is their soil-carrying capability: the water drops, collectively, pick up soil from fast-moving parts of the river and deposit it in the slower parts. Second, the water drops choose the most efficient (easiest) path from their origin to their destination. The IWD method can be utilized to find the best (or near-best) path from source to destination. It can also be utilized to find an optimal solution (destination) to a problem that can be assessed by a single metric. Duan, Liu, and Wu [19] demonstrate the IWD approach’s real-world applicability through route generation and smoothing for an unmanned combat aerial vehicle (UCAV).
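A simplified path-finding sketch in the spirit of the IWD method is shown below. It keeps the two properties described above (edges with less soil are preferred, and faster drops erode more soil) but omits the full algorithm’s global soil update; the graph, constants, and update formulas are simplified assumptions rather than Shah-Hosseini’s exact equations.

```python
import random

# Hypothetical weighted graph; edge length stands in for the difficulty of a path segment.
GRAPH = {
    "A": {"B": 2.0, "C": 5.0},
    "B": {"C": 1.5, "D": 4.0},
    "C": {"D": 2.0},
    "D": {},
}

def iwd_path(source="A", destination="D", drops=30, seed=5,
             init_soil=100.0, init_velocity=4.0, rho=0.9):
    rng = random.Random(seed)
    soil = {(u, v): init_soil for u, nbrs in GRAPH.items() for v in nbrs}
    best_path, best_len = None, float("inf")
    for _ in range(drops):
        node, velocity, path = source, init_velocity, [source]
        while node != destination and GRAPH[node]:
            nbrs = list(GRAPH[node])
            # Edges holding less soil are "easier"; shift by the most negative soil
            # value so that all selection weights stay positive.
            min_soil = min(soil[(node, v)] for v in nbrs)
            shift = -min_soil if min_soil < 0 else 0.0
            weights = [1.0 / (0.01 + soil[(node, v)] + shift) for v in nbrs]
            nxt = rng.choices(nbrs, weights=weights, k=1)[0]
            # The drop speeds up on low-soil edges and removes more soil when it
            # traverses an edge quickly (short travel time).
            velocity += 1.0 / (0.01 + soil[(node, nxt)] ** 2)
            removed = 1.0 / (0.01 + (GRAPH[node][nxt] / velocity) ** 2)
            soil[(node, nxt)] = (1 - rho) * soil[(node, nxt)] - rho * removed
            path.append(nxt)
            node = nxt
        if node == destination:
            length = sum(GRAPH[u][v] for u, v in zip(path, path[1:]))
            if length < best_len:
                best_path, best_len = path, length
    return best_path, best_len

if __name__ == "__main__":
    path, length = iwd_path()
    print("Best path found:", " -> ".join(path), "length:", length)
```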
Yet another search technique is presented by Gendreau, Hertz, and Laporte [20], who discuss an application of a metaheuristic improvement method entitled the Tabu Search, which was developed by Glover [21,22]. This approach takes its name from the use of a ‘Tabu List’, which prevents redundant visits to recently visited nodes by placing them on a list of nodes to avoid. The approach is open-ended and allows the exploration of solutions that are worse than the current one, so that the search can escape local minima in pursuit of the global minimum.
Fundamentally, as an improvement method, the Tabu Search visits adjacent solutions to the current solution and selects the best one to be the new current solution. Because of this, it can be initialized with any prospective solution (even an infeasible one). Gendreau, Hertz, and Laporte evaluate this search in the context of TABUROUTE, a solution to the vehicle routing problem. They conclude that the Tabu Search outperformed the best existing heuristic-based searches and that it frequently arrives at optimal or best known solutions.
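The mechanism just described (move to the best non-tabu neighbor even when it is worse, so the search can climb out of a local minimum) can be illustrated with a minimal sketch. The toy objective and neighborhood below are assumptions for illustration and have nothing to do with TABUROUTE or vehicle routing.

```python
from collections import deque

def cost(x: int) -> int:
    """Toy objective with a local minimum at x = 2 and the global minimum at x = 10."""
    return min((x - 2) ** 2 + 3, (x - 10) ** 2)

def tabu_search(start=0, iterations=50, tabu_size=5):
    current = best = start
    tabu = deque([start], maxlen=tabu_size)  # recently visited solutions to avoid
    for _ in range(iterations):
        neighbours = [current - 1, current + 1]
        candidates = [n for n in neighbours if n not in tabu] or neighbours
        # Move to the best non-tabu neighbour even if it is worse than the current
        # solution; this is what lets the search leave local minima.
        current = min(candidates, key=cost)
        tabu.append(current)
        if cost(current) < cost(best):
            best = current
    return best, cost(best)

if __name__ == "__main__":
    solution, value = tabu_search()
    print("Best solution:", solution, "cost:", value)
```

Starting from x = 0, the search settles briefly in the local minimum at x = 2, but because revisiting recent solutions is forbidden it is forced outward and eventually reaches the global minimum at x = 10.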
Finally, Yang and Deb [23] propose a search based upon the egg-laying patterns of the cuckoo, a bird that lays its eggs in the nests of birds of other species. The bird that built the nest in which the cuckoo lays its egg may, if it detects that the egg is not its own, destroy the egg or abandon the nest. The Cuckoo Search parallels this. Each automated cuckoo creates an egg, which is a prospective problem solution, and places it into a nest at random. Nests holding the generation’s best solutions persist into the next generation, while a set fraction of those containing the worst-performing solutions are destroyed; there is a defined probability of each nest being destroyed or its egg removed (paralleling the discovery of the egg by the host bird in nature). New nests are created at new locations (reached via Levy flights) to replace the nests destroyed, maintaining a fixed number of nests.
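A minimal Cuckoo Search sketch over a toy objective is given below. The Levy flight is approximated here with a heavy-tailed Pareto step, and the objective, abandonment fraction, and other parameters are assumptions for illustration rather than Yang and Deb’s settings.

```python
import random

def objective(x):
    """Toy fitness standing in for the quality of a candidate solution (lower is better)."""
    return sum(v * v for v in x)

def levy_step(rng, scale=0.1):
    """Heavy-tailed step used as a simple stand-in for a Levy flight."""
    return scale * rng.paretovariate(1.5) * rng.choice([-1.0, 1.0])

def cuckoo_search(dim=2, n_nests=15, generations=100, pa=0.25, seed=6):
    rng = random.Random(seed)
    nests = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_nests)]
    for _ in range(generations):
        # A cuckoo lays a new egg (solution) reached by a Levy flight from a random nest
        # and places it in another randomly chosen nest if it is better.
        source = rng.choice(nests)
        egg = [v + levy_step(rng) for v in source]
        target = rng.randrange(n_nests)
        if objective(egg) < objective(nests[target]):
            nests[target] = egg
        # A fraction pa of the worst nests is abandoned (the host bird discovers the egg)
        # and replaced with nests at new random locations.
        nests.sort(key=objective)
        for i in range(int((1 - pa) * n_nests), n_nests):
            nests[i] = [rng.uniform(-5, 5) for _ in range(dim)]
    best = min(nests, key=objective)
    return best, objective(best)

if __name__ == "__main__":
    best, score = cuckoo_search()
    print("Best solution:", [round(v, 3) for v in best], "score:", round(score, 6))
```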
Walton, Hassan, Morgan, and Brown [24] refine this approach. Their Modified Cuckoo Search incorporates two changes designed to increase the speed of convergence to an optimal solution. First, they change the Levy flight step size from a fixed value to one that declines on a generation-by-generation basis, with the step size for generation G equal to the initial value divided by the square root of the generation number (α_G = α_0/√G). Second, they create a mechanism to seed new eggs based upon the best currently known eggs: a collection of top eggs is selected, and two of these eggs are combined. Walton, Hassan, Morgan, and Brown assert that the Modified Cuckoo Search outperformed the Cuckoo Search in all test cases presented and that it also performed comparably to or outperformed the Particle Swarm Optimization approach.
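The two modifications can be expressed as small helper functions, shown below as a sketch that could extend the Cuckoo Search code above. The exact constants and the combination rule used by Walton et al. are not reproduced; the midpoint crossover here is a simplification of their scheme, which moves a fraction of the distance between the two parent eggs.

```python
import math
import random

def mcs_step_size(initial_step: float, generation: int) -> float:
    """Levy-flight step size that shrinks each generation: alpha_G = alpha_0 / sqrt(G)."""
    return initial_step / math.sqrt(generation)

def seed_from_top_eggs(top_eggs, rng: random.Random):
    """Create a new egg by combining two of the best currently known eggs
    (here, simply their midpoint)."""
    a, b = rng.sample(top_eggs, 2)
    return [(x + y) / 2.0 for x, y in zip(a, b)]

if __name__ == "__main__":
    rng = random.Random(7)
    print([round(mcs_step_size(1.0, g), 3) for g in range(1, 6)])
    print(seed_from_top_eggs([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]], rng))
```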
Bulatovic, Dordevic, and Dordevic [25] demonstrate the utility of the Cuckoo Search for real-world problems, utilizing it to optimize 20 design variables as part of solving the six-bar double dwell linkage problem in mechanical engineering. Gandomi, Yang, and Alavi [26] demonstrate its utility on a second set of real-world problems related to design optimization in structural engineering.