ExtendAIST: Exploring the Space of AI-in-the-Loop System Testing

Abstract: The AI-in-the-loop system (AIS) has been widely used in autonomous decision and control systems, such as computer vision, autonomous vehicles, and collision avoidance systems. An AIS generates and updates control strategies through learning algorithms, which makes its control behaviors non-deterministic and raises the test oracle problem in AIS testing. Traditional systems are mainly concerned with safety, reliability, and real-time properties, whereas an AIS is more concerned with correctness, robustness, and stiffness. To support AIS testing with existing testing techniques according to the testing requirements, this paper presents ExtendAIST, an extendable framework for AI-in-the-loop system testing that explores the key steps involved in the testing procedure. The paper defines the execution steps of ExtendAIST and the design space of testing techniques. Furthermore, the ExtendAIST framework provides three elements for AIS testing: (a) the extension points; (b) the sub-extension points; and (c) the existing techniques commonly present in each point. Testers can therefore obtain a testing strategy directly from existing techniques for the corresponding testing requirements, or extend new techniques based on these extension points.


Introduction
With the broad application of artificial intelligence in wearable devices, autonomous cars, and smart cities, the traditional embedded system with predefined control strategies, including sensors, controllers, and actuators, has evolved into the AI-in-the-loop system (AIS). An AI-in-the-loop system is an intelligent system capable of self-learning and autonomous decision making, because Artificial Intelligence (AI) components or modules are embedded in it or AI solutions are implemented in an embedded system. For instance, autonomous cars [1] perceive the surrounding driving environment through sensors such as radar and cameras. Based on the perception data, autonomous cars perform tasks such as obstacle detection and tracking, traffic signal detection and recognition, and route planning through AI modules implemented with machine learning algorithms. Autonomous cars then make decisions for these tasks and drive automatically on the road by interacting autonomously with the driving environment.
A safety-critical system with an embedded AI module is potentially unsafe. For example, the adaptive cruise control system is an intelligent system that makes cars drive automatically by interacting with the driving environment and learning the corresponding driving behaviors, such as speeding up, slowing down, and braking. However, the traffic accident in which a Tesla Model S crashed because it failed to distinguish a white truck from the sky indicates that the AI module may be vulnerable to perturbation and can make the AI-in-the-loop system unsafe. Once a fault occurs in the AI module without any safety control behavior from the system or a person, the result is either an erroneous decision or a fatal consequence. Therefore, it is crucial to detect erroneous behaviors in AIS and assure its properties (e.g., correctness and robustness) through suitable testing techniques.
Some researchers have reviewed the progress of machine learning testing from different perspectives. For example, Hains et al. [2] investigated the existing verification and simulation techniques for the safety of machine learning. Masuda et al. [3] reviewed the error detection of machine learning and the applications of conventional software testing to machine learning. Braiek et al. [4] studied the datasets and the black-box and white-box testing techniques for machine learning testing. Ma et al. [5] discussed the safety challenges in machine learning systems. Huang et al. [6] introduced the testing and verification techniques for the safety of machine learning. Zhang et al. [7] presented a comprehensive survey on machine learning testing.
However, the above-mentioned studies mainly focus on the testing or verification techniques without a systematic testing procedure from test data generation to evaluation except the review in [7], and few studies present the differences between testing conventional system and AI-in-the-loop system. Even though Zhang et al. [7] presented a thorough introduction of machine learning testing, including the techniques of detecting data, learning program, framework, open-source tool, and testing objects, they did not mention how to develop a new technique and the further work researchers can conduct based on each testing step. Inspired by the above enquiries, we follow a similar research method as Astor [8] to propose an Extendable framework of AIS Testing, namely ExtendAIST. The ExtendAIST takes each involved testing step as extension point and implemented approaches as sub-extension points to organize the design space of AIS testing techniques practically. Additionally, this framework provides the existing techniques commonly present in individual sub-extension point. Therefore, debuggers can reuse these techniques, extend new techniques over the extension and sub-extension points, or implement new points in the space of AIS testing.
The primary contributions of our paper are summarized as follows.
• The AI-in-the-loop system is proposed to discuss the differences between testing an ordinary system and testing an AI-in-the-loop system in terms of testing coverage, test data, testing property, and testing technique, based on the distinct nature of AI modules.
• The testing workflow of the AI-in-the-loop system is presented to show the individual steps, including dataset generation, training, and testing.
• To the best of our knowledge, this is the first time researchers explore the design space of AI-in-the-loop system testing and present an extendable framework, ExtendAIST. This framework provides five extension points, 19 sub-extension points, and existing techniques that researchers can use directly or extend with new techniques for the corresponding testing requirements.
The remainder of this paper is organized as follows. Section 2 introduces the motivation of ExtendAIST including the architecture of AI-in-the-loop system, the differences between testing ordinary system and AI-in-the-loop system, and the AIS testing workflow. Section 3 describes the architecture, design, and illustration example of ExtendAIST. Section 4 presents the extension points, sub-extension points, and existing techniques in ExtendAIST. Discussions and conclusions are given in Sections 5 and 6.

AI-in-the-Loop System
The conventional embedded system takes data from sensors as input and computes the control strategies according to a program written for a specific task. The controllers output the corresponding decision based on the control strategies and transform this decision into a command for the actuators. The actuators then perform operations to control the entire system. Thus, the conventional embedded system is logically deterministic, with a fixed program and fixed control behaviors. In contrast, the control strategies of an AI-in-the-loop system are obtained by learning from training data under a learning algorithm, which leads to less precise and non-deterministic strategies. As shown in Figure 1, the AIS learns its strategies for a specific application either online from sensor data or offline from offline data. It then makes a decision according to the learned strategies and forwards this decision to the actuators as a command. In this way, a system with an embedded AI module can perform autonomous learning, decision making, and control with a machine learning program. The testing of a conventional system follows a general procedure: selecting test coverage, generating test cases, executing test cases, and comparing the actual outputs with the expected outputs. Nevertheless, because of the nature of machine learning programs, the control strategies may vary even for the same training data, which may produce different or uninterpretable control behaviors and weaken the correctness and robustness of the AI-in-the-loop system. Therefore, the oracle problem [9] arises in testing an AI-in-the-loop system.

AI-in-the-Loop System Testing
The nature of AI module design and implementation, the complex requirements, the large-scale networks, and the non-deterministic and uninterpretable learning results bring great challenges to testing an AI-in-the-loop system [10][11][12] and distinguish it from traditional system testing. A traditional system is a logically deterministic program that produces a definite output for a given system under test (SUT) and test case. Its execution behavior can be tracked and interpreted through deterministic control flow and data flow. However, the output of an AI module is non-deterministic. For example, a machine learning based AI module is a data-driven program [7] that delivers a trained model by learning from the training dataset under the learning algorithm. The predicted label of the trained model may differ even for the same test data, because the trained model evolves with different training datasets, so that different weights and biases are learned. The testing approaches commonly used for ordinary systems and non-AI modules are therefore not suitable for testing AI modules. New testing coverage metrics and test data are necessary when performing unit testing of an AI-in-the-loop system, and more testing properties and testing techniques should be investigated when conducting system testing of an AI-in-the-loop system for behavior consistency.
For unit testing, the test adequacy of a traditional system can be measured by control flow coverage (e.g., SC, DC, CC, and MC/DC) and data flow coverage (e.g., p-use and c-use). In addition to these metrics, which still apply to non-AI modules, the test adequacy of AI modules should be measured by the structural coverage of the neural network because of its non-deterministic logic, for example neuron level coverage [13][14][15], layer level coverage [15], and neuron pair level coverage [16]. With these metrics, all activated neurons leading to major function behaviors and corner-case behaviors can be covered. A common dataset designed for an application area sometimes consists of training and testing data. However, the common dataset is not sufficient for every testing task. To meet the specific testing tasks of an AI-in-the-loop system, a new dataset, including a training dataset and a testing dataset, is generated by test data generation strategies. The training dataset is used to train the learning program to obtain a trained model, and the testing dataset is used to evaluate the training effectiveness of the trained model.
For the system testing of traditional system and AI-in-the-loop system, the traditional system testing mainly focuses on the following functional and non-functional properties, such as correctness, real-time, reliability and safety. Due to the non-deterministic behavior of AI module, the AI-in-the-loop system suffers from the oracle problem. Therefore, more properties beyond those of traditional systems, such as the robustness [7] and stiffness [17] of the AI-in-the-loop system, should be tested by checking the behavior consistency. The robustness measures the resilience of AI-in-the-loop system towards perturbations on data or learning program. For example, if the trained model misclassifies the test data with imperceptible perturbation, then AI-in-the-loop system is said to be of low robustness. The stiffness is proposed to measure the resistance of AI-in-the-loop system by the effect of a small gradient step on one test data upon the loss on another test data. That is, the predicted label of AI-in-the-loop system is convergent for the gradient alignment between different test data. To test unique properties of the AIS, more testing techniques such as adversarial attack [18] and generative adversarial network [19] are proposed. The adversarial attack is a technique generating adversarial perturbations on the original test data to determine whether the AIS is vulnerable to adversarial perturbations.
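The stiffness notion described above can be made concrete with a small sketch. It assumes, as one possible reading, that stiffness between two test data is measured by the alignment (sign and cosine) of their loss gradients with respect to the model parameters. The logistic-regression model and the function names `loss_gradient` and `stiffness` are illustrative assumptions, not part of ExtendAIST or of [17].

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_gradient(w, x, y):
    """Gradient of the binary cross-entropy loss w.r.t. weights w for one example (x, y)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]


def stiffness(w, example_a, example_b):
    """Return (sign, cosine) of the alignment between the two loss gradients.

    A positive sign means a small gradient step that lowers the loss on
    example_a also lowers the loss on example_b (assumed stiffness measure).
    """
    g_a = loss_gradient(w, *example_a)
    g_b = loss_gradient(w, *example_b)
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm = math.sqrt(sum(a * a for a in g_a)) * math.sqrt(sum(b * b for b in g_b))
    cosine = dot / norm if norm > 0 else 0.0
    sign = (dot > 0) - (dot < 0)
    return sign, cosine
```

Two examples with the same label and similar inputs yield a positive sign, whereas the same input with opposite labels yields a negative sign, which matches the intuition that the model cannot improve on both at once.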
Testing AI-in-the-loop system is a process of detecting system erroneous behaviors, which aims to guarantee the properties of system, such as correctness and robustness. The AIS learns from the input training dataset and predicts the behavior with the trained model. Therefore, the correctness of the system can be determined by checking whether the output behaviors are consistent with the requirements. The robustness can be determined by checking whether the output behaviors are affected by the adversarial perturbation. As shown in Figure 2, given an AI-in-the-loop system, the testing activity is conducted as the following steps.
1. Generate the test samples, including the training dataset and the testing dataset, through test data generation algorithms. The training dataset can be selected from common benchmarks or designed for a particular application area to train the learning model. The testing dataset is generated according to the test coverage requirement to test the trained model.
2. When testing an ordinary system, the testing procedure follows the black arrows in Figure 2. The system executes the testing dataset and outputs a deterministic decision according to the predefined rules.
3. When testing an AI-in-the-loop system, the testing procedure follows the red and black arrows in Figure 2 for the AI module and the other function modules, respectively. The trained model is learned from the training dataset under the training model. Then, the AIS executes the testing dataset and outputs the predicted non-deterministic decision.
4. After the test execution procedure, the test report is generated to indicate the test results. A failed test demonstrates an erroneous behavior of the system, and a passed test shows behavior consistency.
5. If errors are detected during the testing procedure, conduct regression testing after all errors are repaired to make sure that no new error is introduced.
6. If no error is detected, terminate the testing procedure.

Architecture
ExtendAIST encodes the design space of AIS testing, which involves four key elements: test coverage selection, test data generation, testing and verification approach, and evaluation with common datasets. Figure 3a shows the architecture of ExtendAIST. Since testing techniques reveal the erroneous behaviors in an AI-in-the-loop system and formal verification techniques ensure the properties of the system, we divide the testing and verification approaches into two elements: testing technique and formal verification technique.
Then, we have five elements, referred to as five extension points, to form the design space of AIS testing. Figure 3b indicates the potential solution of testing AI-in-the-loop system, in which the five nodes of the pentagon are the five extension points above, the edge from each node to the center implies the possible techniques used to meet the requirement of testing property, and all the possible techniques are regarded as the sub-extension points (see Section 4). For example, given a testing property, debuggers will select one or more suitable sub-extension points in each extension point to complete this test task. Therefore, the ExtendAIST is a framework providing extension points, sub-extension points, and existing techniques that allow testers to obtain a testing strategy based on existing techniques or extend novel techniques by choosing one or more extension points.

Design
According to the testing workflow defined in Section 2.2, we present the main steps executed in ExtendAIST, as shown in Algorithm 1, which outputs a recommendation of testing strategy for the input testing requirements of the AI-in-the-loop system under test. ExtendAIST first investigates each testing requirement from the input testing requirements (Lines 2-20), such as the testing coverage requirement, testing property requirement, and application area requirement. According to the requirement, it inspects and selects the corresponding extension points available for the testing requirement (Lines 6-18). Then, the algorithm inspects and selects the appropriate sub-extension points for the testing requirement (Lines 11-17). All the existing techniques in the selected sub-extension points can be used (Line 16). It generates the testing strategy for the selected testing requirement (Line 19). Finally, it returns the recommended testing strategy for all of the input testing requirements (Line 21). The time complexity of Algorithm 1 is $O(n^2)$.

Algorithm 1 Execution steps of testing AI-in-the-loop system in ExtendAIST.
Require: Testing requirement set R of AI-in-the-loop system under test
Ensure: Testing strategy
 1: Strategy ← ∅ // Initialize the testing strategy
 2: for i = 1 → n do // There are n requirements in the set R
 3:   ExtenP ← ∅ // Initialize the set of selected extension points for the ith requirement
 4:   SubExtenP ← ∅ // Initialize the set of selected sub-extension points for the ith requirement
 5:   Investigate(R_i) // Investigate the ith testing requirement
 6:   for j = 1 → 5 do // Five extension points
 7:     f_ep ← Inspect(EP_j) // Inspect the jth extension point
 8:     if f_ep == true then
 9:       ExtenP ← ExtenP ∪ EP_j // The jth extension point is appropriate for the ith testing requirement
10:     end if
11:     for k = 1 → m_j do // m_j sub-extension points in the jth extension point
12:       f_sep ← Inspect(SEP_j^k) // Inspect the kth sub-extension point
13:       if f_sep == true then
14:         SubExtenP ← SubExtenP ∪ SEP_j^k // The kth sub-extension point in the jth extension point is appropriate for the ith testing requirement
15:       end if
16:       Tech ← Select(Tech_j^k) // Select the existing techniques in the kth sub-extension point of the jth extension point to test the AIS
17:     end for
18:   end for
19:   Strategy ← Strategy ∪ (ExtenP, SubExtenP, Tech) // Generate the strategy for the ith testing requirement
20: end for
21: return Strategy // Output the recommendation of testing strategy for the input requirement
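A minimal Python sketch of the steps in Algorithm 1 may help readers who want to prototype the selection loop. It is an illustration under stated assumptions: the predicate `is_suitable` is a hypothetical stand-in for Inspect, the nested dictionary layout is our own choice, and techniques are collected only from selected sub-extension points, following the prose description of Line 16.

```python
def recommend_strategy(requirements, extension_points, is_suitable):
    """Recommend a testing strategy for each requirement.

    requirements:     list of requirement names (the set R).
    extension_points: dict {extension point: {sub-extension point: [techniques]}}.
    is_suitable:      hypothetical predicate (requirement, point) -> bool that
                      plays the role of Inspect() for both kinds of points.
    """
    strategy = {}
    for req in requirements:                                # Line 2
        exten_p, sub_exten_p, tech = [], [], []             # Lines 3-4
        for ep, sub_points in extension_points.items():     # Line 6
            if is_suitable(req, ep):                        # Lines 7-10
                exten_p.append(ep)
            for sep, techniques in sub_points.items():      # Line 11
                if is_suitable(req, sep):                   # Lines 12-15
                    sub_exten_p.append(sep)
                    tech.extend(techniques)                 # Line 16
        strategy[req] = (exten_p, sub_exten_p, tech)        # Line 19
    return strategy                                         # Line 21
```

Called with a small subset of the extension points and a lookup-based predicate, the function returns, per requirement, the selected extension points, the selected sub-extension points, and the techniques available in them.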

Illustration
In this section, we take the security and stability intelligent control unit in power system terminal as an illustration example to show the detailed implementation steps of testing AIS following the design space of ExtendAIST and the integration of other available techniques in ExtendAIST.
The security and stability intelligent control unit in power system terminal is an AI-in-the-loop system which aims at predicting exceptions according to the learned control strategies and taking relevant actions to avoid potential failures in power transmission. As shown in Figure 4, the security and stability intelligent control unit in power system terminal obtains the data of the transmission sections between the power supply center and the electric load center, including current, voltage, power, and frequency. This information is then transferred to the control strategies table to predict the decisions according to the learned strategies. For example, if the current value $s_1$ is greater than the threshold $P_1$, then decision $D_1$ is made according to the learned control strategies and transferred to switch off the actuator. Erroneous decisions lead to illegal actions on actuators, which can cause incalculable loss. Thus, it is critical to test the security and stability intelligent control unit in power system terminal to reveal erroneous behaviors and ensure its security and stability.
According to the implementation steps of ExtendAIST, we test the correctness of security and stability intelligent control unit in power system terminal during the process of power transmission. As shown in Figure 3b, the red thick line is one of the solutions of testing the correctness of security and stability intelligent control unit. Firstly, debuggers select the layer level coverage [15] and neuron pair level coverage [16] to measure the numbers of hyperactive neurons in each layer and the influence between layers and ensure the test adequacy of test data generated from metamorphic testing based strategy. Secondly, they take the experienced decisions when exceptions occur as original test data, because no common dataset exists for the security and stability intelligent control unit. Then, they generate new test data with the metamorphic testing based strategy [20,21] on the source test data. Thirdly, once generating the test data, debuggers can test the control strategies learned in AI module by metamorphic testing and determine test results by checking whether metamorphic relation is satisfied. If there are faults in control strategies, e.g., an illegal action of switching on instead of switching off actuators is decided when overloading, debuggers can repair this system by proper repair techniques, such as changing the parameters of training model. Finally, they evaluate the fixed system with the experienced dataset to show the repair quality.  Since the correctness testing of security and stability intelligent control unit in power system terminal is similar to the traditional system testing, and the metamorphic testing can also be applied in traditional system testing, we further test the robustness of security and stability intelligent control unit to show the differences compared to the traditional system testing. 
As shown in Figure 3b, the black thick line is one of the solutions for testing the robustness of the security and stability intelligent control unit. Firstly, debuggers select the neuron level coverage [13][14][15] to measure the number of activated neurons so that the major function behaviors and corner-case function behaviors are covered. The adversarial attack [18] has been shown to be a promising technique for testing the robustness of machine learning systems. Therefore, adversarial attack is chosen to generate adversarial examples and test whether the control strategies in the AI module change their decision under a small perturbation. For example, the decision for the original current value $s_1$ is $D_1$ if $s_1 < P_1$. If the adversarial example $s_1'$ is greater than $s_1$ and smaller than $P_1$, that is, $s_1 < s_1' < P_1$, then the AI module should predict the same decision $D_1$. If a different decision is predicted, then the security and stability intelligent control unit is said to be of low robustness. Finally, debuggers repair the system and evaluate the fixed system with the experienced dataset to show the repair quality.
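The two checks used in this illustration can be sketched as follows. Everything here is a simplified assumption: `make_threshold_controller` is a hypothetical stand-in for the learned control strategies, the metamorphic relation (a larger current must not revert a switch-off decision) is one plausible relation rather than the one used in [20,21], and the robustness probe samples points on a grid between $s_1$ and $P_1$ instead of running a real adversarial attack.

```python
def make_threshold_controller(p1):
    """Hypothetical stand-in for the learned strategy: switch off at or above p1."""
    def decide(current):
        return "keep_on" if current < p1 else "switch_off"
    return decide


def metamorphic_violations(decide, source_inputs, increase=1.0):
    """Assumed relation: if decide(s) is 'switch_off', decide(s + increase) must be too."""
    violations = []
    for s in source_inputs:
        follow_up = s + increase  # follow-up test input derived from the source input
        if decide(s) == "switch_off" and decide(follow_up) != "switch_off":
            violations.append((s, follow_up))
    return violations


def robustness_failures(decide, s1, p1, steps=10):
    """Probe inputs s1' with s1 < s1' < p1 and collect those whose decision differs from decide(s1)."""
    if not s1 < p1:
        raise ValueError("expected s1 < p1")
    baseline = decide(s1)
    failures = []
    for i in range(1, steps + 1):
        s1_prime = s1 + (p1 - s1) * i / (steps + 1)  # strictly between s1 and p1
        if decide(s1_prime) != baseline:
            failures.append(s1_prime)
    return failures
```

An empty list from either function means the corresponding check passed for the probed inputs; a non-empty list gives concrete inputs that debuggers can use when repairing the control strategies.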

Extension Points and Sub-Extension Points
In this section, we explain the five extension points and 19 sub-extension points provided in the ExtendAIST framework, which testers can use to apply the existing techniques directly or to develop new techniques for the corresponding testing requirements. Each extension point is a key step involved in testing AIS and is described together with its existing techniques. The five extension points, 19 sub-extension points, and existing techniques for each point are summarized in Figure 3b and Table 1. Table 2 shows the recommended testing strategy, that is, the extension points and sub-extension points selected for each testing requirement; a marked cell implies that the sub-extension point is available for the related testing requirement. Taking the "Correctness" row as an example, all the coverage metrics at the different granularity levels of extension point EP_TCM can be used to measure the coverage of the network. All five kinds of common evaluation datasets are available for correctness testing. In extension point EP_TDG, only sub-extension point EP_MT can be used to generate test data for testing the correctness of AIS. When selecting the testing technique, sub-extension points EP_MUT, EP_MT, and EP_TP are appropriate for AIS correctness testing.

Table 1. Extension points, sub-extension points, and existing techniques.

Sub-extension point | Description | Existing techniques

Test Coverage Metric (EP_TCM)
Neuron level (EP_NL) | The coverage of activated neurons for the major and corner-case behaviors | Activated neuron coverage [13,14], major and corner-case coverage [15]
Layer level (EP_LL) | The number of the most active k neurons in each layer | Top-k neuron coverage [15]
Neuron pair level (EP_NPL) | The sign (or distance) change of neuron $n_l^i$ (or layer l) affects the sign (or value) of neuron $n_{l+1}^j$ | Sign/Distance-Sign/Value cover [16]

Test Data Generation (EP_TDG)
…

Testing Technique
Adversarial attack (EP_AE) | Reveal the defects in DNN by executing adversarial examples | Targeted, non-targeted, and $L_p$-norm attacks [28,29]
Mutation testing (EP_MUT) | Evaluate the testing adequacy by mutating training data, training program, and trained model | MuNN [30], DeepMutation [31]
Metamorphic testing (EP_MT) | Determine the system correctness by checking whether the metamorphic relation is satisfied | DeepTest [14]
Test prioritization (EP_TP) | Measure the correctness of classification by the purity of test data | DeepGini [32]

Formal Verification Technique (EP_FV)
Satisfiability solver (EP_SS) | Transform the safety verification to a satisfiability solver problem | Safety verification [33]
Non-linear problem (EP_NLP) | Transform the safety verification to a non-linear problem | Piecewise linear network verification [34]
Symbolic interval analysis (EP_SIA) | Transform the safety verification by analyzing symbolic intervals | ReluVal [35]
Reachability analysis (EP_RA) | Transform the safety verification by analyzing the reachability problem | DeepGo [36]
Abstract interpretation (EP_AI) | Transform the safety verification to abstract interpretation | AI2 [37], symbolic propagation [38]

Evaluation Dataset (EP_ED)
Image classification (EP_IC) | Dataset for image recognition | MNIST [39,40], EMNIST [41,42], Fashion-MNIST [43,44], ImageNet [45,46], CIFAR-10/CIFAR-100 [47]
Self driving (EP_SD) | Dataset for training and testing autonomous car system | Udacity Challenge [48], MSCOCO 2015 [49,50]
…

Test coverage metric selection is the first step of test data generation, as test data must satisfy the requirement of the coverage criterion. Many test coverage metrics already exist for traditional software testing adequacy: statement coverage, condition coverage, decision coverage, and Modified Condition/Decision Coverage (MC/DC). However, none of these metrics can be used directly to cover the control and data flow of an AI-in-the-loop system, because of the structure of the learning program for the AI module and its non-deterministic program logic.
Researchers take neuron as the basic coverage unit to conduct both major function behavior coverage and corner-case behavior coverage of neural network [15]. We divide the neural network coverage into three categories from the perspective of granularity: neuron level coverage, layer level coverage, and neuron pair level coverage.
Therefore, ExtendAIST takes test coverage metric as an extension point EP_TCM, and neuron level coverage, layer level coverage, and neuron pair level coverage as sub-extension points named EP_NL, EP_LL, and EP_NPL, respectively. Then, testers can implement new coverage metric based on these extension points or reuse the existing techniques.

Neuron Level (EP_NL)
The neuron level coverage depends on the output value of neuron. There currently exist four kinds of neuron level coverage in this sub-extension point EP_NL: activated neuron coverage, k-multisection neuron coverage, neuron boundary coverage, and strong neuron activation coverage.
• Activated neuron coverage. A neuron is considered activated if its output is greater than the neuron activation threshold and it contributes to the neural network's behaviors, including major function and corner-case behaviors. As shown in Equation (1), the activated neuron coverage [13,14] $Cov_{ANC}$ is the ratio of the number of activated neurons $n_{activated}$ to the number of neurons $N$ in the whole DNN.

$$Cov_{ANC} = \frac{n_{activated}}{N} \tag{1}$$

• k-multisection neuron coverage. Ma et al. [15] proposed using the neuron output value range $[low_n, high_n]$ to distinguish the major function region from the corner-case region. We can then measure the coverage of the major function region $[low_n, high_n]$ by dividing this region into k equal subsections. As shown in Equation (2), the k-multisection neuron coverage for a neuron n, $Cov_{k,n}$, is the ratio of the number of subsections covered by T to the total number of subsections, in which x is a test input in dataset T, $\phi(x, n)$ is the output of neuron n with test input x, and $S_i^n$ is the set of values in the ith subsection. The k-multisection neuron coverage for the neural network N, $Cov_{KMN}$, is based on the k-multisection neuron coverage of all neurons in the network, as defined in Equation (3).

$$Cov_{k,n} = \frac{|\{S_i^n \mid \exists x \in T : \phi(x, n) \in S_i^n\}|}{k} \tag{2}$$

$$Cov_{KMN} = \frac{\sum_{n \in N} |\{S_i^n \mid \exists x \in T : \phi(x, n) \in S_i^n\}|}{k \times |N|} \tag{3}$$
• Neuron boundary coverage. In some cases, the neuron output $\phi(x, n) \notin [low_n, high_n]$. That is, $\phi(x, n)$ may fall in $(-\infty, low_n) \cup (high_n, +\infty)$, which is referred to as the corner-case region of neuron n. Therefore, the neuron boundary coverage $Cov_{NBC}$, the ratio of the number of neurons falling in the corner-case regions to the total number of corner cases as in Equation (4), is used to measure how many corner-case regions are covered by test dataset T. $N_{UPPER} = \{n \in N \mid \exists x \in T : \phi(x, n) \in (high_n, +\infty)\}$ is the set of neurons located in the upper corner-case region, and $N_{LOWER} = \{n \in N \mid \exists x \in T : \phi(x, n) \in (-\infty, low_n)\}$ is the set of neurons located in the lower corner-case region. The total number of corner cases of neuron boundary coverage is equal to $2 \times |N|$, because $(-\infty, low_n)$ and $(high_n, +\infty)$ are mutually exclusive and a neuron output cannot fall in both regions at the same time.

$$Cov_{NBC} = \frac{|N_{UPPER}| + |N_{LOWER}|}{2 \times |N|} \tag{4}$$
• Strong neuron activation coverage. Because hyperactive corner-case neurons affect the training of the DNN significantly, it is essential to measure the coverage of hyperactive corner-case neurons, namely the strong neuron activation coverage. Similar to the neuron boundary coverage, the strong neuron activation coverage $Cov_{SNA}$ is the ratio of the number of neurons falling in the upper corner-case region to the total number of upper corner cases, $|N|$, as in Equation (5).

$$Cov_{SNA} = \frac{|N_{UPPER}|}{|N|} \tag{5}$$
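The four neuron level metrics above can be computed directly from recorded neuron outputs. The sketch below is a simplified illustration in plain Python: `outputs` is assumed to be a list with one row of neuron outputs per test input, and `low` and `high` are assumed to hold the per-neuron bounds $low_n$ and $high_n$. The function names are ours, and the activation test uses only the threshold, ignoring the neuron's contribution to behaviors.

```python
def activated_neuron_coverage(outputs, threshold):
    """Fraction of neurons whose output exceeds the threshold for at least one input."""
    n = len(outputs[0])
    activated = {j for row in outputs for j in range(n) if row[j] > threshold}
    return len(activated) / n


def k_multisection_coverage(outputs, low, high, k):
    """Fraction of the k * |N| subsections of [low_n, high_n] hit by at least one input."""
    n = len(outputs[0])
    covered = set()
    for row in outputs:
        for j in range(n):
            if high[j] > low[j] and low[j] <= row[j] <= high[j]:
                width = (high[j] - low[j]) / k
                # Clamp so that an output equal to high_n lands in the last subsection.
                section = min(int((row[j] - low[j]) / width), k - 1)
                covered.add((j, section))
    return len(covered) / (k * n)


def boundary_and_strong_coverage(outputs, low, high):
    """Return (neuron boundary coverage, strong neuron activation coverage)."""
    n = len(outputs[0])
    upper = {j for row in outputs for j in range(n) if row[j] > high[j]}
    lower = {j for row in outputs for j in range(n) if row[j] < low[j]}
    return (len(upper) + len(lower)) / (2 * n), len(upper) / n
```

In practice, `low` and `high` would be profiled from the training data before the test dataset is evaluated; here they are simply passed in.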

Layer Level (EP_LL)
In the layer level sub-extension point EP_LL, neuron coverage has been further investigated from the perspective of the top hyperactive neurons in each layer [15]. An effective test dataset should cover as many hyperactive neurons in each layer as possible. In this sub-extension point, the top-k neuron coverage has been implemented to cover the top-k neurons in the same layer.
• Top-k neuron coverage. Given a test input x and neurons $n_1$ and $n_2$ in the same layer, if $\phi(x, n_1) > \phi(x, n_2)$, then $n_1$ is more active than $n_2$. The top-k neuron coverage measures how many neurons are among the k most active neurons of their layer for at least one input in the test dataset T. In Equation (6), $top_k(x, i)$ is the set of the first k neurons in layer i, ranked in descending order of their outputs for input x.
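As a rough illustration of the top-k neuron coverage, the following sketch counts the neurons that appear in $top_k(x, i)$ for at least one test input and divides by the total number of neurons. The data layout (one list of layers per test input) and the choice of the total neuron count as denominator are assumptions made for this example.

```python
def top_k_neuron_coverage(layer_outputs, k):
    """Fraction of neurons that are among the k most active of their layer for some input.

    layer_outputs: one entry per test input; each entry is a list of layers,
                   and each layer is a list of neuron outputs.
    """
    covered = set()
    for layers in layer_outputs:
        for i, layer in enumerate(layers):
            # Neuron indices of layer i ranked in descending order of output.
            ranked = sorted(range(len(layer)), key=lambda j: layer[j], reverse=True)
            covered.update((i, j) for j in ranked[:k])
    total = sum(len(layer) for layer in layer_outputs[0])
    return len(covered) / total
```

A test dataset whose inputs always excite the same few neurons yields a low value, which signals that other hyperactive neurons in each layer remain unexplored.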

Neuron Pair Level (EP_NPL)
The neuron pair level coverage focuses on the propagation of changes between neurons in adjacent layers. Inspired by the MC/DC criterion, Sun et al. [16] proposed the neuron pair level coverage by taking $\alpha = (n_l^i, n_{l+1}^j)$ as the neuron pair, in which $n_l^i$ is a neuron regarded as a condition in the lth layer and $n_{l+1}^j$ is a neuron regarded as a decision in the (l + 1)th layer. Hence, neuron pair level coverage is presented to inspect the influence of neurons in the lth layer on neurons in the (l + 1)th layer. The change of neuron $n_l^k$ when given two test inputs $x_1$ and $x_2$ could be a sign change (denoted as $sc(n_l^k, x_1, x_2)$), a value change (denoted as $vc(g, n_l^k, x_1, x_2)$, where g is a value change function), or a distance change (denoted as $dc(h, l, x_1, x_2)$, where h is a distance change function) for the neurons in the lth layer. Therefore, given the neuron pair $\alpha = (n_l^i, n_{l+1}^j)$ and two test inputs $x_1$ and $x_2$, there exist four implemented techniques in this sub-extension point of neuron pair level coverage.

• Sign-Sign cover. A sign change of the condition neuron $n_i^l$, while the signs of the other neurons in the $l$th layer do not change, affects the sign of the decision neuron $n_j^{l+1}$ in the next layer. That is, if $sc(n_i^l, x_1, x_2) \wedge \forall k \neq i: \neg sc(n_k^l, x_1, x_2) \Rightarrow sc(n_j^{l+1}, x_1, x_2)$, we say that $(n_i^l, n_j^{l+1})$ is sign-sign covered by $x_1$ and $x_2$, which is denoted as $cov_{SS}(\alpha, x_1, x_2)$.
• Distance-Sign cover. A small distance change of the neurons in the $l$th layer can cause a sign change of the decision neuron $n_j^{l+1}$ in the next layer. Namely, if $dc(h, l, x_1, x_2) \Rightarrow sc(n_j^{l+1}, x_1, x_2)$, we say that $(n_i^l, n_j^{l+1})$ is distance-sign covered by $x_1$ and $x_2$, denoted as $cov_{DS}^{h}(\alpha, x_1, x_2)$.
• Sign-Value cover. Similar to sign-sign cover, a sign change of the condition neuron $n_i^l$, while the signs of the other neurons in the $l$th layer do not change, affects the value of the decision neuron $n_j^{l+1}$ in the next layer. That is, if $sc(n_i^l, x_1, x_2) \wedge \forall k \neq i: \neg sc(n_k^l, x_1, x_2) \Rightarrow vc(g, n_j^{l+1}, x_1, x_2)$, we say that $(n_i^l, n_j^{l+1})$ is sign-value covered by $x_1$ and $x_2$, denoted as $cov_{SV}^{g}(\alpha, x_1, x_2)$.
• Distance-Value cover. Similar to distance-sign cover, a small distance change of the neurons in the $l$th layer leads to a value change of the decision neuron $n_j^{l+1}$ in the next layer. Namely, if $dc(h, l, x_1, x_2) \Rightarrow vc(g, n_j^{l+1}, x_1, x_2)$, then $(n_i^l, n_j^{l+1})$ is distance-value covered by $x_1$ and $x_2$, denoted as $cov_{DV}^{h,g}(\alpha, x_1, x_2)$.
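As an illustration of the sign-sign condition, the sketch below checks whether a neuron pair is covered by two inputs. It assumes that the activations of both layers are already available as arrays and reads the condition as requiring both the premise (only neuron $i$ changes sign in layer $l$) and the conclusion (neuron $j$ changes sign in layer $l+1$) to hold; the helper names and values are hypothetical.

```python
import numpy as np

def sign_change(a, b):
    """Element-wise test of whether the sign differs between two activations."""
    return np.sign(a) != np.sign(b)

def sign_sign_covered(layer_x1, layer_x2, next_x1, next_x2, i, j):
    """Check the sign-sign condition for the pair (neuron i in layer l, neuron j in layer l+1)."""
    changed = sign_change(np.asarray(layer_x1), np.asarray(layer_x2))
    # Neuron i must change sign while every other neuron in layer l keeps its sign.
    only_i_changed = bool(changed[i]) and not np.any(np.delete(changed, i))
    return bool(only_i_changed and sign_change(next_x1[j], next_x2[j]))

# Assumed activations of layer l and layer l+1 for two test inputs x1 and x2.
layer_x1 = np.array([1.0, -2.0, 0.5])
layer_x2 = np.array([-1.0, -3.0, 0.7])
next_x1 = np.array([0.3, -0.1])
next_x2 = np.array([-0.2, -0.4])

covered_j0 = sign_sign_covered(layer_x1, layer_x2, next_x1, next_x2, i=0, j=0)
covered_j1 = sign_sign_covered(layer_x1, layer_x2, next_x1, next_x2, i=0, j=1)
```

Here only the first neuron of layer $l$ flips sign, and only the first neuron of layer $l+1$ follows, so the pair with `j=0` is covered while the pair with `j=1` is not.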

Description
During training procedure, some common datasets are used as the training sets to compare the training performance of different training models. During testing procedure, the testing data included in the common datasets are used to measure the training effectiveness and correctness of AI-in-the-loop system. Thus, both training and testing procedures are data-driven. Since the AIS may output different labels for inputs with high similarity, generating test data that can cover not only major function behaviors but also corner-case behaviors becomes an indispensable activity for AI-in-the-loop system testing.
Thus, test data generation is regarded as an extension point, namely EP_TDG, in ExtendAIST. Although various generation strategies and their variants have been investigated over the past decades, four kinds of strategies are mainly used to generate test data: adversarial examples, generative adversarial examples, metamorphic testing based strategy, and concolic testing based strategy. These four commonly used strategies are regarded as the sub-extension points EP_AE, EP_GAE, EP_MT, and EP_CTS, respectively. All of these points allow designing a test data generation strategy for ExtendAIST that produces relevant test samples to train, test, and evaluate the AI-in-the-loop system.

Adversarial Examples (EP_AE)
Adversarial examples are test data generated by applying small, even imperceptible, perturbations to the original test inputs, which cause the network under test to misclassify them [18,28,74-76]. A neural network is vulnerable to adversarial perturbation, and adversarial examples are regarded as an effective means to attack a network. Therefore, debuggers can detect erroneous behaviors of an AI-in-the-loop system with adversarial examples and enhance its robustness by re-training the system against them. The following techniques are implemented based on the sub-extension point EP_AE.
• L-BFGS. The box-constrained L-BFGS [18] generates test data by solving the following minimization problem under the condition $f(x') = t$. Equation (7) minimizes both the distance between $x$ and $x'$, $L_2 = \|x - x'\|_2^2$, and the loss function $loss_{f,t}(x')$ for the generated test data $x'$ to be labeled as $t$.

$$\text{minimize} \quad c \cdot \|x - x'\|_2^2 + loss_{f,t}(x') \qquad (7)$$

• FGSM. To generate adversarial examples quickly, Goodfellow et al. [23] presented the Fast Gradient Sign Method (FGSM) based on the $L_\infty$ norm. As shown in Equation (8), the adversarial example $x'$ is obtained from a single step of size $\epsilon$ along the sign of the gradient of the loss function $loss_{f,t}(x)$, which determines the direction that increases the probability of the targeted class $t$.

$$x' = x - \epsilon \cdot \mathrm{sign}(\nabla loss_{f,t}(x)) \qquad (8)$$

• BIM. To improve the accuracy of adversarial examples, Kurakin et al. [24] proposed the Basic Iterative Method (BIM), which replaces the single step $\epsilon$ in FGSM with multiple smaller steps $\alpha$ and minimizes the loss function toward the targeted label $t$. Equation (9) indicates that the adversarial example generated in the $i$th step depends iteratively on that from the previous step.
• ILCM. Kurakin [25] further proposed the Iterative Least-likely Class Method (ILCM), which takes as the target the least likely label, which is the most difficult to attack, rather than the most probable label as in BIM.
• JSMA. Papernot et al. [26] proposed the Jacobian-based Saliency Map Attack (JSMA) to obtain a saliency map, which indicates the influence of each pixel on the classification of the target label. Therefore, an adversarial example can be generated with a small perturbation by changing the pixels with the greatest influence.
• One pixel attack. The one pixel attack [77] is an extreme case that generates an adversarial image by changing only one pixel of the seed image.
• C&W attack. To obtain an adversarial image with less distortion of the seed image, Carlini and Wagner [78] proposed applying three distance metrics, the $L_0$, $L_2$, and $L_\infty$ norms, to mount a targeted attack on a neural network.

• Universal perturbation. The universal perturbation [79][80][81] is proposed to cause misclassification of the entire test dataset instead of one specific test case. In other words, the adversarial examples are generated from the same perturbation and recognized as the same target label.
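The single-step and iterative gradient-sign attacks described above can be sketched on a toy model. The example below is only an illustration under stated assumptions: the network is replaced by a hypothetical logistic-regression classifier with weights `w` and bias `b`, the loss is the cross-entropy toward the target label, and the iterative variant clips each iterate to an $L_\infty$ ball of radius `eps` around the original input. Clipping to a valid pixel range is omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad(x, w, b, t):
    """Gradient with respect to x of the cross-entropy loss toward target label t."""
    return (sigmoid(w @ x + b) - t) * w

def fgsm_targeted(x, w, b, t, eps):
    """Single step of size eps against the gradient sign (targeted form)."""
    return x - eps * np.sign(loss_grad(x, w, b, t))

def bim_targeted(x, w, b, t, eps, alpha, steps):
    """Several smaller steps of size alpha, each kept within eps of the original x."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv - alpha * np.sign(loss_grad(x_adv, w, b, t))
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

def predict(x, w, b):
    return int(sigmoid(w @ x + b) >= 0.5)

# Hypothetical model and input; the input is originally classified as class 1.
w = np.array([2.0, -1.0, 0.5])
b = 0.0
x = np.array([0.3, 0.1, 0.2])
target = 0

x_fgsm = fgsm_targeted(x, w, b, target, eps=0.3)
x_bim = bim_targeted(x, w, b, target, eps=0.3, alpha=0.1, steps=5)
```

On this linear toy model both attacks reach the target class within the same $L_\infty$ budget; on a real DNN the iterative variant re-evaluates the gradient at each step, which is why it usually finds adversarial examples with smaller perturbations.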

Generative Adversarial Examples (EP_GAE)
The sub-extension point EP_GAE allows researchers to generate test data with generative adversarial nets, which aim to produce data that cannot be distinguished from the training data.
• Generative adversarial nets. The Generative Adversarial Nets (GAN) [19], consisting of a generative model G and a discriminative model D, are proposed to generate samples by learning while trying to fool the discriminative model. The purpose of the generative model is to produce samples, by learning the training data distribution, that cause the discriminative model to make a mistake. The discriminative model must determine whether an input sample comes from the training data or from the generative model G.
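For reference, the adversarial game between the two models is commonly written as the following minimax objective. The notation is an assumption taken from the standard GAN formulation rather than from this paper: $p_{data}$ is the training data distribution, $p_z$ is the noise prior fed to the generator, and $D(\cdot)$ outputs the probability that a sample comes from the training data.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{data}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

The discriminative model maximizes $V$ by assigning high probability to training samples and low probability to generated ones, while the generative model minimizes $V$ by producing samples that D accepts as training data.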

Metamorphic Testing Based Strategy (EP_MT)
Metamorphic testing (MT) [82][83][84] is an approach to alleviate the test oracle problem by checking whether a certain property of the program is satisfied instead of comparing the expected output with the actual output, since some expected outputs are expensive to compute. Such a property is referred to as a metamorphic relation (MR) of the program function. MT is also used as a strategy of test case generation since it generates follow-up test cases according to source test cases and the related MRs. Therefore, the metamorphic testing based strategy is regarded as a sub-extension point in ExtendAIST, namely EP_MT, to generate test data for AIS. The existing applications of MT on AIS are as follows.

• Classifier testing. MT has already been applied to generate follow-up test cases for machine learning classifier testing by employing the MRs of permutation, addition, duplication, and removal of attributes, samples, and labels [20,21].
• DeepTest. To test a DNN-based self-driving system, DeepTest [14] takes as the metamorphic relation the property that the output steering angle should remain unchanged under different real-world driving environments, generates new test cases accordingly, and determines whether the self-driving system satisfies the MR.
• Effect of noise outside ROI. Zhou et al. [22] investigated the effect of noise outside of the drivable area on the obstacle perception inside of the region of interest (ROI). The MR of the obstacle perception system is that noise in the region outside the ROI should not cause obstacles inside the ROI to become undetectable.
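The permutation MR used for classifier testing can be sketched as follows. The example assumes a hypothetical 1-nearest-neighbor classifier and a toy training set without distance ties; it is not the setup of [20,21]. The source test case uses the original training set, the follow-up test case uses a permuted copy, and the MR requires both to give the same predictions.

```python
import numpy as np

def nearest_neighbor_predict(train_x, train_y, test_x):
    """Label each test sample with the label of its closest training sample."""
    dists = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    return train_y[np.argmin(dists, axis=1)]

def permutation_mr_holds(train_x, train_y, test_x, seed=0):
    """MR: permuting the order of the training samples must not change the predictions."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(train_x))
    source_out = nearest_neighbor_predict(train_x, train_y, test_x)
    follow_up_out = nearest_neighbor_predict(train_x[perm], train_y[perm], test_x)
    return bool(np.array_equal(source_out, follow_up_out))

# Toy data (assumed values) without distance ties.
train_x = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
train_y = np.array([0, 0, 1, 1])
test_x = np.array([[0.2, 0.1], [4.6, 4.9]])

mr_ok = permutation_mr_holds(train_x, train_y, test_x)
```

No expected label is needed for `test_x`: a violation of the relation alone is enough to signal a defect in the classifier implementation.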

Concolic Testing Based Strategy (EP_CTS)
Concolic testing is an integration of concrete execution and symbolic execution [85]. The program is first executed with some concrete input values; the symbolic constraints collected for each conditional statement are then solved to cover other execution paths that are not covered by the concrete inputs. New test case variants of the concrete inputs are generated during the constraint solving procedure. Thus, the concolic testing based strategy is regarded as one of the sub-extension points, namely EP_CTS, to generate test data by integrating concrete execution and symbolic execution.

• DeepConcolic. DeepConcolic [27] leverages concolic testing to generate adversarial examples with high coverage. Given a DNN under test, a set of coverage requirements $R$, an unsatisfied requirement $r$, and the initial test suite $T$, in the phase of concrete execution, a test input $t \in T$ is identified as a candidate for satisfying requirement $r$. Then, in the phase of symbolic execution, a new test input $t'$ close to the test input $t$ from $T$ is generated to satisfy requirement $r$ and the distance constraint $\|t' - t\|_\infty \le d$. Then $t'$ is added to the test suite $T$. New inputs $t'$ are generated repeatedly until all requirements are satisfied or no more requirements in $R$ can be satisfied.

Description
After generating test data, ExtendAIST selects an effective testing technique to detect erroneous behaviors in the AI-in-the-loop system. The application of machine learning to embedded systems leads to challenges in testing the safety, correctness, and robustness of AI-in-the-loop systems [2-6,10-12,33,34]. Researchers have investigated various testing techniques to test the properties of AI-in-the-loop systems, including adversarial attack, mutation testing, metamorphic testing, and test prioritization techniques.
ExtendAIST provides the testing technique as an extension point EP_TT, and the four techniques mentioned above as sub-extension points named EP_AE, EP_MUT, EP_MT (same as the point EP_MT in Section 4.2.4), and EP_TP, respectively. Consequently, researchers can apply the existing techniques to reveal erroneous behaviors in the system or extend new testing techniques.

Adversarial Attack (EP_AE)
A neural network is vulnerable to imperceptible perturbations [18]. In other words, the adversarial examples are misclassified with high confidence by a trained neural network. Therefore, adversarial attack can be regarded as a powerful method to detect defects in AI-in-the-loop systems [86]. The existing techniques of adversarial attack are described in Section 4.2.2, thus we introduce the categories of adversarial attack with regard to the output label and distance metric in this section.
• Targeted and Non-Targeted Adversarial Attack. According to the classification result of an individual adversarial example, adversarial attack is divided into two categories: targeted and non-targeted adversarial attack [28,29]. A targeted attack indicates that the input adversarial example is misclassified as a specific label, while a non-targeted attack indicates that the input adversarial example is assigned any label not equal to the correct one.
• $L_0$-norm attack. According to the definition of the $L_p$-norm in Equation (10), $L_0$ (Equation (11)) indicates the number of changed features in the original input. As a result, the $L_0$-norm attack is used to minimize the number of perturbed features and generate new feature data against the target label.
• $L_2$-norm attack. $L_2$ (Equation (12)) equals the Euclidean distance between the original data and the adversarial example. A lower $L_2$ indicates a smaller perturbation of the individual feature data and a higher similarity between the original data and the adversarial example. Hence, $L_2$ is also used to solve the problem of overfitting in adversarial examples.
• $L_\infty$-norm attack. $L_\infty$ (Equation (13)) represents the maximum difference among all feature data, which is utilized to control the maximum perturbation between the original data and the adversarial example.
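The three distance metrics can be computed directly from the difference between the original input and the adversarial example. The snippet below is a small illustration with assumed feature values; `perturbation_norms` is a hypothetical helper name.

```python
import numpy as np

def perturbation_norms(x, x_adv):
    """Return the L0, L2, and L-infinity sizes of the perturbation x_adv - x."""
    delta = (np.asarray(x_adv, dtype=float) - np.asarray(x, dtype=float)).ravel()
    return {
        "L0": int(np.count_nonzero(delta)),                 # number of changed features
        "L2": float(np.linalg.norm(delta, ord=2)),          # Euclidean distance
        "Linf": float(np.linalg.norm(delta, ord=np.inf)),   # largest single change
    }

x = np.array([0.2, 0.5, 0.9, 0.1])
x_adv = np.array([0.2, 0.8, 0.5, 0.1])
norms = perturbation_norms(x, x_adv)
```

In this example two features are changed, by 0.3 and 0.4, so the $L_0$ size is 2, the $L_2$ size is about 0.5, and the $L_\infty$ size is about 0.4.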

Mutation Testing (EP_MUT)
Mutation testing is a method to measure the test adequacy of a test suite by injecting faults into the original program under test to create faulty versions, namely mutants [87]. The ratio of the number of mutants detected to the total number of mutants is referred to as the mutation score. The higher the mutation score is, the stronger the fault detectability of the test suite is. The existing techniques based on this sub-extension point EP_MUT are as follows.

• MuNN. MuNN [30] proposes five kinds of mutation operators according to the structure of a neural network, including deleting neurons in the input layer and hidden layers, and changing the bias, weights, and activation functions. A mutant neural network is said to be killed once its output is distinct from the output of the original network. The more mutant networks are killed, the more powerful the test suite is for DNN testing.
• DeepMutation. Since a trained model is obtained from the training program and training data, Ma et al. [31] proposed a further study on mutation testing of AI modules from two perspectives: (1) generating mutant trained models based on source-level mutation operators applied to the training data or training program; and (2) generating mutant trained models directly based on model-level mutation operators applied to the original trained model, similar to MuNN.
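The mutation score of a test suite on a network can be sketched with a toy model. The example below is only an illustration: it assumes a hypothetical two-layer ReLU network stored as a pair of weight matrices, a neuron-deletion operator that zeroes the incoming weights of a hidden neuron, and the kill criterion that some test input receives a different predicted label. It does not reproduce the operators of MuNN [30] or DeepMutation [31] in detail.

```python
import numpy as np

def predict(weights, x):
    """Two-layer ReLU network; returns the predicted class for each row of x."""
    w1, w2 = weights
    hidden = np.maximum(0.0, x @ w1)
    return np.argmax(hidden @ w2, axis=1)

def delete_hidden_neuron(weights, idx):
    """Model-level mutant: zero the incoming weights so the neuron never fires."""
    w1, w2 = weights
    w1_mut = w1.copy()
    w1_mut[:, idx] = 0.0
    return (w1_mut, w2)

def mutation_score(weights, mutants, tests):
    """Fraction of mutants whose prediction differs from the original on some test."""
    original = predict(weights, tests)
    killed = sum(int(np.any(predict(m, tests) != original)) for m in mutants)
    return killed / len(mutants)

# Hypothetical original network and three mutants.
original_net = (np.eye(2), np.eye(2))
mutants = [
    delete_hidden_neuron(original_net, 0),
    delete_hidden_neuron(original_net, 1),
    (original_net[0], 2.0 * original_net[1]),  # scaling keeps every prediction unchanged
]

full_suite = np.array([[1.0, 0.2], [0.1, 0.9]])
small_suite = full_suite[:1]
score_full = mutation_score(original_net, mutants, full_suite)
score_small = mutation_score(original_net, mutants, small_suite)
```

The full suite kills both neuron-deletion mutants (score 2/3), while the single-input suite kills only one (score 1/3); the scaled mutant behaves identically to the original and cannot be killed by any input.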

Metamorphic Testing (EP_MT)
Metamorphic testing determines the correctness of the software under test by checking whether the related metamorphic relations are satisfied. Thus, identifying diverse metamorphic relations is a fundamental activity for evaluating their fault detection capabilities and the quality of the software. Since Section 4.2.4 has introduced the existing techniques in point EP_MT, we briefly introduce DeepTest [14] to show the mechanism of metamorphic testing on AI-in-the-loop systems.

• DeepTest. Nine transformation based MRs have been proposed in DeepTest, namely changing brightness, contrast, translation, scaling, horizontal shearing, rotation, blurring, fog, and rain on the images, to test the robustness of autonomous vehicles. Images from the camera are taken as the source images, and the follow-up images are created by applying one or more transformation MRs to the source images. The DNN under test takes the source and follow-up images as inputs and outputs the source and follow-up steering angles under different real-world weather conditions, namely $\theta_s = \{\theta_s^1, \theta_s^2, \cdots, \theta_s^n\}$ and $\theta_f = \{\theta_f^1, \theta_f^2, \cdots, \theta_f^n\}$. Strictly speaking, the steering angles should remain unchanged under these transformations, that is, $\theta_s = \theta_f$. However, a small variation of steering angles, $\theta = \{\theta_1, \theta_2, \cdots, \theta_n\}$, in a real-world driving environment would not affect the driving behaviors. Thus, variations within the error ranges are allowed, as shown in Equation (14).
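A relaxed check of this kind can be sketched as follows. The code assumes a simple absolute tolerance `eps` on the difference between source and follow-up steering angles; this tolerance is a hypothetical stand-in for the error range of Equation (14), which is not reproduced here, and the angle values are made up for illustration.

```python
import numpy as np

def mr_violations(theta_source, theta_follow_up, eps):
    """Indices of image pairs whose steering angles differ by more than eps."""
    diff = np.abs(np.asarray(theta_follow_up, dtype=float)
                  - np.asarray(theta_source, dtype=float))
    return np.flatnonzero(diff > eps)

# Assumed steering angles (in radians) for four source images and their
# transformed follow-up images.
theta_s = np.array([0.10, -0.05, 0.30, 0.00])
theta_f = np.array([0.12, -0.04, 0.55, 0.01])

violations = mr_violations(theta_s, theta_f, eps=0.05)
```

Only the third pair deviates by more than the tolerance, so it is reported as a potential erroneous behavior, while the small variations of the other pairs are accepted.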

Test Prioritization (EP_TP)
The general procedure of testing a DNN-based system is to execute the system against a test dataset with manual labels and inspect whether the learned label and the manual label are identical; labeling each test case before testing is a labor-intensive activity. The sub-extension point of test prioritization, EP_TP, converts the problem of misclassification into the problem of impurity of the test dataset, so that the likelihood of a test case being misclassified can be estimated from its impurity without the need for labels.
• DeepGini [32]. Take binary classification as an example. Given a test input $t$ and the output feature vector $B = \{c_1, c_2\}$, execute the DNN to compute the probability of each feature for $t$. If the probability of being classified as $c_1$ is $P_{c_1} = 100\%$ and that of $c_2$ is $P_{c_2} = 0$, then the feature vector $B$ has the highest purity and $t$ is more likely to be classified correctly. In contrast, if $P_{c_1} = 50\%$ and $P_{c_2} = 50\%$, then $B$ has the lowest purity, and $t$ is more likely to be misclassified. $\xi(t)$ is defined as the metric of impurity, which reflects the likelihood of $t$ being misclassified. As shown in Equation (15), $p_{t,i}$ is the probability of test case $t$ being classified as class $i$. A lower sum of the squared probabilities $p_{t,i}^2$ indicates a higher impurity and a higher likelihood of misclassifying $t$.
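The prioritization idea can be illustrated with a short sketch. It assumes that the impurity is the Gini-style score $\xi(t) = 1 - \sum_i p_{t,i}^2$ computed from the softmax output of the DNN, and that test cases are executed in descending order of this score; the function names and probability values are hypothetical.

```python
import numpy as np

def gini_impurity(probs):
    """Impurity per test case: one minus the sum of squared class probabilities."""
    probs = np.asarray(probs, dtype=float)
    return 1.0 - np.sum(probs ** 2, axis=1)

def prioritize(probs):
    """Indices of test cases ordered from most to least likely to be misclassified."""
    return np.argsort(-gini_impurity(probs), kind="stable")

# Assumed output probabilities of a binary classifier for three unlabeled tests.
probs = np.array([[1.0, 0.0],
                  [0.5, 0.5],
                  [0.8, 0.2]])
scores = gini_impurity(probs)
order = prioritize(probs)
```

The test with probabilities (0.5, 0.5) has the highest impurity and is ranked first, and the test with probabilities (1.0, 0.0) has zero impurity and is ranked last, so labeling effort can be spent on the most suspicious inputs first.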

Description
Since formal verification is also an effective way to assure the properties of an AI-in-the-loop system, formal verification is regarded as an extension point EP_FV in ExtendAIST. There already exist various verification techniques for testing the robustness, testing adequacy, and decreasing test cost, in which the formal verification is usually transformed into one of the following problems: satisfiability solver [33,88-91], non-linear problem [34,92-94], symbolic interval analysis [35], reachability analysis [36], and abstract interpretation [37,95]. Therefore, ExtendAIST provides these five techniques as sub-extension points named EP_SS, EP_NLP, EP_SIA, EP_RA, and EP_AI, respectively. Based on the above, researchers can develop or extend new formal verification techniques and apply the existing techniques to assure the robustness and safety of AI-in-the-loop systems.

Satisfiability Solver (EP_SS)
The safety verification of an image classifier can be reduced to a satisfiability problem on its correct behavior, whose solution searches for adversarial examples if misclassifications exist [33].

• Safety verification. Given a neural network $N$, an input image $x$, and a region $\eta$ around $x$ with the same class, $N$ is said to be safe for input $x$ and $\eta$ if the classification of the images in $\eta$ is invariant with respect to $x$, that is, $N, \eta \models x$. In more depth, when the input image $x$ is modified with a family of manipulations $\Delta$, $N$ is said to be safe for input $x$, $\eta$, and manipulations $\Delta$ if the classification of region $\eta$ remains invariant with respect to $x$ under the manipulations $\Delta$, namely $N, \eta, \Delta \models x$. If $N, \eta, \Delta \not\models x$, then the image classifier is vulnerable to these manipulations or adversarial perturbations.

Non-linear Problem (EP_NLP)
The formal verification of safety can also be specified as proving that no counterexample exists for the set of variable constraints, so that the property always holds [34].

• Piecewise linear network verification. Since some activation functions are non-linear, the formal verification is transformed into a Mixed Integer Program (MIP) with binary variables $\delta_a \in \{0, 1\}$. The binary variable $\delta_a$ indicates the phase of the ReLU activation function: if $\delta_a = 0$, the activation function is blocked and the output of the related neuron is 0; otherwise, the activation function is passed and the output of the related neuron equals its input value.
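The role of the binary phase variable can be illustrated with one commonly used big-M style encoding of a single ReLU neuron. This encoding is an assumption for illustration and is not necessarily the one used in [34]: given pre-activation bounds `lower <= x <= upper` with `lower < 0 < upper`, the output `y` and phase `delta` must satisfy `y >= 0`, `y >= x`, `y <= upper * delta`, and `y <= x - lower * (1 - delta)`. The hypothetical helper below only checks whether a candidate assignment satisfies these constraints; it does not solve the MIP.

```python
def relu_mip_feasible(x, y, delta, lower, upper, tol=1e-9):
    """Check a candidate (x, y, delta) against the assumed big-M ReLU constraints."""
    return (
        delta in (0, 1)
        and y >= -tol                                # y >= 0
        and y >= x - tol                             # y >= x
        and y <= upper * delta + tol                 # delta = 0 forces y = 0
        and y <= x - lower * (1 - delta) + tol       # delta = 1 forces y = x
    )

lower, upper = -1.0, 2.0

# Active phase: positive input, output equals the input.
active_ok = relu_mip_feasible(x=0.7, y=0.7, delta=1, lower=lower, upper=upper)
# Blocked phase: negative input, output is zero.
blocked_ok = relu_mip_feasible(x=-0.4, y=0.0, delta=0, lower=lower, upper=upper)
# Inconsistent assignments are rejected.
wrong_phase = relu_mip_feasible(x=0.7, y=0.0, delta=0, lower=lower, upper=upper)
wrong_value = relu_mip_feasible(x=-0.4, y=-0.4, delta=1, lower=lower, upper=upper)
```

With `delta = 0` the constraints collapse to `y = 0` and `x <= 0`, and with `delta = 1` they collapse to `y = x` and `x >= 0`, which matches the blocked and passed phases described above.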

Symbolic Interval Analysis (EP_SIA)
Symbolic interval analysis [35] utilizes the symbolic interval arithmetic to obtain the accurate range of DNN's output according to the ranges of input variables.

• ReluVal. Given an input range $X$, subintervals of $X$, and a security property $P$, the DNN is secure if no value in range $X$ and its subintervals violates property $P$, that is, every value from range $X$ satisfies $P$. Otherwise, the DNN is insecure if there exists an adversarial example in $X$ violating $P$, that is, at least one subinterval contains an adversarial example that makes property $P$ unsatisfied.
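The idea of bounding the output over an input range and refining by splitting the range into subintervals can be sketched as follows. This is a simplified stand-in under stated assumptions: it uses plain (non-symbolic) interval arithmetic with bisection rather than the symbolic interval analysis of [35], the network is a hypothetical two-layer ReLU network that computes $|x|$, and the property is an upper bound on the first output.

```python
import numpy as np

def interval_affine(lo, hi, w, b):
    """Propagate the box [lo, hi] through y = w @ x + b."""
    w_pos, w_neg = np.maximum(w, 0.0), np.minimum(w, 0.0)
    return w_pos @ lo + w_neg @ hi + b, w_pos @ hi + w_neg @ lo + b

def output_bounds(lo, hi, layers):
    """Interval bounds of the network output; ReLU follows every layer but the last."""
    for idx, (w, b) in enumerate(layers):
        lo, hi = interval_affine(lo, hi, w, b)
        if idx < len(layers) - 1:
            lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)
    return lo, hi

def verify_upper_bound(lo, hi, layers, threshold, depth=8):
    """True if output[0] <= threshold is proven on the whole box, False if undecided."""
    _, out_hi = output_bounds(lo, hi, layers)
    if out_hi[0] <= threshold:
        return True
    if depth == 0:
        return False
    # Bisect the widest input dimension and verify both subintervals.
    axis = int(np.argmax(hi - lo))
    mid = 0.5 * (lo[axis] + hi[axis])
    left_hi, right_lo = hi.copy(), lo.copy()
    left_hi[axis], right_lo[axis] = mid, mid
    return (verify_upper_bound(lo, left_hi, layers, threshold, depth - 1)
            and verify_upper_bound(right_lo, hi, layers, threshold, depth - 1))

# Hypothetical network: hidden = relu([x, -x]), output = hidden[0] + hidden[1] = |x|.
layers = [(np.array([[1.0], [-1.0]]), np.zeros(2)),
          (np.array([[1.0, 1.0]]), np.zeros(1))]
lo, hi = np.array([-1.0]), np.array([1.0])

coarse_lo, coarse_hi = output_bounds(lo, hi, layers)
proved = verify_upper_bound(lo, hi, layers, threshold=1.5)
undecided = verify_upper_bound(lo, hi, layers, threshold=0.5)
```

On the whole range the naive bound of the output is [0, 2], which is too loose to prove the property with threshold 1.5, but after one bisection both subintervals yield an upper bound of 1 and the property is proven. For threshold 0.5 the property is actually violated (for example at $x = 1$), so no amount of splitting can prove it and the procedure returns False.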

Reachability Analysis (EP_RA)
To eliminate the limitation of network scale for formal verification, safety verification can also be transformed into reachability analysis [36].

• DeepGo. If every value in the output range $[l, u]$, delimited by the lower bound $l$ and the upper bound $u$, corresponds to an input in the input subspace $X \subseteq [0, 1]^n$, then the network $f$ is reachable and the reachability diameter is $D(X; f) = u(X) - l(X)$. The network $f$ is said to be safe for the input $x \in X$ if all inputs in $X$ have the same label as $x$, as shown in Equation (16), where $f_j$ denotes the $j$th output of $f$.

$$\forall x' \in X : \arg\max_j f_j(x') = \arg\max_j f_j(x) \qquad (16)$$

Abstract Interpretation (EP_AI)
Since the scale of input data and neural network are tremendous, it is infeasible to verify whether individual input satisfies the safety properties precisely. To overcome this obstacle, an abstract domain is used to approximate the concrete domain and verify the safety properties against abstract domain directly, referred to as the abstract interpretation theory [96,97]. The following existing techniques in point EP_AI show the application of abstract interpretation on AIS testing.
• AI². AI² [37] proposes to employ the zonotope domain to represent the abstract elements in a neural network and output the abstract elements in each layer. Finally, the safety of the network is determined by verifying whether the label of the abstract output in the output layer is consistent with that of the concrete output.
• Symbolic propagation. To improve the precision and efficiency of DNN safety verification, Yang et al. [38] proposed a symbolic propagation method based on abstract interpretation, which represents the values of neurons symbolically and propagates them forward from the input layer to the output layer. A more precise range of the output layer is computed by the interval abstract domain based on the symbolic representation of the output layer.

Description
Since the machine learning based AI-in-the-loop system is a data-driven system, it is essential to design datasets adaptive for diverse application areas to conduct the training, testing, and evaluation procedures. There mainly exist five kinds of datasets for different purposes: image classification, self driving, natural language processing, speech recognition, and Android malware detection.
Therefore, we take the common dataset as the super extension point EP_CD and the five kinds of datasets as the sub-extension points named EP_IC, EP_SD, EP_NALP, EP_SR, and EP_AMD. Based on the above, researchers can design more datasets for diverse scenarios and different learning, testing, and evaluation tasks. The existing techniques and common datasets in each sub-extension point are displayed as follows.

Image Classification (EP_IC)
The datasets for image classification are designed for object detection and classification in many application areas. We take the dataset of image classification as a sub-extension point, EP_IC, to identify more samples for training and testing models of image classification.

• MNIST. MNIST [39,40] is a dataset for recognizing handwritten digits (0-9) including 70,000 images originating from the NIST database [98]. The MNIST dataset is composed of a training dataset with 60,000 images and a test dataset with 10,000 images, each of which contains half clear digits written by government staff and half blurred digits written by students.
• EMNIST. EMNIST [41,42] is an extension of MNIST for identifying handwritten digits (0-9) and letters (a-z and A-Z). Therefore, there are 62 classes in total in EMNIST, including 10 classes of digits and 52 classes of letters. However, since some uppercase and lowercase letters cannot be distinguished easily (e.g., C and c, or K and k), the letters are merged into 37 classes.
• Fashion-MNIST. Fashion-MNIST [43,44] is a dataset with exactly the same format and size as MNIST for identifying 10 classes of fashion products: T-shirt, trouser, pullover, dress, coat, sandals, shirt, sneaker, bag, and ankle boots.
• ImageNet. ImageNet [45,46] is an image dataset describing the synsets in the WordNet hierarchy with about 1000 images per synset on average.
• CIFAR-10/CIFAR-100. CIFAR-10 and CIFAR-100 [47] are labeled image datasets, each consisting of 60,000 32 × 32 color images, 50,000 of which are training images and 10,000 of which are test images. CIFAR-10 is divided into 10 classes, with 5000 training images and 1000 test images for each class. CIFAR-100 is divided into 100 fine classes, and each class contains 500 training images and 100 test images.

Self Driving (EP_SD)
Self driving is a typical autonomous application for inspecting the performance of AI-in-the-loop systems. Hence, the dataset for self driving is also regarded as a sub-extension point, EP_SD, for tasks such as obstacle detection and tracking and traffic signal recognition.

Natural Language Processing (EP_NALP)
Automatic natural language processing with machine learning releases researchers from the labor-intensive work of text processing. It is essential to design datasets for training and testing systems with the ability of natural language processing, which guarantees the precision of these systems. Therefore, we take the dataset for natural language processing as a sub-extension point, EP_NALP, to design more processing methods for various processing tasks.

• Enron. Enron [55] is an email dataset collected from 150 senior managers of Enron, which contains about 500,000 real email messages. All emails from one manager are organized into a folder, which contains the information of message-ID, date, sender, recipient, and messages.
• The corpus in [57] contains petabytes of data, including raw web page data, metadata extracts, and text extracts obtained by web crawling. This corpus provides web-scale data for developing or evaluating natural language processing algorithms empirically.
• SNLI. The SNLI corpus [58,59] contains 570,000 human-written English sentence pairs manually labeled as entailment, contradiction, or neutral. This dataset is used to evaluate models of natural language inference.

Speech Recognition (EP_SR)
This sub-extension point EP_SR allows designing datasets for learning speech recognition, with which the information omitted by human can be detected and represented by machine.

• Speech Commands. Speech Commands [60,61] is an audio dataset for detecting a single keyword from a set of spoken words. This dataset focuses on the audio of on-device trigger phrases and contains 105,829 utterances of 35 words in the format of WAVE files. All these utterances were recorded from 2618 speakers, and each utterance is one second or less.
• Free Spoken Digit Dataset. FSDD [62] is a dataset of 2000 audio recordings of spoken English digits in the format of WAVE files. These recordings were made by four speakers, with 50 recordings of each digit per speaker.
• Million Song Dataset. This dataset [63,64] provides the feature analysis and metadata of one million popular songs, and the audio can be fetched by the code in this dataset.
• LibriSpeech. LibriSpeech [65,66] is a collection of about 1000 hours of read English speech with sentence-level labels. Therefore, this dataset is suitable for full speech recognition.

Android Malware Detection (EP_AMD)
The sub-extension point EP_AMD allows researchers to design more malware datasets or enlarge the following existing datasets, in order to detect Android malware and ensure the security of smart phones as they develop rapidly.

• Genome. The Genome dataset [70,71] has collected more than 1200 malware samples covering different aspects, such as the installation methods, activation mechanisms, and the nature of the carried malicious payloads.
• VirusTotal. VirusTotal [72] is a dataset of mobile app samples for analyzing suspicious files and URLs to detect types of malware including viruses, worms, and trojans.
• Contagio. Contagio [73] is a platform for collecting the most recent malware; thus, it can be used to analyze and help detect unknown malware.

Discussion
Although we have investigated the possible extension points, sub-extension points, and existing techniques of AI-in-the-loop system testing, AI techniques evolve rapidly with the growth of application areas and computation ability. Consequently, even the points provided in ExtendAIST may not be adequate for developing new techniques. This is the main threat to the validity of our proposed ExtendAIST framework.

Conclusions
The AI-in-the-loop system is defined in this paper to discuss the main differences between testing an ordinary system and testing an AI-in-the-loop system in terms of testing coverage, testing data, testing property, and testing approach for unit testing and system testing, respectively. We find that it is essential to design or extend new testing techniques based on the existing techniques of AI-in-the-loop system testing. Therefore, we propose an extendable framework, ExtendAIST, to explore the design space of testing AI-in-the-loop systems. ExtendAIST provides five extension points, 19 sub-extension points, and the corresponding testing requirements. Each extension point involves one step in the AIS testing procedure, and the sub-extension points of an individual extension point are the implemented approaches that could be extended further. Additionally, ExtendAIST provides some existing techniques in each sub-extension point, which can be reused directly or extended into new techniques. According to the description of each point, we find that there exist some opportunities for AI-in-the-loop system testing in the following extension points.

• EP_TCM. Training and testing procedures should be enhanced to adapt to different systems with more complex networks. Thus, the extension point EP_TCM provides the first opportunity to increase the testing coverage on differently scaled AI-in-the-loop systems with regard to the traditional software testing coverage metrics.
• EP_ED. Since enormous test data require powerful computing ability and incur expensive computing cost, the extension point EP_ED provides the second opportunity to design datasets of relatively smaller size for various application areas.
• EP_TT and EP_FV. AI-in-the-loop system testing techniques currently focus on assuring the system correctness, safety, and robustness. This is the third opportunity provided by EP_TT and EP_FV to design new testing approaches to test more properties of AI-in-the-loop systems, including efficiency, interpretability, and stiffness.