A Model for Predicting Statement Mutation Scores

: A test suite plays a key role in software testing. Mutation testing is a powerful approach to measure the fault-detection ability of a test suite. The mutation testing process requires a large number of mutants to be generated and executed. Hence, mutation testing is also computationally expensive. To solve this problem, predictive mutation testing builds a classiﬁcation model to predict the test result of each mutant. However, the existing predictive mutation testing methods only can be used to estimate the overall mutation scores of object-oriented programs. To overcome the shortcomings of the existing methods, we propose a new method to directly predict the mutation score for each statement in process-oriented programs. Compared with the existing predictive mutation testing methods, our method uses more dynamic program execution features, which more adequately reﬂect dynamic dependency relationships among the statements and more accurately reﬂects information propagation during the execution of test cases. By comparing the prediction effects of logistic regression, artiﬁcial neural network, random forest, support vector machine, and symbolic regression, we ﬁnally decide to use a single hidden layer feedforward neural network as the predictive model to predict the statement mutation scores. In our two experiments, the mean absolute errors between the statement mutation scores predicted by the neural network and the real statement mutation scores both approximately reach 0.12.


Introduction
When a programmer writes a program, a mistake may occur in the code.For example, a programmer may incorrectly write x=x-1 as x=x+1, x=x*1, x=x%1, etc.This mistake is referred to as a software fault (i.e., a software bug).When this fault is executed, an incorrect execution result may appear on the corresponding statement.This incorrect execution result often is referred to as a software error and cannot be directly observed.When this software error propagates to an observable program output, a software failure occurs.
A strong-power test suite may detect more software faults than a weak-power one, thus measuring the fault detection capability of a test suite is an important question in software testing.Mutation testing is an approach to determine the effectiveness of a test suite [1][2][3].
The programs with software faults are called mutants.In mutation testing, mutants are generated through automatically changing the original program with mutation operators, where each mutation operator is a rule and can be applied to program statements to produce the program version with a software fault.A mutant is said to be identified by a test suite if at least one test case from the test suite has different execution results on the mutant and the original program.Mutation score, which is the ratio of all identified mutants to all mutants, has been widely used to assess the adequacy of a test suite.
Although mutation testing is obviously useful, it is extremely expensive [4,5].For example, using 108 mutation operators, Proteum [6] generates 4937 mutants for tcas, which is the smallest program among the Siemens programs and contains only 137 non-commenting and non-whitespace lines of code.Thus, testing a large number of mutants can be a big burden.
For solving this problem, researchers have proposed some optimization methods to reduce the cost of mutation testing, such as random mutation [7,8], mutant clustering [9] and selective mutation [10,11].For quickly calculating the mutation score of the whole program, these methods attempt to use a mutant sample to represent all mutants.Random mutation randomly chooses some mutants from all mutants to construct mutation samples.A mutant clustering algorithm first classifies all mutants into different clusters so that the mutants in a cluster have similar identification difficulties, and then selects a small number of mutants from each cluster to construct the mutant sample.Selective mutation uses only a subset of mutation operators to generate a mutant sample.
Different from the above mutant reduction methods, the predictive mutation testing methods [12,13] have been proposed in recent years.The predictive mutation testing methods extract some features related to program structures and testing processes and apply machine learning to predict each mutant's test result (i.e., the identification result).Moreover, these predictive methods' execution time is short.However, the existing predictive mutation testing methods are all designed for object-oriented programs.The same as other methods, the existing predictive mutation testing methods are also mainly used for estimating the mutation score of the whole program.The main differences among the above mutant reduction methods can be shown in Table 1.To make up for the shortcomings of existing predictive mutation testing methods, based on the execution impact map [14] Goradia uses, we suggest a new predictive method.This new method is not only suitable for procedure-oriented programs but also can use a single hidden layer feedforward neural network and seven statement features to predict the mutation score of each program statement.
The prediction of the statement mutation scores includes two major phases: extracting the statement features and determining the mathematical form of predictive model.In the feature extraction phase, we obtain the following seven features to express the effect of a statement on the program outputs: number of executions, path impact factor, value impact factor, generalized path impact factor, generalized value impact factor, latent impact factor, and information hidden factor.In fact, among the above seven features, only a number of executions are adopted by existing predictive mutation testing methods.Compared with the existing predictive mutation testing methods, our method more accurately expresses information propagation among the statements.For a statement, except for the number of executions, its six other features are extracted from the following six aspects respectively: When a test case executes on the statement containing a software fault , an error may be generated.This error either propagates along the original execution path or changes the original execution path.
(1) The fault in the statement may change the program output by generating the errors that propagate along the original execution paths.From this aspect, we extract the statement's value impact factor.
(2) The fault in the statement may change the program outputs by generating the errors altering the original execution paths.From this aspect, we extract the statement's path impact factor.However, in a few cases, the change of execution path does not result in a change of program output.Therefore, we need to analyze further the features of the changed program branch in order to more accurately predict how likely the program output will be changed.
(3) The no longer executed branches lose their ability to pass their information along the original execution path to the program outputs.The loss of this capability may cause the program output to be changed.From this aspect, we extract the statement's generalized value impact factor.
(4) The no longer executed program branch is no longer able to influence the selection of subsequent program branches.Loss of this ability may also impact the program output.From this aspect, we extract the statement's generalized path impact factor.(5) The fault in a statement may cause some program branches, which has not been executed, will be executed.Executing these new branches may cause the program output to change.From this aspect, we extract the statement's latent impact factor.
(6) Sometimes, the program under testing has multiple output statements, some of which happen to have the same output values.In this case, even if the software fault changes the execution path of the test case, the program outputs could still be the same.From this aspect, we extract a statement's information hidden factor.
Among these six factors, the first five factors facilitate program output changes, and the last one prevents program output from changing.
In the phase of determining mathematical form of the predictive model, we compared the following five machine learning models based on Brier scores: artificial neural network (ANN), logical regression (LR), random forest (RF), support vector machine (SVM) and symbolic regression (SR).From the experiment results, the artificial neural networks were identified as the most suitable predictive model.
With the methods in this article, we analyzed the two programs.In the two experiments, the mean absolute errors between the real statement mutation scores and predictive statement mutation scores are 0.1205 and 0.1198, respectively.
The remainder of this paper is organized as below: in Section 2, we introduce some basic terms used throughout the entire paper.In Section 3, we define seven statement features.In Section 4, we propose a method for quickly calculating statement features.In Section 5, we compare the prediction accuracy of five machine learning models.In Section 6, we introduce the structure of our automated prediction tool.In Section 7, we describe the work to be performed.

Basic Terms Definition 1. Original program and mutation score.
In this paper, a program without any software fault is also called an original program.For example, Program 1 is an original program.It first outputs the factorial of the absolute value of the difference between m and n, and then classifies the factorial.Based on the relationships among m, n and the factorial, the execution results of the program are divided into three areas, the first and third of which belong to the first class, and the second of which belongs to the second class.
A program with software faults is called a mutant.In mutation testing, mutants are generated through automatically changing the original program with mutation operators.For example, in terms of Program 1, if the statement dist=m-n is changed into dist=m%n, then the mutant m 1 is generated as shown in Program 2. If a test suite (i.e., a collection of test cases ) can identify the mutant m 1 , it must satisfy the following conditions: there must be at least one test case in the test suite to execute the statement dist=m%n in m 1 , the execution result of dist=m%n must be different from that of dist=m-n, and the difference must be propagated to the program output.
Program mutation score is the proportion of identified mutants in a program, which is used to assess how well the program is tested by the test suite.Statement mutation score is the the proportion of identified mutants in a statement, which is used to assess how well the statement is tested by the test suite.

Definition 2. Program statement and branch.
In this article, we predict the ability of a test suite to test each line program code.A statement in the program under testing usually occupies one line.Because a control expression usually occupies a line in the program, in this paper, we also think of a controlling expression as a statement.As shown in Program 1, we denote gth statement as s g .According to C programming language standard-C99 [15], a controlling expression can occur in "if", "switch", "while", "do while" and "for" statements and decides which of the program branches is executed.
In terms of if-else statement, if its controlling expression appears in the rth line, then we denoted its controlling expression as s r , and use B r,t and B r, f to denote the true branch and false branch of s r , respectively.In terms of a loop statement (such as while loop, do-while loop and for loop), we regard it as the combination of the controlling expression and the corresponding program branch.If a loop statement's controlling expression appears in the rth line, then its controlling expression is denoted as s r , and the corresponding loop body is considered as the true branch of s r , so that this loop body can also be denoted as B r,t .According to this representation method, the program branch whose function is to exit the loop is denoted as B r, f .For example, in Program 1, s 9 is the controlling expression, the statement s 10 constitutes its true branch B 9,t , and the statements s 11 , s 12 and s 13 constitute its false branch B 9, f .The statements s 6 and s 7 constitute the loop body of the while loop, and, in this situation, the loop body is also considered as the true branch B 5,t of the controlling expression s 5 .

Definition 3. Statement instance and branch instance.
A statement may be executed multiple times by a test suite, so that multiple execution instances are generated.The statement's each execution instance is called its a statement instance.The hth execution instance of test case t k on statement s g is denoted as s h g,t k .In this paper, the execution instance of a program output statement is called an output statement instance.In addition, the execution instance of a controlling expression is also considered as a special statement instance, and is called a controlling expression instance.
For example, when Program 1 is executed by test case t 1 (m = 4, n = 1), the assignment statement s 4 , controlling expression s 5 , controlling expression s 9 , controlling expression s 11 and output statement s 13 are executed once, three times, once, once and once.This allows them to produce one, three, one, one and one execution instance, respectively, during the execution of the test case t 1 .Among them, the controlling expression instances s 1 5,t 1 , s 2 5,t 1 and s 3 5,t 1 , respectively, represent the first, second and third executions of the test case t 1 on the statement s 5 .
A program branch may also be executed multiple times, so that many execution instances are generated.Each execution instance of the program branch is called a branch instance.Just as a program branch consists of many statements, a branch instance consists of many statement instances.These statement instances are called the statement instances in the branch instance.A bit similar to the symbols of statement instances, we use B l r,z,t k to represent the lth execution instance of the test case t k on the program branch B r,z , where z represents the true or the false branch, and its value is t or f .Whether B r,z is executed depends on the execution result of the controlling expression s r .
For example, the branch instance B 1 9,t,t 1 consists of s 1 10,t 1 , and the branch instance B 1 9, f ,t 1 consists of s 1 11,t 1 , s 1 12,t 1 and s 1 13,t 1 .In terms of the while statement in Program 1, s 5 is a controlling expression and generates three execution instances s 1 5,t 1 , s 2 5,t 1 and s 3 5,t 1 during the execution of the test case t 1 .Because the execution of B 1 5,t,t 1 is the necessary condition for B 2 5,t,t 1 to be executed, B 2 5,t,t 1 is contained in B 1 5,t,t 1 .As shown in Table 2, Figures 1 and 2, the first branch instance B 1 5,t,t 1 of the while loop consists of the statement instances s 1 6,t 1 , s   The execution history H k of the test case t k is formed when the test case t k executes on an original program.The execution history H k is an execution trace, each element of which is a statement instance.These statement instances are ordered by time until the last program output.In this paper, the execution history H k of the test case t k is also called the original execution path of t k .
For example, consider the Program 1, where test case t 1 (m = 4, n = 1), test case t 2 (m = 2, n = 2), and test case t 3 (m = 1, n = 4) constitutes the test suite T. As shown in Table 2, when t 1 is executed, H 1 is generated, and the program outputs fac = 6 and class 1.When t 2 is executed, H 2 is generated, and the program outputs fac = 1 and class 2. When t 3 is executed, H 3 is generated and the program outputs fac = 6 and class 1.

Definition 5. Execution impact graph
An execution impact graph G k is formed when the test case t k executes.The execution impact graph G k consists of multiple impact arcs generally, and each impact arc expresses the information propagation between the statement instances.In terms of an impact arc, the arc tail s j i,t k is called a direct impact predecessor, and the arc head s h g,t k is called a direct impact successor.In the practical application, if a variable is assigned in the statement instance s j i,t k and is directly used at the statement instance s h g,t k , then s j i,t k is a direct impact predecessor of s h g,t k , and s h g,t k is a direct impact successor of s j i,t k .
In the execution impact graph G k , each node is expressed in the form of , where s j i,t k denotes a statement instance and the symbol * indicates that the statement s i is not executed by test case t k .
For example, when program 1 is executed by test cases 1, 2, and 3, the corresponding execution impact graphs are generated respectively, as shown in Figures 1, 2, and 3.In Program 1, the variable dist is defined in the statement s 2 and is directly used in the statements s 5 , s 6 and s 7 .Hence, when the test case t 1 is executed, s 1 2,t 1 becomes the direct impact predecessor of s 1 5,t 1 , s 1 6,t 1 and s 1 7,t 1 , respectively.In this situation, s 1 5,t 1 , s 1 6,t 1 and s 1 7,t 1 become the direct impact successors of s Each direct impact successor of a statement instance may have its own direct impact successor.Thus, the impact successor is transitive.If a statement instance is the impact successor of the statement instance s h g,t k but it is not the direct impact successor of s h g,t k , then this statement instance is called the indirect impact successor of s h g,t k .Thus, the impact successor can be divided into two types: the direct impact successor and the indirect successor.For example, s 1 5,t 1 , s 1 6,t 1 , s 1 7,t 1 , s 2 5,t 1 , s 2 6,t 1 , s 2 7,t 1 , s 3 5,t 1 , s 1 8,t 1 and s 1 11,t 1 are all the impact successors of s 1 2,t 1 .However, s 1 5,t 1 , s 1 6,t 1 and s 1 7,t 1 are the direct impact successors of s 1 2,t 1 , and s 2 5,t 1 , s 2 6,t 1 , s 2 7,t 1 , s 3 5,t 1 , s 1 8,t 1 and s 1 11,t 1 are the indirect impact successors of s 1 2,t 1 .If there is a fault f g in statement s g , and s h g,t k is an execution instance of statement s g , then f g may change the execution result of s h g,t k during the execution of the test case t k .If this change happens, we say that a error e h g,t k is generated from the statement instance s h g,t k .In this paper, an error is different from a fault .Errors are dynamic and are generated in the process of the test case execution.However, faults are static.Whether the program under testing is executed or not, they may exist in the program under testing.

Formal Definitions of Statement Features
In this section, we propose the seven features of a statement.The most of them are related to execution paths of test cases.When the statement containing a software fault is executed by a test case, an error may generate.After this error generates, it either propagates along the original execution path of the test case or changes the original execution path.The value impact factor describes the ability of the fault existing in a statement to affect the program output under the condition that the execution path is unchanged.The path impact factor, the generalized value impact factor, the generalized path impact factor and the latent impact factor describe the abilities of the fault existing in a statement to affect the program output under the condition that the execution path is changed by the generated error.

Value Impact Factor
The value impact factor of a statement expresses its ability to directly impact the program outputs along the execution paths of the test cases.

Value Impact Factor of Statement
The errors generated from the statement instance s h g,t k may propagate along the original execution path H k to some execution instances of the output statements.Each of these output statement instances is called the value impact element of the statement instance s h g,t k .The collection consisting of all value impact elements of s h g,t k is called value impact set of the statement instance s h g,t k , and denoted as V h g,t k .A statement s g has multiple execution instances generally and each execution instance has its own value impact set.The union of these value impact sets is called the value impact set of s g , and is denoted as V g .The element in V g is called the value impact element of s g .The number of value impact elements of s g is called the value impact factor of s g , and is denoted as x vi (s g ).Therefore, the following formula holds: where K is the total number of test cases in the test suite, and H gk is the total number of times the statement s g is executed by the test case t k .

The Value Impact Relationship between Statement Instance and Its Direct Impact Successors
According to the relationship between the impact precursor and the impact successor, we get the following conclusion: If the statement instances s p n ,t k are all direct impact successors of the statement instance s h g,t k , then we get Example 2. From Figure 1, we can know that the direct impact successors of the statement instance s 1 7,t 1 consist of s 2 5,t 1 , s 2 6,t 1 and s 2 7,t 1 .Under the condition that we know V 2 5,t 1 = ∅, V 2 6,t 1 = {s 1 8,t 1 } and V 2 7,t 1 = ∅, we have This formula indicates that the errors generated from the statement instance s 1 7,t 1 can reach up to one output statement instance s 1 8,t 1 when it propagates along the original execution path of the test case t 1 .Using the same method, we also know V 1 6,t 1 = {s 1 8,t 1 }, V 2 6,t 1 = {s 1 8,t 1 }, V 2 5,t 1 = ∅ and V 3 5,t 1 = ∅.

Value Impact Set of Branch Instance
The information expressed by the statement instances in the branch instance B l r,z,t k can propagate along the original execution path H k to some execution instances of the program output statements.These affected output statement instances constitute the value impact set V l r,z,t k of the branch instance B l r,z,t k .We can get the following formula: where k are all the statement instances in the branch instance B l r,z,t k .
Example 3. We can use formula (3) to calculate the value impact set of the branch instance B 1 5,t,t 1 .From Example 2, we know both Because the branch instance B 1 5,t,t 1 consists of the six statement instances s 1 6,t 1 , s 1 7,t 1 , s 2 5,t 1 , s 2 6,t 1 s 2 7,t 1 , and s 3 5,t 1 , we get

Value Impact Set of the Special Statement Instance
If a statement instance is an output statement instance, it usually does not have any impact successors.We set its value impact set to itself because the change of its execution result is precisely the change of program output.If a statement instance is not an output statement instance and does not have any impact successors, then we set its value impact set to an empty set.

Path Impact Factor
The path impact factor of a statement expresses its ability to directly impact the execution paths of the test cases.

Path Impact Factor of Statement
The more controlling expression instances a statement impact, the more easily the fault in the statement changes the execution paths of the test cases.The more likely the execution path is changed, the more likely the program output will be changed.Therefore, we take the number of the control expression instances impacted by a statement during the test suite execution as a feature to describe the effect of this statement on program output.For this purpose, we defined a statement's path impact factor.
The errors generated from the statement instance s h g,t k may propagate along the original execution path H k to some controlling expression instances.The collection of these controlling expression instances is called the path impact set P h g,t k of the statement instance s h g,t k .The element in P h g,t k is called the path impact element of s h g,t k .The path impact set of statement s g is the union of path impact sets of execution instances of s g , and denoted as P g .In other words, K is the total number of test cases in the test suite, and H gk is the total number of times the statement s g is executed by the test case t k .
Example 4. From Table 2, we know that the statement s 6 has four execution instances s 1 6,t 1 , s 2 6,t 1 , s 1 6,t 3 and s 2 6,t 3 .If s 6 includes a fault, then when s 6 is executed by the test suite, each execution instance may generate an error.The errors generated from first two statement instances s 1 6,t 1 and s 2 6,t 1 may propagate along the original execution path H 1 to the controlling expression instance s 1  11,t 1 .Along the original execution path H 3 , the errors generated from the last two statement instances s 1 6,t 3 and s 2 6,t 3 cannot be propagated to any controlling expression instance.Therefore, P 1 6,t 1 = P 2 6,t 1 = {s 1 11,t 1 } and P 1 6,t 3 = P 2 6,t 3 = ∅.Using Formula (4), we get

The Path Impact Relationship of the Statement Instance and Its Direct Impact Successor
According to the relationship between the impact precursor and the impact successor, we get the following conclusion: If the statement instances s p n ,t k are all direct impact successors of the statement instance s h g,t k , then Example 5. From Figure 1, we can know the the direct impact successors of the statement instance s 1 7,t 1 consist of s 2 5,t 1 , s 2 6,t 1 and s 2 7,t 1 .Under the condition that we know P 2 5,t 1 = {s 2 5,t 1 }, P 2 6,t 1 = {s 1 11,t 1 } and P 2 7,t 1 = {s 3 5,t 1 }, according to Formula (5), we have Therefore, the errors generated from the statement instance s 1 7,t 1 can change up to three controlling expression instances s 2 5,t 1 , s 1 11,t 1 and s 3 5,t 1 along the original execution path H 1 .Using the same method, we can also get P 1 6,t 1 = {s 1 11,t 1 }, P 3 5,t 1 = {s 3 5,t 1 }, and so on.

Path Impact Set of Branch Instance
The statement instances in the branch instance B l r,z,t k may propagate their information along the original execution path H k to some of the controlling expression instances outside of B l r,z,t k .These controlling expression instances constitute the path impact set P l r,z,t k of the branch instance B l r,z,t k .The path impact set of B l r,z,t k express the impact of B l r,z,t k on the controlling expression instances outside of B l r,z,t k .If k are all the statement instances in the branch instance B l r,z,t k , then the following mathematical formula holds Example 6.We illustrate the formula above by calculating the path impact set of the branch instance B 1 5,t,t 1 .From Example 5, we know that P 1 6,t 1 = {s 1 11,t 1 }, P 1 7,t 1 = {s 2 5,t 1 , s 3 5,t 1 , s 1 11,t 1 }, P 2 5,t 1 = {s 2 5,t 1 }, P 2 6,t 1 = {s 1 11,t 1 }, P 2 7,t 1 = {s 3 5,t 1 }, and P 3 5,t 1 = {s 3 5,t 1 }.Because the branch instance B 1 5,t,t 1 consists of the six statement instances s 1 6,t 1 , s If a statement instance is a controlling expression instance, it usually does not have any impact successors.We set its path impact set to itself because the change of its execution result is precisely the change of the program execution path.If a statement instance is not a controlling expression instance and does not have any impact successors, then we set its path impact set to an empty set.

Generalized Value Impact Factor
The generalized value impact factor of a statement expresses its ability to indirectly impact the program outputs.

Generalized Value Impact Factor of Statement
The error generated from the statement instance s h g,t k may propagate to some controlling expression instances along the original execution path of test case t k , so that the execution results of these controlling expression instances may be changed.As long as the execution result of the control expression instance s l r,t k is changed, the branch instance B l r,z,t k , which appears in the original execution path H k , will no longer be executed.This makes the statement instances in B l r,z,t k no longer pass their information to some output statement instances.Thus, the execution results of these output statement instances may be changed.Therefore, the errors generated from the statement instance s h g,t k may indirectly affect some output statement instances through the above error propagation process.These output statement instances that may be indirectly influenced by s h g,t k form the generalized value impact set of the statement instance s h g,t k .The generalized value impact set of the statement instance s h g,t k is denoted as V h g,t k .The element in V h g,t k is called the generalized value impact element of s h g,t k .The number of generalized value impact element of s h g,t k is called the generalized value impact factor of s h g,t k , and is denoted as x gvi (s h g,t k ).A statement s g has multiple execution instances generally and each execution instance has its own generalized value impact set.In order to describe this indirect effect of s g on program output, the union of these generalized value impact sets is called the generalized value impact set of s g .The generalized value impact set of s g is denoted as V g , the element in V g is called the generalized value impact element of s g , and the number of the generalized value impact element of s g is called the generalized value impact factor of s g .In summary, where K is the total number of test cases in the test suite, and H gk is the total number of times the statement s g is executed by the test case t k .

Generalized Value Impact Set of the Special Statement Instance
A controlling expression instance s l r,t k usually does not have any impact successors.Corresponding to s l r,t k , there is usually a branch instance B l r,z,t k that appears in the original execution path H k .In this situation, the generalized value impact set of s l r,t k is equal to the value impact set of the branch instance B l r,z,t k .This conclusion can be interpreted as follows: If an error is generated from the controlling expression instance s l r,t k , then the branch instance B l r,z,t k will no longer be executed, so that the statement instances in B l r,z,t k can no longer propagate their information along the original execution path H k to some output statement instances.This error propagation process also exactly reflects the impact of branch instance B l r,z,t k on program output.Hence, the above conclusion is proved.For example, the generalized value impact set of the controlling expression instance s 1 5,t 1 is equal to the value impact set of the branch instance B 1 5,t,t 1 .If a statement instance is not a controlling expression instance and does not have any impact successors, then we set its generalized value impact set to an empty set.

Generalized Path Impact Factor
The generalized path impact factor of a statement expresses its ability to indirectly change the program execution path.

Generalized Path Impact Factor of Statement
The error generated from the statement instance s h g,t k may propagate to some controlling expression instances along the original execution path of test case t k .As long as the execution result of the control expression instance s l r,t k is changed, the branch instance B l r,z,t k that appears in the original execution path H k will no longer be executed.The statement instances in B l r,z,t k will no longer pass their information to the controlling expression instances appearing after B l r,z,t k .In this situation, the execution results of the controlling expression instances appearing after B l r,z,t k may be changed because they are no longer influenced by the statement instances in B l r,z,t k .Therefore, the errors generated from the statement instance s h g,t k may indirectly affect some controlling expression instances appearing after B l r,z,t k through the above error propagation process.These controlling expression instances that may be indirectly affected by s h g,t k through the above error propagation process form the generalized path impact set of the statement instance s h g,t k .This set is denoted as P h g,t k , the element of which is called the generalized path impact element of s h g,t k .The number of generalized path impact elements of s h g,t k is called the generalized path impact factor of s h g,t k , and is denoted as x gpi (s h g,t k ).
A statement s g has one or more execution instances generally.Therefore, the generalized path impact set of s g is defined as the union of generalized path impact sets of the execution instances of s g , and is denoted as P g .In other words, where K is the total number of test cases in the test suite, and H gk is the total number of times the statement s g is executed by the test case t k .The element in P g is called the generalized path impact element of s g .The number of generalized path impact element of s g is called the generalized path impact factor of s g , and denoted as x gpi (s g ).
Example 9. We explain the above definitions by calculating the generalized path impact set of the statement s 7 .
In Program 1, the statement s 7 has four execution instances s If a statement instance s l r,t k is a controlling expression instance, then it usually does not have any impact successors.Corresponding to s l r,t k , there is usually a branch instance B l r,z,t k , which exists in the original execution path H k .In this situation, the generalized path impact set of s l r,t k is precisely the path impact set of B l r,z,t k .This conclusion can be interpreted as follows: Assume there is a software fault in statement s r .If an error is generated from the controlling expression instance s l r,t k , then the branch instance B l r,z,t k will no longer be executed, the information expressed by the statement instances in B l r,z,t k can no longer propagate along the original execution path H k to some controlling expression instances outside of B l r,z,t k .This error propagation process also exactly reflects the impact of branch instance B l r,z,t k on the execution path of the test case t k .Therefore, the generalized path impact set of s l r,t k is equal to the path impact set of B l r,z,t k .For example, the generalized path impact set of the controlling expression instance s 1 5,t 1 is equal to the path impact set of the branch instance B 1 5,t,t 1 .In other words, P 1 5,t 1 = P 1 5,t,t 1 = {s 1 11,t 1 }.Otherwise, if a statement instance is not a controlling expression instance and does not have any impact successors, then we set its generalized path impact set to an empty set.

The Generalized Path Impact Relationship between a Statement Instance and Its Direct Impact Successors
According to the relationship between the impact precursor and the impact successor, we get the following conclusion: If the statement instances s p n ,t k are all direct impact successors of the statement instance s h g,t k , then Example 10.In Program 1, s 2 5,t 1 , s 2 6,t 1 and s 2 7,t 1 are all direct impact successors of the statement instance s

Latent Impact Factor
The fault in a statement may cause some program branches that have not yet been executed to be executed.The latent impact factor expresses the impact of these branches to be executed on the program output.

Latent Impact Factor of the Program Statement
Contrary to the branch instances that will no longer be executed, some branch instances may be going to be executed due to the error generated from the statement instance s h g,t k .These branches to be executed may change the program outputs.For an example, in Program 1, if the assignment statement s 2 is mutated into dist=m%n, then the remainder dist becomes zero when test case t 1 runs.In this situation, the true branch B 11,t of s 11 , which consists of s 12 and does not appear in the original execution path H 1 , will be executed and change the program output.
These branch instances to be executed are divided into two classes.In the first class, each branch instance contains statement instances.In the second class, each branch instance does not.The first class branch instances constitute the latent impact set of statement instance s h g,t k , and denotes as L h g,t k .The element in L h g,t k is called the latent impact element of the statement instance s h g,t k .The number of latent impact elements of s h g,t k is called the latent impact factor of s h g,t k and denoted as x li (s h g,t k ).A statement s g has multiple execution instances generally, and each of them has its own latent impact set.Therefore, the union of these latent impact sets is defined as the latent impact set of the statement s g , and denoted as L g .In other words, where K is the total number of test cases in the test suite, and H gk is the total number of times the statement s g is executed by the test case t k .The element in L g is called the latent impact element of s g .The number of latent impact element of s g is called the latent impact factor of the statement s g , and denoted as x li (s g ).

The Latent Impact Relationship between a Statement Instance and its Direct Impact Successors
According to the relationship between the impact precursor and the impact successor, we get the following conclusion: If the statement instances s Example 12. From Figure 1, we can know the the direct impact successors of the statement instance s 1 7,t 1 consist of s 2 5,t 1 , s 2 6,t 1 and s 2 7,t 1 .Given that L 2 5,t 1 = ∅, L 2 6,t 1 = {B 1 11,t,t 1 } and L 2 7,t 1 = {B 3 5,t,t 1 }, according to Formula (12), we have In addition, under the condition that we know L 2 5,t 3 = ∅, L 2 6,t 3 = ∅ and L 2 7,t 3 = {B 3 5,t,t 3 }, using the same method, we can still get Furthermore, we can get t,t 1 , B 3 5,t,t 3 }.

Latent Impact Set of the Special Statement Instance
If a statement instance s l r,t k is a controlling expression instance and the branch instance B l r,z,t k does not appear in the original execution path H k ; then, in the condition that B l r,z,t k is not empty, we set B l r,z,t k as the only element in the latent impact set of s l r,t k .If a statement instance is not a controlling expression instance and does not have any impact successors, then we set the latent impact set of s l r,t k to an empty set.
For example, as far as the controlling expression instance s 2 5,t 1 is concerned, although the branch instance B 2 5, f ,t 1 does not appear in the original execution path H 1 , B 2 5, f ,t 1 does not include any statement instance.Hence, B 2 5, f ,t 1 is not a latent impact element of s 2 5,t 1 , and we set the latent impact set of s 3 5,t 1 to an empty set.As far as the controlling expression instance s 3 5,t 1 is concerned, because the branch instance B 3 5,t,t 1 not only does not appear in the original execution path H 1 but also is not empty, we set s 3 5,t 1 as the only element in the latent impact set of B 3 5,t,t 1 .

Information Hidden Factor
The last feature of a statement is its information hiding feature.Sometimes, the program has multiple output statements, and some of them happen to generate same outputs.In this case, even if the software fault in a statement changes the execution path of the test case, the output of the program may still not be changed.
This phenomenon make the faults in statements difficult to identify.For a statement s g , we use the information hiding factor to express this feature.The information hiding factor of s g can be calculated in the following way.We use the test cases that execute s g to construct sub test suite T g .When we execute T g , the program under testing generates some outputs.The information entropy of the output distribute is called the information hidden factor of statement s g , and denoted as x ih (s g ).In other words, where p i is the probability that the test cases executing the statement s g generate the ith program output class.
Example 13.We calculate the information hidden factors of the statements s 9 and s 11 , respectively.In Program 1, the test suite consists of the test cases t 1 , t 2 and t 3 .These three test cases all execute statement s 9 .Their executions generate three program outputs (fac = 6, class 1), (fac = 1 class 2) and (fac = 6 class 1), respectively.Hence, the probability that the program output (fac = 6, class 1) is 0.67, and the probability that the program output (fac = 1 class 2) is 0.33.According to Formula (13), the information hidden factor of statement s 9 is 0.9182 bit.The test cases t 1 and t 2 execute the statement s 11 .Their executions generate two program outputs (fac = 6, class 1) and (fac = 1 class 2), respectively.Hence, the probabilities that the program output (fac = 6, class 1) and (fac = 6, class 1) are both 0.5.According to Formula (13), the information hidden factor of statement s 11 is 1.0 bit.

Calculation of Statement Features
First, we propose an iterative method to compute statement features, and then compare the time cost of this method with that of direct mutant testing.

Calculation Process
We divide the calculation of all the statement features into two parts.The first part calculation includes the first five statement features: the value impact factor, the path impact factor, the generalized value impact factor, the generalized path impact factor, and the latent impact factor.The second part calculation includes includes the last two statement features: the number of times a statement is executed, and information hidden factor.
The first part of the calculation takes much more time than the second one.For reducing the computational complexity, we propose an iterative method.Generally, if a statement instance has at least one impact successor, then we can calculate its first five features according to the formulas (2), ( 5), ( 8), (10), and (12).Otherwise, we use the methods mentioned in Sections 3.1.4,3.2.4,3.3.2,3.4.2and 3.5.3 to calculate its first five features.
The computation of the statement features is divided into two corresponding stages.The first stage, including steps 1-6, calculate the first part of statement features.The second stage including steps 7 and 8, calculate the second part of statement features.The overall computation steps are as follows: Step 1 Set test case serial number k = 1.
Step 2 Construct the execution impact graph G k of the test case t k .
Step 3 First, from the original execution path of the test case t k , find all statement instances that have not been analyzed.From these unanalyzed statement instances, find the last executed statement instance.We might as well denote this statement instance as s h g,t k .(1) If s h g,t k has one or more impact successors, then we construct the impact sets of its first five features according to the formulas (2), ( 5), ( 8), ( 10) and ( 12).
(2) If s h g,t k does not have any impact successors, then we construct the impact sets of its first five features according to the methods mentioned in Sections 3.1.4,3.2.4,3.3.2,3.4.2and 3.5.3.
Step 4 If there are some statement instances which appear in the original execution path of test case t k but have not yet been analyzed, go to step 3, else go to step 5.
Step 5 If test case t k is not the last test case in test suite, then k = k + 1, and go to step 2, else go to step 6.
Step 6 First, construct each program statement's value impact set, path impact set, generalized value impact set, generalized path impact set and latent impact set by formulas (1), ( 4), ( 7), ( 9) and (11).Next, for each program statement , calculate its value impact factor, path impact factor, generalized value impact factor, generalized path impact factor factor and the latent impact factor.
Step 7 For each statement in program under testing, compute the total number of times it is executed by the test cases in the test suite.
Step 8 For each statement in the program under testing, compute its information hidden factor by formula (13).
Example 14.We illustrate the above process by extracting the features of each statement in Program 1.In terms of the first stage of extracting the statement features, whether the statement instances are generated during the execution of test case 1, test case 2 or test case 3, the methods for calculating features of the statement instance are the same.Therefore, with regard to steps 1 to 5, we only explain in detail how to calculate the features of the statement instances generated during test case t 1 execution.The detailed calculation process is as follows.
We first set k = 1, execute test case t 1 , and construct the execution impact graph G 1 of test case t 1 as shown in Figure 1.
Similar to the above procedure, we can continue to calculate the above first five impact sets for all statement instances generated during the executions of test case 2 and test case 3, respectively.Thus far, steps 1-5 are completed, and their final results are shown in Table 3.
After calculating the first five impact sets of each statement instance, based on Table 3 and Formulas (1), ( 4), ( 7), ( 9) and ( 11), we can get the first five impact sets of each statement, as shown in Table 4.The corresponding impact factors are shown in Table 5.
After calculating the first five impact factors of each statement, we calculate the execution number of each statement, as shown in Table 6.
Finally, similar to Example 13, we use the Formula (13) to calculate the information hidden factor of each statement.The calculation process are shown in Table 7.

Computational Complexity Analysis
Through the analysis of computational complexity, we can draw the following conclusion: Compared with the time required for direct mutation testing, the time used to calculate all statement features can be neglected.The computation time of statement features consists of two parts.The first part of time overhead is used to calculate the value impact factor, path impact factor, generalized value impact factor, generalized path impact factor and potential impact factor.The calculation of these features is relatively complex.The second part of time overhead is used to calculate the other two statement features.The calculation of these two features is relatively simple.Therefore, we can approximatively consider the first part of time overhead as the total time overhead for computing all statement features.In this situation, to prove the conclusion, we only need to prove that the first part of time overhead is much lower than that used to directly execute mutation testing.We can get this conclusion from the following four steps: In the first steps, we first suppose the time overhead that the computer spends to executes one statement once is T 0 .According the Section 4.1, we can get the following conclusion: the time overhead used to compute a factor of a statement instance is also roughly equal to T 0 .If a statement instance has at least one impact successor, then we use the formulas (2), ( 5), ( 8), ( 10) and ( 12) to calculate its first five impact factors, respectively.Otherwise, this statement does not have any impact successors, and we use the method in the Sections 3.1.4,3.2.4,3.3.2,3.4.2and 3.5.3 to calculate them, respectively.Whichever method is used, the calculation is simple.Therefore, we can consider that the time overhead used to compute an impact factor of the statement instance is roughly equal to T 0 .

Statement Ratio (fac = 6 class 1) Ratio (fac = 1 class 2) Information Hidden Factor
In the second step, we can conclude that the time overhead used to calculate all factors of all statement instances is roughly equal to five times ∑ G g=1 ∑ K k=1 H gk T 0 , where G is the total number of statements in program under testing, K is the total number of test cases in test suite, and H gk is the number of times statement s g is executed by the test case t k .Because statement s g generates ∑ K k=1 H gk execution instances, in terms of the program under testing, the total number of executed statement instances by the tests suite is ∑ G g=1 ∑ K k=1 H gk .Combining with the conclusion in the first step, we get the conclusion: In terms of the program under testing, the time overhead for computing all features of all statement instances is roughly equal to five times ∑ G g=1 ∑ K k=1 H gk .In the third step, we can conclude that the time overhead for direct mutation testing is ∑ G g=1 ∑ K k=1 n g |P k |T 0 , where we suppose that the statement s g generates n g mutants, and the test case t k executes |P k | statement instances.In the direct mutation testing, the program under testing generates ∑ G g=1 n g mutants, each mutant is tested by the test suite, and the time overhead for the test suite to test each mutant is ∑ K k=1 |P k |T 0 .Therefore, the time overhead for direct mutation testing is In the fourth step, we compare the time overhead used to calculate all features of all statement instances and the overhead used in the direct mutation testing.The ratio of the two time overheads is 5 H gk , we can get the final conclusion: Compared with the time required for direct mutation testing, the time overhead used to calculate all statement features can be neglected.

Machine Learning Algorithms Comparison and Modelling
Taking the Brier scores as a criterion, we compared the prediction effects of the following five models on statement mutation scores: artificial neural networks (ANN), logical regression (LR), random forests (RF), support vector machines (SVM) and symbolic regression (SR).The experiment result shows the artificial neural network algorithm has the highest prediction precision.We did not try very complex models because the model should not be too complicated.First, our sample size should not be too large.Our data records need to be extracted in real time, so that the excessively large sample size will cause the user to wait a long time.In the case of a small sample size, over-complexing models can cause over-fitting.Secondly, according to the introduction of the Section 3, we can know that the relationship between the dependent variable and each independent variable is monotonic, so we estimate that the available model should not be very complicated.

Experimental Subjects
In this paper, there are two programs under testing: schedule.cand tcas.c.We explain our experiment with schedule.cas the main part and tcas.c as the auxiliary part.The program schedule.crealizes a CPU process management, and the program tcas.c realizes an aircraft early warning system.A more specific introduction is as follows.
The program schedule.c[16] realizes a priority scheduling algorithm.A computer has only one CPU, but sometimes multiple programs simultaneously request to be executed.For solving this problem, the priority scheduling algorithm assigns each program a priority.When a program needs to use CPU, it is first stored in a queue so that the program with a higher priority gets a CPU, whereas the program with a lower priority can wait.The schedule.cconsists of 73 lines of C code including one branch statement, two single-loop statements and two double-loop statements.The test cases are included in its usage instructions.We take these test cases as a test suite of schedule.c.
The program tcas.c [17] is used to avoid collision of aircraft, which consists of 135 lines of C code with 40 program branch statements and 10 compound predicates.The tcas.c is able to monitor the traffic situation around a plane and offer information on the altitude of other aircraft.It can also generate collision warnings that another aircraft is in close vicinity by calculating the vertical and horizontal distances between the two aircrafts.The Software artifact Infrastructure Repository (SIR) also supplies some types of test case suites for tcas.c.From the SIR, we randomly selected a branch coverage test suite suite122 as the test suite used in our experiment.

The Construction Method of Data Set
To compare the prediction accuracy of the five machine learning models, we did two experiments with schedule.cand tcas.c,respectively.No matter the experiment, the data set is created in the same way.In each experiment, the data set contains 200 data records.Each data record r p is established with one corresponding mutant sample m p and contains seven independent variables and one dependent variable.If m p is generated by modifying the statement s q , then the seven independent variables of r p are the seven features of the statement s q , and the dependent variable of r p is the identification result of the mutant m p .
We take an example to explain the construction process of a data record.We might as well assume that a mutant sample m p is generated by modifying statement s 2 and identified by the test suite.Now, we use m p to construct one data record r p .Because m p is generated by modifying s 2 , the values of seven independence variables in r p are the seven features of statement s 2 , i.e., (1, 4, 2, 1, 2, 1, 0) as shown in Tables 5-7.Because m p is identified by the test suite, the value of dependence variable in r p is 1.Therefore, the data record r p is (1, 4, 2, 1, 2, 1, 0, 1).

Performance Metrics
A model may be considered good when it is evaluated using a metric, but, at the same time, the model may be considered bad when assessed against other metrics.For this reason, we will compare a few different common evaluation metrics and decide which of them is more suitable to our statement mutation score prediction.

Area under Curve
The two coordinates of the receiver operating characteristic (ROC) curve represent sensitivity and specificity, respectively.Through these two indicators, the ROC curve displays the two types of errors for all possible thresholds.The area AUC under the ROC curve is the quantitative indicator commonly used to evaluate a binary classification algorithm [18].

Logarithmic Loss
Logarithmic Loss works by penalising the false classifications [18].It works well for both binary classification and multi-class classification generally.For a binary classification, the logarithmic function is often used as a classifier's loss function.Logarithmic Loss closer to 0 indicates higher accuracy for the classifier.

Brier Score
The basic idea of Brier score is to compute the mean squared error (MSE) between the predicted probability scores and the true class indicator [19], where the positive class is coded as y i = 1, and negative class y i = 0.The most common formulation of the Brier score is shown as follows: The Brier score is a loss function, which means the lower its value, the better the machine learning model.

Metric Comparison
In the cross-validation process, we choose the Brill score as the model evaluation criterion.Our purpose is only to tell our users how likely the software bug in a statement will be detected by a test suite.Therefore, AUC is not suitable for us because it is also not directly related to the predicted probability.Because the logarithmic loss function may lead to an infinite penalty, it is also not used by us.The Brier score is the good score function because it is related to the predicted probability and is bounded.For the above reasons, we take the Brier score as an evaluation criterion in the cross-validation.

Model Comparing and Tuning
Under the condition of the same partitioning of the data set, we take the Brier score as a standard to evaluate the model.In our experiment, we tune hyperparameters and compare the prediction accuracies of five machine learning models.We use the same partitioning of the data set and the repeated 5-fold cross-validation to evaluate the prediction accuracy of the models because of the two following reasons.
(1) We tune some hyperparameters to find the optimal model settings with the help of the repeated 5-fold cross-validation method.During the 5-fold cross-validation, the samples are randomly partitioned into five equally sized folds.Models are then fitted by repeatedly leaving out one of the folds.In our each experiments, our data set contains 200 data records, so that the training and validation sets contain 160 and 40 data records, respectively.However the result from cross-validation is more or less uncertain generally.Therefore, in our experiment, five repeats of 5-fold cross-validation are used to effectively reduce this uncertainty and increase the precision of the estimates.Because each 5-fold cross-validation supplies a Brier score, five repeats of 5-fold cross-validation supply 5 Brier scores.Under each candidate combination of hyperparameters, we use the average of the five Brier scores to represent the prediction effects of the corresponding model.
(2) Because the performance metric is sensitive to the data splits, we thus compare the machine learning models based on the same partitioning of the data.Otherwise, the difference in performance will come from two different sources: the differences among the data splits and the differences among the models themselves.If one model is better than the other, we don't know if all performance differences are caused by model differences.
The compared models include the logistic regression, random forest, neural network, support vector machine and symbolic regression.We use their average Brier scores to assess their prediction effects.

Logistic Regression (1) Introduction to Logistic Regression
Conventional logistic regression [20,21] can predict the occurrence probability of a specific outcome.The conditional probability of a positive outcome could be expressed with the formula below: , where β i is the coefficient for the ith feature, and d is the total number of features.β 1 , β 2 , • • • , β d can be solved by the elastic net approach [22][23][24] as follows: where (2) Logistic regression tuning Glmnet [25] is an R language software package that can fit linear, logistic and multinomial, Poisson, and Cox regression models by maximizing the penalized likelihood.In order to predict the mutant score of each program statement in schedule.c,we use the ridge penalty algorithm in a glmnet software package to fit the logistic regression mode.Hence, during tuning hyper parameters, the penalized parameter α in the formula ( 14) is set to 0, and the penalized parameter λ is set to 10 i where i takes each integer from −7 to 7 in turn.In the cross-validation process, we use the Brier score as the model evaluation criterion.Under each penalized parameter, the five repeats of 5-fold cross-validation generate five Brier scores.We calculate the average of the five Brier scores under each candidate penalized parameter λ, so that we can use the average Brier score to represent the prediction effect of the model under the each candidate penalized parameter.
Figure 4 and Table 8 show the average Brier Score under each candidate value of the penalized parameter λ .In Figure 4, the profile shows a decrease in the average Brier score until the penalized value λ is 10 −2 .Therefore, the numerically optimally value of the penalized parameter is 10 −2 .Mean 0.1381 0.1381 0.1381 0.1381 0.1381 q q q q q q q q q q q q q q q −6 −4 −2 0 2 4 6 0.10 0.11 0.12 0.13

Random Forests (1) Introduction to Random Forest
The random forest model [26,27] can work for regression tasks and classification tasks generally.It is a tree-based model consisting of multiple decision trees.Each decision tree is created on an independent and random sample taken from the training data set.
The decision tree algorithm [18,28] is a top-down "greedy" approach that partitions the dataset into smaller subsets.This algorithm has a tree-like structure that predicts the value of a target variable based on several input variables.At each decision node, the features are split into two subsets and this process is repeated until the number of data in the splits falls below some threshold.According to the target variable's type, decision trees can be divided into regression trees and classification trees.The purpose of classification tree is to classify, and its target variable takes discrete values.The purpose of regression trees is to build a regression model, its target variable takes continuous values.
For regression, the regression tree algorithm begins with the entire data set S and searches every distinct value of every independent variable to find the appropriate independent variable and split its value that partitions the data into two subsets (S 1 and S 2 ) such that the overall sums of squares error are minimized, where y 1 and y 2 are the averages of the outcomes within subsets S 1 and S 2 , respectively.Then, within each of subsets S 1 and S 2 , this method searches again for the independent variable and splits its value that best reduces SSE.Because of the recursive splitting nature of regression trees, this method is also known as recursive partitioning.
For classification, the aim of classification trees is to partition the data into smaller, more homogeneous groups.Homogeneity in this context means that the nodes of the split are more pure.This purity is usually quantified by the entropy or Gini index.For the two-class problem, the Gini index for a given node is defined as where p 1 and p 2 are the probabilities of Class 1 and Class 2, respectively.In order to make a prediction for a given observation, the regression tree first analyzes which class this observation belongs to, and then takes the mean of the training data in the class as the prediction of this observation.When random forest algorithms are used, the result of regression question can be obtained by averaging predictions across all regression trees, and the result of the classification question can be obtained by a majority vote across all classification trees, respectively.The generalization error of a random forest depends on the errors of individual trees and the correlation between the trees.
(2) Random forest tuning The randomForest package [29] implements the random forest algorithm in the R environment.We use this software to predict statement mutation scores generated when the test suite executes on the program schedule.c.Because the statement mutation score can be considered as the probability of positive class in binary classification, we denote the positive class and negative class as 1 and 0, respectively, and let Random Forests run under regression mode to predict the probability of a positive class [30].In order to obtain a good prediction model, the different hyper parameter combinations are tried.The most important hyper parameter is mtry, which is the number of independent variables randomly selected at each split.In our experiment, we tried multiple candidate values of mtry (from 1 to 7).The other important tuning parameter is ntree, which is the number of bootstrap samples in the random forest algorithm.In theory, the performance of a random forest model should be a monotonic function of the number of trees (ntree).However, when ntree is greater than a certain number, the performance of a random forest model can only improve slowly.In our experiment, ntree is set to 1000.Under each candidate value of the parameter mtry, we calculate the average of the five Brier scores generated from the five repeats of 5-fold cross-validation.Furthermore, we use these averages to express the prediction effects of the random forest model under different candidate values of hyperparameter mtry.Figure 5 and Table 9 show the average Brier score under each candidate value of the hyperparameter mtry.As shown in Figure 5, the average Brier scores show a U shape, whose minimum value occurs in mtry = 3.Therefore, 3 is the optimal value of mtry.Mean 0.1166 0.0923 0.0888 0.0897 0.0914 0.0928 0.0925 q q q q q q q 1 2 3  Neural networks [18,31] can be used not only for regression but also for classification.The outcome of a neural network is modeled by an intermediary set of unobserved variables called hidden units.The simplest neural network architecture is the single hidden layer feed-forward network.
The working process of the single hidden layer feed-forward neural network is as follows.During the entire work of the neural network, all input neurons representing the original independent variables x 1 , x 2 , • • • , x s are first activated through the sensors perceiving the environment.Next, inside each hidden unit h k , all original independent variables are linearly combined to generate where k = 1, 2,• • • , r and r is the number of the hidden units.Then, by a nonlinear function g k , u k (x) is typically transformed into the output of hidden unit h k as follows: (i) When treating the neural network as a regression model, all g k (x) are linearly combined to form the output of neural network: All of the parameters β and γ can be solved by minimizing the the penalized sum of the squared residuals: where f (x i ) and y i are the predicted result and the actual result related to the ith observed data, respectively.
(ii) Neural networks can also be used for classification.Unlike neural networks for regression, an additional nonlinear transformation is used on the linear combination of the outputs of hidden units.
When the neural network is used for binary classification, it uses (21) to predict the class probability.The estimation of the parameters γ and β can be solved by minimizing the penalized cross-entropy where y i is the 0/1 indicator for the positive class.The neural network algorithm can also be used for multi-class classification.In this situation, the softmax transform outputs the probability that the sample x belongs to the lth class.Except the single hidden layer feed-forward network, there are many other types of models.For example, the famous deep learning approaches consist of multiple hidden layers.
(2) Neural network tuning As we said before, the our model must not be too complicated, so we select R package nnet [32] to predict the statement mutation scores of the test suite on schedule.c.The software package nnet implements a feed-forward neural network with a single hidden layer.The λ and r in formula (22) represent the weight decay and the number of units in the hidden layer, respectively.They are denoted as decay and size in nnet package, respectively.Therefore, decay is the regularization parameter to avoid over-fitting.
In our experiment, size is set in turn to each integer value between one and then.At the same time, the decay was set to 10 i where i takes each integer value from −4 to 5 in turn.
Figure 6 and Table 10 show the average Brier scores under the each candidate combinations of size and decay.From them, we can know that the optimal combination of the weight decay and hidden unit number is decay = 10 −2 and size = 8 because, at this time, the minimum average Brier score appears.

Nnet tuning
Weight Decay (log 10 − transformed) Average Brier Score q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q hidden unit number where w is the normal vector of the maximum-margin hyperplane w T x + b = 0, C is the regularization parameter, ζ i indicates a non-negative slack variable to tolerant some training data falling in the wrong side of the hyperplane, and b is a bias.The parameter C specifies the cost of a violation to the margin.When C is small, the margins will be wide and many support vectors will be on the margin or will violate the margin.When C is large, the margins will be narrow and there will be few support vectors on the margin or violating the margin.
The maximum-margin hyperplane can be obtained by solving the above problem.Given new data x, f (x) = w T x + b represents the signed distance between x and the hyperplane.We can classify the new data x based on the sign of f (x).
If the original problem is stated in a finite-dimensional space, it often happens that the sets to discriminate are not linearly separable.For solving this problem, a support vector machine maps the original finite-dimensional space into a higher-dimensional space, making the separation easier.
Let φ(x) denote the vector after x mapping.In this higher-dimensional space, the optimization problem can be rewritten into min or expressed in the dual form where k(x i , x j ) = φ(x i ) T φ(x j ) defines the kernel function greatly reducing the computational cost.By solving the above optimization problem, the optimal α * i and b * can be obtained.Therefore, the maximum-margin hyperplane in the higher-dimensional space is The kernel trick allows the support vector machine model to produce extremely flexible decision boundaries.The most common kernel functions are listed in Table 11:

Name
Expression Parameter The original SVM can be used for classification and regression without probability information.To solve this problem, Platt [33] proposed to use a logistic function to convert the decision value from a binary support vector machine to a probability.Formally, the probability of data x i being a positive instance is defined as follows: where f (x) = w T φ(x) + b is the maximum-margin hyperplane.The parameters A and B are derived by minimizing the negative log-likelihood of the training data: where n + denotes the number of positive training instances (i.e., y i = 1), and n − denotes the number of negative training instances (i.e., y i = −1).Newton's method with backtracking is a commonly used approach to solve the above optimization problem [34] and is implemented in LibSVM.Besides the binary classification, the support vector machine can also compute the class probabilities for the multi-class problem using one-against-one (i.e., pairwise) approach [35] .
(2) Support vector machine tuning Support vector machine algorithms are provided in the software package kernlab [36] written in the R language.We built the support vector machine based on the radial basis kernel function provided by this package.A radial basis kernel function maps the independent variables to an infinite-dimensional space.The regularization parameter C in formula ( 23) is called cost parameter in kernlab.A smaller C results in a smoother decision surface and a larger C results in a flexible model that strives to classify all training data correctly.The radial basis kernel function in kernlab package is shown in Table 11, where the parameter σ represents the inverse kernel width.A larger σ means a narrower radial basis kernel function.
When we use kernlab to predict the statement mutation scores of schedule.c,we hope to get the Brier score as small as possible by tuning C and σ.For this purpose, we first set the parameter σ to the median of x − x 2 [18,37,38].Next, let the parameter C take respectively as 2 −5 , 2 −3 , 2 −1 , 2 1 , 2 3 , 2 5 , 2 7 , 2 9 , 2 11 , 2 13 and 2 15 .Then, under each candidate value of C, we use the five repeats of 5-fold cross-validation to calculate the average Brier scores.
Figure 7 and Table 12 show the average Brier score generated by five repeats of 5-fold cross-validation at each candidate value of C. As shown in Figure 7, although there was a relatively large fluctuation, the average Brier score shows a general trend of first decreasing and then rising.From this figure, we can know that C = 2 −1 is the optimally value of the regularization parameter.At this time, the average Brier score reaches a minimum 0.0933.q q q q q q q q q q q −5 0  Symbolic regression can also be called function modeling.Based on the given data, it can automatically find the functional relationship, such as 2x 3 + 5, cos(x) + 1/e y , etc., between independent variables and dependent variables.
Throughout the modeling process, a function model f (x) is always coded as a symbolic expression tree.The input of symbolic regression is a data set, and the genetic programming method is often used to determine f (x).The genetic programming constantly changes an old function model into a new better fitted one by selecting the function with the better fitness value.A possible and frequently used fitness function is the average squared difference between the values predicated by f (x) and the actually observed values y as follows: Mutation operations and crossover operations are the two important ways to change function model f (x).A mutation operation directly changes a symbolic expression sub tree, and a crossover operation cuts a symbolic expression sub tree and replace it with a sub tree in another symbolic expression tree.
(2) Symbolic regression tunning The symbolic regression tool rgp [39] is an implementation of genetic programming methods in the R environment.We use rgp to predict the statement mutation scores of schedule.c.In our symbolic regression experiment, the most basic mathematical operators are set to the operators +, −, *, /, sin.An important tuning parameter in rgp is populationSize, which means the number of individuals included in a population, and is set to 100 in our experiment.
Another important tuning parameter is the number of evolution generations.Too few evolutionary generations produce an under-fitting, whereas too many evolutionary generations produce an over-fitting.We did a grid search to determine the optimal number of the generations, which minimizes the average Brier score.Because we need to complete the model fitting in a relatively short time, the number of evolution generations cannot be set too large.In our experiment, the candidate number of evolution generations is set to 3, 6, 9, 12, 15, 18, respectively.The five repeats of five-fold cross-validation are used to calculate the evolution effects (i.e., the average Brier scores) under each candidate number of evolution generations.
As shown in Table 13 and Figure 8, the average Brier scores oscillated down.In the 12th generation evolution, the smallest average Brier score 0.1504 appeared.

Comparing Models
Once the hyper parameters in the above five models have been determined for the above five models, we face the question: how do we choose among multiple models?The logistic regression model is used to set the baseline performance because its mathematical expression is simple and operation speed is fast.If other predictive models do not surpass it, the logistic regression model is used in future actual forecasting.
The boxplot in Figure 9 shows, under the condition that the Brier score is the standard, the neural network does the best job about predicting the statement mutation scores.The second best is the random forest model, which is a little better than the support vector machine model.The logistic model is second to last and greatly exceeded the symbolic regression.

Testing Predictions in Practice
According to Figure 9 and Tables 8-10, 12 and 13, we know that, in the process of repeated cross-validation, the average Brier scores of the logistic regression, random forest, neural network, support vector machine and symbolic regression are 0.0950, 0.0888, 0.0856, 0.0933 and 0.1504, respectively.Therefore, the neural network is the best model because its average Brier score is lower than other models.To further demonstrate the predictive effect of the neural network model on the schedule.c,we did the two following things.Firstly, we apply the neural network model, whose hyper-parameters have been tuned according to the method in Section 5.4.3, to predict the statement mutation scores of schedule.c.Under the condition that the schedule.cis used as the experiment subject, we calculate the mean absolute error between all statement mutation scores obtained by the neural network prediction and all real statement mutation scores.The experiment result shows the mean absolute error reaches 0.1205.Secondly, we randomly select 34 statements in schedule.c,and their two kinds of statement mutation scores are shown in Figure 10.In this figure, the horizontal coordinate represents the real statement mutation score, and the vertical coordinate represents the statement mutation score predicted by the neural network model.Each circle represents a statement, and the distance between each short dashed line and diagonal line is 0.1.From this figure, we can see that more than 60% of the statements are located between the two short dashed lines.q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0.0 0.2 0.4 0.6 0.

Further Confirmation of the Optimization Model
To further confirm that the prediction effect of the neural network is the best, we compared the five machine learning models again under the condition that the program tcas.c is used as the experimental subject.In the process of repeated cross-validation, the average Brier score of the neural network model reaches 0.1164.The average Brier scores of the logistic regression, the support vector machine, the random forest and the symbolic regression are 0.1233, 0.1249, 0.1289 and 0.1373, respectively.Therefore, the neural network is once again considered the best model because its average Brier score is lower than other models.To further demonstrate the predictive effect of the neural network model on the tcas.c,we apply the neural network model, whose hyper-parameters have been tuned according to the method in Section 5.4.3, to predict the statement mutation scores of tcas.c.Under the condition that the tcas.c is used as the experiment subject, the mean absolute error between real statement mutation scores and the statement mutation scores predicted by the tuned neural network reaches 0.1198.In order to illustrate the prediction results of the neural network more vividly, we randomly selected 31 statements in the program tcas.c.Their real statement mutation scores and the corresponding predicted mutation scores are shown in Figure 11.q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 0.0 0.2 0.4 0.6 0. Through the above analysis, we can see that, whether the experiment subject is schedule.cor tcas.c, the average Brier scores of the neural network are both the minimums.Thus, we recommend the single hidden layer feedforward neural network as the best model.In the two experiments, the mean absolute error between the statement mutation scores predicted by the neural network model and the real statement mutation scores both approximately reach 0.12.

Structure of the Automated Prediction Tool
The work process of our automatic analysis tool consists of the five parts, as shown in Figure 12: extracting the features of the statements in the program under testing, generating mutants, executing test suite on the each mutants, establishing the neural network model, and predicting the statement mutation scores.
In the first part, we extract the features of statements in the program under testing.First, we execute each test case and construct its execution impact graph with the open source software giri [40].Giri was originally a dynamic program slice tool and is currently modified by us.Next, we traverse the statement instances in reverse order of the execution history of the test cases.Whenever we visit a statement instance, we compute its features.After calculating the features of each statement instance, we calculate the features of each program statement according to the corresponding the statement instances.
In the third part, we execute the test suite on each mutant and record the corresponding identification result.
In the fourth part, we take the features of the mutant as independent variables and the identification result of the mutant as dependent variables to construct the prediction model with the neural network.
In the fifth part, we predict the mutation scores of each program statement with the constructed model.

Conclusions
In this paper, we predicted statement mutation scores while using a single hidden layer feedforward neural network and seven statement features.As analyzed in Section 5, each experimental result shows that the neural network is the best prediction model from the standpoint of the mean absolute error.The experimental results on two c programs demonstrate that our method can directly predict statement mutant scores approximately.The experiment results also show that the seven statement features that represent the dynamic program execution and testing process can basically reflect the impact of statements on program output.
However, two shortcomings need to be improved.Firstly, a part of statement features weakly related to program outputs still need to be discovered.If the real mutation score of a statement is low, then this statement usually only has some statement features weakly related to program outputs.In this case, the prediction effect of my model is not good because we only found a part of weakly relevant statement features, and the other part of the weakly correlated sentence features still need to be discovered.
Secondly, in this paper, we assume the controlling expression has no side effect.However, in a few cases, a controlling expression has a side effect.In this case, the execution instance of the controlling expression may have some impact successors.For example, if there is a controlling expression if(x > y++) in the original program, then the execution result of y++ will be changed when it is executed a test case, so that it impacts subsequent statement instances containing y variables.In this situation, the methods mentioned in Sections 3.1.4,3.2.4,3.3.2,3.4.2and 3.5.3are no longer applicable, and the corresponding algorithm needs to be redesigned.
In the future, we also plan to predict the statement mutation scores with the prediction model established by other programs.In this case, the users can train a prediction model with the data records from other programs beforehand.Using this pre-trained model, users can directly predict the statement mutation scores of the current program.

Figure 4 .
Figure 4.The performance profile of the logistic regression for predicting the statement mutation scores.

Figure 5 . 1 )
Figure 5.The performance profile of the random forest for predicting the statement mutation scores.5.4.3.Artificial Neural Networks (1) Introduction to neural network

Figure 6 . 1 )
Figure 6.The performance profile of the neural network for predicting the statement mutation scores.5.4.4.Support Vector Machines (1) Introduction to support vector machine Given a set of n training instances x 1 , x 2 , • • • , x n , the goal of support vector machine is to find a hyperplane that separates the positive and the negative training instances with the maximum margin and minimum misclassification error.The training of support vector machine is equivalent to solving the following optimization problem:

Figure 7 .
Figure 7.The performance profile of the support vector for predicting the statement mutation scores.

Figure 9 .
Figure 9.Comparison of the Brier scores of the five machine learning models for schedule.c.

Figure 10 .
Figure 10.Comparing the real statement mutation scores and the predicted statement mutation scores in schedule.cby the artificial neural network.

Figure 11 .
Figure 11.Comparing the real statement mutation scores and the predicted statement mutation scores in tcas.c by the artificial neural network.

Figure 12 .
Figure 12.The structure of Automated Analysis Tool.

Table 1 .
Main differences among mutation reduction methods.

Table 2 .
The execution history of the test cases.
} Definition 4. Original execution path of the test case.
The execution impact graph G 2 formed when Program 1 is executed by test case 2.
B Figure 1.The execution impact graph G 1 formed when Program 1 is executed by test case 1.
= ∅ and P 2 7,t 3 = ∅, we can use Formula (9) to calculate the generalized value impact set of the statement s 7 .

Table 3 .
The first five impact sets of all statement instances in Program 1.

Table 4 .
The first five impact sets of all statements in Program 1.

Table 5 .
The first five impact factors of all statements in Program 1.

Table 6 .
The execution number of each statement in Program 1.

Table 7 .
Computation for information hidden factor of each statement in Program 1.

Table 8 .
Average Brier scores for the logistic regression model.

Table 9 .
Average Brier scores for the random forest model.

Table 10 .
Average Brier scores for neural network models.

Table 12 .
Resampled Brier score for the support vector machine model.

Table 13 .
Average Brier scores for the symbolic regression.