On the Effectiveness of Using Elitist Genetic Algorithm in Mutation Testing

Manual test case generation is an exhausting and time-consuming process. However, automated test data generation may reduce the effort and assist in creating an adequate test suite that satisfies predefined goals. The quality of a test suite depends on its fault-finding behavior. Mutants have been widely accepted as simulated artificial faults that behave similarly to realistic ones for test data generation. In prior studies, the use of search-based techniques has been extensively reported to enhance the quality of test suites. Symmetry, however, can have a detrimental impact on the dynamics of a search-based algorithm, whose performance strongly depends on the evolving population breaking the "symmetry" of the search space. This study implements an elitist Genetic Algorithm (GA) with an improved fitness function to expose maximum faults while also minimizing the cost of testing by generating less complex and asymmetric test cases. It uses a selective mutation strategy to create low-cost artificial faults that result in fewer redundant and equivalent mutants. For evolution, reproduction operator selection is repeatedly guided by the traces of test execution and mutant detection, which decide whether to diversify or intensify the previous population of test cases. An iterative elimination of redundant test cases further minimizes the size of the test suite. This study uses 14 Java programs of significant sizes to validate the efficacy of the proposed approach in comparison to Initial Random tests and a widely used evolutionary framework in academia, namely Evosuite. Empirically, our approach is found to be more stable, with significant improvement in the test case efficiency of the optimized test suite.


Introduction
Test data generation is a critical, labor-intensive, and time-consuming process that significantly affects software quality. However, automation can minimize the effort and may produce effective test cases satisfying specific objectives. Because test data generation is a combinatorial problem, which is computationally intractable, different search-based algorithms have been proposed and used for generating the test suite [1-5]. These algorithms include the Genetic Algorithm (GA), Particle Swarm Optimization (PSO), and Ant Colony Optimization (ACO), among others. Initially conceived by Holland [6], GA [7,8] is frequently adopted by researchers and provides an evolved test suite through iterative searching of the search space. In each iteration, the fitness of the test suite is measured, and for convergence, it must satisfy some test requirements, e.g., branch coverage, statement coverage, or path coverage. Mutation testing is a type of structural software testing that inserts a fault into the source code of a program and makes it faulty. These faults mimic the mistakes a programmer can make while writing the program, and the faulty version is known as a mutant. Test data are required to reveal a fault in the program. Here, test data are the inputs to the program, and executing them against a mutant indicates whether the fault is exposed or not. Fault exposure is also known as mutation coverage. Prior studies [9,10] suggest that mutation coverage yields a superior test suite compared to other coverage measures and better guides the selection mechanism of test cases for evolution.
Mutants are widely accepted, simulated artificial faults that behave similarly to realistic ones [11,12] and are used for test data generation [9,13]. They are created by the systematic injection of faults using predefined mutation operators [14]. Mutation testing was initially suggested by DeMillo [15] and later explored by different researchers [16-18]. Execution of a test case (test inputs) against these faults results in the adequacy score of that test case. This score is also known as the mutation score (%), which is measured using the killed mutants (covered faults, KM) and the total non-equivalent mutants (M) and is expressed as |KM| × 100 / |M|. However, some of the mutants are not recognized by any of the test cases because they do not differ behaviorally from the original program. Such mutants are referred to as equivalent and adversely impact test suite performance [19,20]. The problem of identifying equivalent mutants has also been addressed in prior studies [21-23]. Apart from several benefits, including a reduction in the search space and providing a framework for test suite quality assessment, mutation testing also suffers from a high computational cost. In the last few decades, researchers have tried to minimize this cost using various techniques that usually follow do-fewer, do-smarter, or do-faster strategies [17,18,24-33]; more studies can be traced in a recent survey [34]. Selective mutation, which is used in the current study, is one of these approaches; it generates mutants by applying only some operators from the large set of mutation operators [14] (Section 2.1).
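The mutation score formula above is straightforward to compute once the killed and non-equivalent mutant counts are known. The following is a minimal illustrative sketch (the method and class names are ours, not from the paper's tooling):

```java
// Minimal sketch of the mutation score: MS = |KM| * 100 / |M|, where KM is the
// set of killed mutants and M the set of non-equivalent mutants.
public class MutationScore {
    public static double score(int killedMutants, int nonEquivalentMutants) {
        if (nonEquivalentMutants <= 0) {
            throw new IllegalArgumentException("mutant set must be non-empty");
        }
        return killedMutants * 100.0 / nonEquivalentMutants;
    }
}
```

For example, a test case that kills 5 of 10 non-equivalent mutants scores 50%, matching the worked example in Section 2.3.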
Search-based techniques combined with mutation testing have also been extensively researched and can be traced in the relevant surveys [35-39]. According to these studies, GA is the technique most preferred by the research community. Literature in the test data generation field is reviewed and summarized in Table 1 in chronological order. It records publication details, tool availability, type of mutants created, search technique implemented for test generation, population size, and fitness function.
Initially, Baudry et al. [40,41] reported the natural killing of 60% of mutants using the first set of component test cases. However, to detect more hard-to-kill mutants (90% of mutants), they suggested using GA to iteratively evolve the test cases. The same fitness function (mutation score) was used by other researchers as well to find a mutation-adequate test suite [42-45]. GA with mutation testing has also been applied to finite state machines (FSM) by Molinero et al. [46] and Nilsson et al. [47].
Later, a new idea for mutant identification and formulating the fitness function was suggested by Bottaci [48] and implemented in [49,50]. Following a related idea of fitness evaluation, Fraser [9,10] employed GA with weak mutation testing in a tool known as Evosuite, which automatically generates test cases with assertions. The performance of this tool is further demonstrated and compared in other prominent studies [51-55].
Considering the object's state, Bashir and Nadeem [56] proposed a novel fitness function, which restricts the search process by reviewing the tests that have either obtained the desired state or require more method calls. The authors in [57-59] further extended the work, which resulted in the development of a tool named eMuJava. They also compared the relative performance of their proposed variation of GA with traditional GA using ten Java programs totaling 1028 LOC. In their study, the improved GA converged in 373 iterations and created 9325 test cases that detected 93.5% of mutants for the triangle program. C++ mutation operators were also used by Perez et al. [60-62] for mutant selection and test suite improvement. Higher-order mutants were also examined for test creation using GA by Ghiduk [63]. The study in this paper expands our previous work [45,64] dealing with GA and mutation testing. However, this work presents an improved fitness evaluation, the incorporation of elitism, and a performance comparison with existing techniques. It implements a variant of GA by effectively blending the benefits of mutation testing for non-redundant test suite generation, followed by a novel fitness function that considers test case complexity in terms of time-steps along with high fault exposure. Test case complexity considerably impacts the cost of testing when finding faults. In this exposition, the performance of the proposed approach is compared with a popular testing tool, i.e., Evosuite [10], as well as with Initial Random tests. The contributions of this study are:

•	Implementation of GA using the ideas of diversification and intensification, along with the integration of elitism and a mutation-based fitness function. It addresses the problem of a costly test suite with fault-revealing abilities.

•	Comparison of the effectiveness, efficiency, and cost of the proposed approach with state-of-the-art techniques on 14 Java programs using different evaluation metrics.

•	Analysis of the impact of other artificial faults on the effectiveness of the generated test suite.
This paper is organized as follows: Section 2 presents the basic terminology, the proposed approach, and an illustration of its execution. Section 3 describes the experimental setup and the mutants used, and Section 4 discusses the results of the evaluation along with some limitations. Section 5 concludes with the significant findings of this study.

(a) Software Testing: It is the process of executing a program with the intent of finding faults [65]. The actual output and the expected output of executing a test case are compared; if they differ, a fault is said to be present.

(b) Test Case: A test case is an input to the program with its expected output and is used for testing the functionality of the program [65]. A collection of test cases is called a test suite; e.g., for a single-input problem, a test case can be {T1 = 7}, while for a two-input problem, {T1 = (8, 4)}.

(c) Mutation Testing: It is a method of software testing that seeds faults or errors in the program, with the precondition that the altered program remains syntactically correct [16,18].

(d) Mutants: A faulty version of a program is known as a mutant. A mutant with a single fault is characterized as a first-order mutant, while those with more than one fault are higher-order mutants.

(e) Mutation Operators: Mutants are generated using metagenic rules [14,66] which seed the fault in the program systematically. These metagenic rules are termed mutation operators in mutation testing (Figure 1).

(f) GA and its Operators (Figure 2): GA is an evolutionary algorithm based on the concept of the natural genetics of reproduction [6-8]. In an iteration of execution, it starts with a random initial population P, followed by fitness evaluation of P, selection, and reproduction (crossover and mutation), and stops re-iterating when an optimal solution is found (Figure 3). Each individual in the population is represented as a chromosome (a sequence of genes) and, for the binary-encoded GA used in this study, encoded in binary. For the evolution of individuals, crossover combines two individuals and produces two new individuals (offspring); mutation, on the other hand, flips a bit in the gene of a chromosome [67]. In this work, a population of GA is mapped to the set of test cases, and a chromosome is mapped to the concatenated value of test inputs.

Description of Proposed Approach
The flow chart in Figure 4 illustrates the functionality of our proposed approach tdgen_gamt, which begins with reading the source code of the original program and outputs the list of methods in the original program and the number of input variables of the method under test for random initialization of the population. Here, the population refers to a collection of test cases that are forwarded for fitness evaluation over the artificial mutants (the faulty versions of a subject program), and the fault matrix (Figure 5) also gets updated in each iteration. The approach then proceeds with the selection of parent test cases for reproduction, which in turn applies intensification and diversification if not converged. We perform intensification (crossover) when there is a chance of improving a test case locally; otherwise, we perform diversification in the form of mutation, which intends to diversify the solution globally. At the end of each iteration, if tdgen_gamt converges, it stops functioning and provides the non-redundant test suite with mutation coverage information. As shown in Algorithm 1, our approach tdgen_gamt creates a random solution of test inputs, i.e., pop, which is initially empty. In this paper, each test input can have a value in the range [−10, 110].
Here, we use a binary-coded Genetic Algorithm and perform crossover and mutation on the binary string. Therefore, the integer test inputs are converted to binary using (8 × number of inputs) bits for reproduction (8 bits are sufficient to represent each test input in the range [−10, 110]). There are several variants of chromosome encoding in GA, e.g., Gray, binary, and real, and each has its own advantages and disadvantages [68-70]. Binary encoding is beneficial for introducing a sudden change in the population of solutions, which is desirable in the current study to diversify the population and increase the chances of detecting live mutants. We evaluate the quality of the test suite by executing it against the mutants (each test case is executed over each mutant in the set) (Section 2.2.1), and the fitness of each test case is recorded in the fault matrix (Figure 5). For example, suppose we have n mutants (M1-Mn) and 4 test cases uniquely identified by test case IDs (T1-T4). Each test case has its own fitness, complexity, and mutant detection information in the form of 0 and 1, which express a live and a killed mutant, respectively.
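The 8-bit-per-input encoding described above can be sketched as follows. The paper does not state exactly how the signed range [−10, 110] is mapped onto 8 bits, so the offset-by-10 scheme (shifting values into the unsigned range [0, 120]) is our assumption for illustration:

```java
// Hypothetical sketch of the 8-bit-per-input chromosome encoding. Test inputs in
// [-10, 110] are shifted by +10 into [0, 120] so each fits in 8 unsigned bits;
// the exact mapping is an assumption, not taken from the paper.
public class Chromosome {
    static final int OFFSET = 10; // shifts [-10, 110] into [0, 120]

    // Concatenate the 8-bit encodings of all test inputs into one bit string.
    public static String encode(int[] inputs) {
        StringBuilder bits = new StringBuilder();
        for (int v : inputs) {
            String b = Integer.toBinaryString(v + OFFSET);
            bits.append("00000000".substring(b.length())).append(b); // left-pad to 8 bits
        }
        return bits.toString();
    }

    // Decode the bit string back into integer test inputs.
    public static int[] decode(String bits) {
        int n = bits.length() / 8;
        int[] inputs = new int[n];
        for (int i = 0; i < n; i++) {
            inputs[i] = Integer.parseInt(bits.substring(8 * i, 8 * i + 8), 2) - OFFSET;
        }
        return inputs;
    }
}
```

For the two-input test case {T1 = (8, 4)} from Section 2, this yields the 16-bit chromosome on which crossover and mutation operate.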

Algorithm 1: The Proposed Approach tdgen_gamt
Input: Subject program under test S, initial population size p_size, iteration limit itr_limit, selection criteria sel_parent%, generated non-equivalent mutants M
Output: Non-redundant test data set T

Fitness Evaluation
In this study, we aim to generate test cases with maximum effectiveness (measured in terms of mutation score, the fault-finding capability of a test case) along with minimum test case complexity (TCC). Therefore, the inverse of TCC appears as part of the fitness evaluation (Equation (1)). Furthermore, there is no correlation between test case effectiveness and TCC, since effectiveness depends only on the value of the test case that is used for program execution and mutant detection. Algorithm 2 explains the calculation of the fitness function and mutation score. Here, TCE (test case effectiveness) is measured using the fault-finding capability of a test case, i.e., the mutation score (MS), which has been frequently used in the literature. TCC (test case complexity) is the complexity of a test case in terms of time-steps, measured in microseconds using a built-in Java library (java.lang.reflect.Method). Note that TCC is neither source code complexity nor cyclomatic complexity; the latter is used for path coverage, while we generate tests for mutant coverage. We assume that a complex test case may take more time for execution than a less complex one. Two test cases may detect the same faults and have the same mutation score but still differ in complexity; e.g., a test case with the value 100 will run a for-loop 100 times and take more steps to execute than another test case with the value 1. In this case, the test case with the lower execution time-steps is selected first to be kept in the fault matrix if both detect the same faults. The designed fitness function intends to select better tests with minimum cost. Whenever fitness is evaluated, redundant tests are also removed, and consequently the fault matrix gets updated.
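Equation (1) itself is not reproduced in this excerpt; the text only states that fitness combines TCE (mutation score) with the inverse of TCC. The sketch below therefore uses an assumed additive form, where the 1/TCC term acts mainly as a tie-breaker between equally effective tests, consistent with the selection rule described above:

```java
// Hedged sketch of the fitness evaluation. The text says fitness combines test
// case effectiveness (TCE, the mutation score) with the inverse of test case
// complexity (TCC, execution time-steps in microseconds). The additive form
// below (TCE + 1/TCC) is an assumption, since Equation (1) is not shown here.
public class Fitness {
    // killed: mutants detected by this test case; total: non-equivalent mutants;
    // tccMicros: measured execution time-steps in microseconds (must be > 0).
    public static double evaluate(int killed, int total, double tccMicros) {
        double tce = killed * 100.0 / total; // mutation score in percent
        return tce + 1.0 / tccMicros;        // 1/TCC rewards cheaper tests
    }
}
```

Under this form, two tests with the same mutation score are ranked by complexity: the one with fewer time-steps gets the higher fitness, as the selection rule above requires.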
A redundant test case detects only faults previously identified by another test. Such test cases do not contribute to testing and only increase the cost [71]. Let us take an example to understand the concept. Assume there are two test cases T1 and T2, where T1 identifies faults M1 and M2. If test case T2 only detects fault M2, then the execution of T2 is not required, because both faults can be killed by executing only T1; therefore, test case T2 is redundant and can be deleted from the test suite without losing the effectiveness of the test suite. Removal of such tests leads to an efficient test suite. The pseudocode in Algorithm 3 illustrates how redundant tests are identified and removed.

Algorithm 3: RemoveRedundantTests (T)
Input: Set of test cases with fitness information, T
Output: Set of non-redundant test cases, T
/* T is sorted in ascending order of fitness score using the Collections.sort() function */
1 T ← Collections.sort(T);
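Since only the first line of Algorithm 3 is shown here, the following is a hedged sketch of the redundancy check it implies: scanning the tests in a fixed order (the paper sorts by fitness), a test is kept only if it kills at least one mutant not already covered by the tests kept so far. The set-based representation of kill information is our illustration, not the paper's data structure:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of redundant-test elimination: a test is redundant if every mutant it
// kills is already killed by the tests kept so far (greedy, order-dependent).
// Each test is represented here only by the set of mutant IDs it kills.
public class RedundancyFilter {
    // Returns the indices of the non-redundant tests, scanning in the given order.
    public static List<Integer> keepNonRedundant(List<Set<Integer>> killsPerTest) {
        Set<Integer> covered = new HashSet<>();
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < killsPerTest.size(); i++) {
            // keep the test only if it kills at least one not-yet-covered mutant
            if (!covered.containsAll(killsPerTest.get(i))) {
                kept.add(i);
                covered.addAll(killsPerTest.get(i));
            }
        }
        return kept;
    }
}
```

On the example above (T1 kills {M1, M2}, T2 kills {M2}), only T1 is kept. Note the result depends on scan order, which is why Algorithm 3 sorts T before filtering.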

Diversification vs. Intensification for Reproduction
While generating solutions, search-based (evolutionary) algorithms perform two operations, i.e., intensification and diversification [72-74]. In intensification, the algorithm searches the neighborhood of the current solutions and exploits them by selecting the best of these local solutions. Diversification, however, explores the search space globally and tries to diversify the solutions. In this study, during every successive iteration, the current population of tests evolves based on whether it might be improved locally (intensification) or needs diversification globally. In GA, intensification favors the current population and performs crossover to find better offspring in terms of fitness [72]. Two chromosomes exchange their properties at random positions and create two new offspring. In this study, we perform uniform crossover (Algorithm 1) on the parent population with a 0.5 random probability (this type of crossover is recommended for chromosomes with moderate or no linkage among their genes [67], which suits this study). We also ensure that each pair of test cases participates in crossover only once; therefore, n parent test cases generate n new offspring, thus reducing the time and space complexity. Each offspring is then evaluated for fitness using Equation (1). We then check whether the offspring are able to kill some live mutants or are better than their parent population. These crossover test cases are merged with the previous population, and the process is repeated until convergence. However, if the crossover test cases fail to kill any live mutants or are not better than their parents, then diversification in the form of one-point mutation (Algorithm 1) is preferred, to increase the probability of detecting live faults and to reduce the risk of re-identifying an already killed fault. Mutation is applied to all crossover test cases. Here, a single bit is flipped from 0 to 1 or vice versa at a random position between 0 and the length of the gene in a chromosome. The intention behind this strategy of intensification and diversification is to iteratively improve the effectiveness of the test suite. We present an example (Figure 6) for ease of understanding (a detailed example is given in Section 2.3). Suppose we have two test cases T1, T2 with their killed mutants (M1, M2, M6) and (M2, M4), respectively. Consider case 1: parent test cases T1, T2 are improved via intensification (crossover); the offspring from crossover (C1, C2) are not more effective than the parent test cases, but C1 and C2 kill the live mutants M8 and M5, respectively. This makes C1 and C2 valuable in the entire population. Meanwhile, in case 2, C1 and C2 do not enhance the effectiveness of the complete test suite; therefore, C1, C2 are diversified using mutation, which may produce effective test cases.
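The two reproduction operators described above can be sketched on the binary chromosomes of Section 2. The explicit `Random` parameter is an illustration detail (it makes the sketch testable), not part of the original algorithm:

```java
import java.util.Random;

// Sketch of the two reproduction operators: uniform crossover with 0.5
// gene-exchange probability (intensification) and a one-point bit flip
// (diversification). Chromosomes are binary strings as in Section 2.
public class Reproduction {
    // Uniform crossover: each bit position is swapped between offspring with p = 0.5.
    public static String[] uniformCrossover(String p1, String p2, Random rnd) {
        StringBuilder c1 = new StringBuilder(p1), c2 = new StringBuilder(p2);
        for (int i = 0; i < p1.length(); i++) {
            if (rnd.nextDouble() < 0.5) {       // exchange this gene
                c1.setCharAt(i, p2.charAt(i));
                c2.setCharAt(i, p1.charAt(i));
            }
        }
        return new String[] { c1.toString(), c2.toString() };
    }

    // One-point mutation: flip a single bit at a random position.
    public static String mutate(String chromosome, Random rnd) {
        int pos = rnd.nextInt(chromosome.length());
        StringBuilder s = new StringBuilder(chromosome);
        s.setCharAt(pos, s.charAt(pos) == '0' ? '1' : '0');
        return s.toString();
    }
}
```

Crossover recombines material already present in the parents (local search), while the bit flip can introduce a value no parent carries, which is exactly why it is used here for global diversification.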

Population Replacement Strategy and Elitism
In a generational GA, all the individuals of the current population are removed, and new individuals for the next population are derived by applying reproduction to the current population. By doing so, the algorithm may lose the best individuals due to its stochastic nature. Therefore, some of the best solutions are retained as elitist solutions, which guarantees that the quality of the solutions improves iteratively [67,75,76]. We use the benefits of elitism to sustain all non-redundant individuals of the previous population, which can be 10%, 20%, or 50% of the entire population depending on the fault-finding behavior of the test cases during execution (Algorithm 1). Usually, GA works on the principle of human reproduction: the older and less fit solutions die out in each iteration, and some of the fittest solutions are kept as elitist solutions. With time, these solutions lose their fitness and are replaced with new ones. However, in testing, a test case has the same fitness throughout the process, because fitness is evaluated on all the mutants and no new mutant is added during the process. Thus, if we obtain a test case that is good at finding faults, we can preserve it and cut the cost of re-generating a similar test case that was already created in a previous iteration. Fitness evaluation for such test cases is also not required in successive iterations; we can reuse the preserved fitness. That is why we save all the non-redundant and valuable test cases as elitist solutions in each iteration regardless of their fitness (best or worst). A test case in an adequate test suite is considered effective and relevant if it succeeds in killing a resistant or hard mutant, irrespective of its fitness [77]. This elitism strategy may minimize the cost of finding the optimum solutions in a smaller number of iterations.
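The fitness-reuse argument above (the mutant set is fixed, so a test case's fitness never changes) amounts to caching evaluations. The sketch below illustrates that idea; the cache keyed by the chromosome's bit string is our assumption, not the paper's implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of elitism with cached fitness: because the mutant set is fixed, a test
// case's fitness never changes, so preserved (elite) test cases are never
// re-evaluated in later iterations. The string-keyed cache is illustrative only.
public class EliteArchive {
    private final Map<String, Double> fitnessCache = new HashMap<>();

    // Returns the cached fitness if present; otherwise evaluates once and stores it.
    public double fitnessOf(String chromosome,
                            java.util.function.ToDoubleFunction<String> evaluate) {
        return fitnessCache.computeIfAbsent(chromosome,
                c -> evaluate.applyAsDouble(c));
    }

    // Number of distinct chromosomes actually evaluated so far.
    public int evaluatedCount() {
        return fitnessCache.size();
    }
}
```

A preserved test case thus costs nothing in later generations, which is why the approach can afford to keep all non-redundant individuals rather than only the top few.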

Convergence
In general, search-based algorithms stop functioning at convergence based on some criteria, such as reaching a time limit, an iteration limit, or a coverage target. The proposed approach tdgen_gamt converges when all the non-equivalent mutants are killed by the test suite, i.e., when it achieves a 100% mutation score. However, it may be possible that some mutants are not killed even after a significant number of iterations; such mutants are too hard to identify and might be killed only by specific tests, e.g., in the case of an equilateral triangle (all inputs must be equal). To handle such a situation, another criterion is also defined, i.e., the iteration limit. The approach stops functioning when either of the criteria is satisfied (Algorithm 1).

A Detailed Example
To show how the approach works for test generation, we take an example of generating tests for a single-input problem (Table 2). Let the population size be 8 and the number of non-equivalent mutants be 10 (M1-M10). Initially, eight test cases (T1-T8) are randomly created and executed against all the mutants (M1-M10). For each test case (T1-T8), fitness and mutation score are evaluated. In iteration 1, test T1 detects 5 mutants out of 10; therefore, its mutation score is evaluated as 50%. We also check the status of each test case as redundant (R) or non-redundant (N). After fitness evaluation, the best tests (T1, T7) are selected to perform crossover (intensification) and generate two new offspring, i.e., C1, C2. After their fitness evaluation, it is found that test case C1 is redundant but C2 is non-redundant. In this case, we say that intensification is worthwhile and adds one new test case to the population. At the end of the iteration, the non-redundant crossover test cases are merged with the previous non-redundant solution, and a total of five test cases are obtained (T1, T5, T6, T7, C2). We then check for convergence: the mutation score of the complete test suite is 90%, which is less than 100%. We then re-iterate the complete process. At the beginning of each iteration, the size of the population is maintained, so 3 more random test cases are added in iteration 2.
We then perform crossover on T1 and T7, which leads to C3, C4. Here, it is found that both crossover test cases are redundant and could not kill any live mutant; in this case, intensification could not produce valuable test cases. Therefore, we try to diversify the crossover population, i.e., C3, C4, using mutation, with the possibility of obtaining the desired test cases. This leads to MT1, MT2, of which only MT1 is found to be non-redundant. The leftover live mutant M10 is killed by this new offspring MT1. Now all the non-redundant new offspring and the previous population are kept together, and it is found that all the mutants are successfully detected by these test cases (T1, T5, T7, C2, MT1). At convergence, our approach tdgen_gamt stops and returns the non-redundant test suite.

Experimental Setup
This section explains the experimental settings of the different scenarios presented in this study. First, it presents the subject programs (Section 3.1), along with how mutants are generated (Section 3.2). Then, we discuss the evaluation metrics used to measure the efficacy of our proposed approach.

Subject Programs under Test
We conduct the empirical experiment on 14 Java programs used widely in mutation testing and test data generation [10,18,35,36]. The number of inputs varies between 1 and 6 in the selected programs. S2 and S5 have the minimum number of inputs, i.e., one. Furthermore, the S8 and S9 programs have the largest sets of inputs, i.e., five and six, respectively. The LOC of the selected programs ranges between 19 and 153, and their specifications are listed in Table 3. In this study, test cases are only generated for the considered method, which is the main calling method in the corresponding subject program.

Mutants Used
To check the adequacy of the test data, artificial faults are created using MuJava [84] instead of real faults. The procedure of mutant generation is illustrated in Figure 7. MuJava creates method-wise mutants (for the methods in the subject program), and in this study test cases are generated for the driver method (main calling method). Inspired by the reduced cost of selective mutation, only a few mutation operators (SDL-Statement Deletion, ODL-Operator Deletion, CDL-Constant Deletion, and VDL-Variable Deletion) are incorporated for mutant generation (Figure 1), because not all operators are cost-effective; many produce a higher number of equivalent and redundant mutants [20]. According to Untch [66], these mutants are powerful, generate fewer equivalent mutants, and require highly effective tests for fault detection [85,86]. It is also noticed that ODL mutants are a super-set of VDL and CDL mutants; therefore, only SDL and ODL mutants are kept for further use. All the mutants, including SDL and ODL, are again analyzed for equivalence detection using 1000 random test cases, and live mutants are further analyzed manually for equivalent mutant detection. In this study, we created two types of artificial faults for the method under test of each subject program. The first set contains the mutants (SDL, ODL) used for test case creation, while the other set (traditional mutants) is used for evaluating the results and for comparison with the state-of-the-art techniques (Figure 1). Traditional mutants [14] are generated and found to be 80% more numerous than the set-1 mutants. The executable mutants for the considered method under test of each subject program are listed in Table 4. The total mutants generated by the SDL, ODL, and traditional operators are 283, 385, and 3452, respectively; out of these, 5%, 2%, and 7% of the mutants are found to be equivalent for each operator, respectively (Table 5). Some of the mutants are not executable: they throw an exception or get stuck in infinite loops. Such mutants are deleted before test generation. Non-executable mutants are 3%, 1%, and 6% for SDL, ODL, and traditional mutants, respectively (Table 5).

Evaluation Metrics
To compare the effectiveness of our approach with Evosuite and Initial Random tests, different evaluation metrics are considered and listed in the GQM template (Table 6) [27,87]. Answers to the questions are discussed in Section 4.

Experiments
We apply tdgen_gamt to generate and select only non-redundant test inputs and repeat this 50 times to alleviate the consequences of random variation. To quantify the efficacy of our approach, various statistical measures are recorded, including mutation score, test suite size, and test case efficiency. Experiments are carried out on a 32-bit system with a Core i7 processor and 4 GB RAM. To automate the approach tdgen_gamt, the Eclipse IDE for Java Developers (Mars version) is used with JDK 8.

Results and Discussion
This section discusses the results of each research goal listed in Table 6. The performance of our approach is also evaluated and discussed in Section 4.1. To evaluate the efficacy of the proposed approach, it is compared with Evosuite and Initial Random tests in terms of various aspects, i.e., test suite effectiveness, test suite size, and test case efficiency. For a fair comparison, Evosuite test cases are executed only for the method under test.

Performance of the Proposed Approach tdgen_gamt
Our proposed approach creates the optimized test suite using the elitist GA (Algorithm 1). We collected all the results related to the achieved mutation coverage, test generation time (in seconds), number of iterations required, final test suite size, and the total number of fitness evaluations performed until convergence, detailed in Table 7. For each of the problems, our approach takes [0.1, 10.2] s on average for creating a test suite, successfully detecting 88-100% of mutants. Fitness evaluations are also recorded to show the number of test cases evaluated until convergence. On average, only 10 test cases out of 238 tests are found to be non-redundant and valuable. This shows that our approach generates a low-cost test suite by repetitive deletion of obsolete tests for the method under test of each subject program. tdgen_gamt takes advantage of intensification and diversification to find the solution in fewer iterations and less time. This seems to be useful in the area of search-based optimization by balancing between the control parameters.

Effectiveness Comparison with Evosuite and Initial Random Tests
This section evaluates whether the proposed approach with intensification and diversification can guide the search process to obtain the desired outcome. Evosuite is a search-based evolutionary test suite generation tool implemented by Fraser and Arcuri [10], where tests are created using weak mutation coverage criteria. It provides no information about the number of mutants generated or the mutation score achieved by the resultant test suite. On the other hand, our approach tdgen_gamt reports the above-listed measures along with test suite size and fitness evaluations. Both approaches also differ in fitness evaluation and mutation operator selection. Evosuite calculates the fitness using the distance to the calling function, the distance to the mutation, and the mutation impact, while in tdgen_gamt it is computed using test case effectiveness and complexity in terms of time-steps. Evosuite internally generates mutants for eight mutation operators [10] (Table 8), while we create mutants only for the delete operators (SDL, ODL).

[Table 8 lists the mutation operators used internally by Evosuite; the entries recoverable from this excerpt include Delete Call and Replace Variable.]
Taking only the average mutation score into consideration, the results show that Evosuite tests (with a larger test suite size than our proposed mechanism) outperform the other two by 9% (our approach) and 17% (Initial Random tests) with respect to all traditional mutants. Evosuite needs on average 12 test cases to identify 96.35% of traditional mutants, while our approach requires only 10 test cases to kill 87.83% of traditional mutants. As stated in [88], a technique is recognized as more effective when it generates tests killing more faults using an equal-size test suite. Considering the size of the test suite along with the mutation score, we also analyze the effectiveness of each approach and name this measure test case efficiency (TCEF) (Equation (2)). The discussion related to TCEF is given in Section 4.3. Overall, out of 2993 executable traditional mutants, Evosuite could detect 2884 mutants and tdgen_gamt could kill 2629 mutants, while Initial Random tests killed only 2383 mutants. For some of the programs (S8, S9, S10), Initial tests could detect more traditional mutants than tdgen_gamt. For such cases, removing redundant tests is found to be harmful, as a test case that is redundant for one mutant set may be non-redundant for another. Therefore, Initial tests perform better for traditional mutants for S8-S10. On the other hand, tdgen_gamt achieves high fault exposure when considering SDL and ODL faults.
We run each approach 50 times and record the average and median mutation scores achieved against traditional as well as SDL and ODL mutants (Table 9). The standard deviation is also measured to show the variability of the data around the mean. For each program, the standard deviation is lowest for tdgen_gamt, which suggests that our approach behaves consistently and produces stable results regardless of how many times it is executed. Meanwhile, for Evosuite, the mutation score deviates from the mean in the range [0, 9]. For our approach, the average and median are equal for most programs, which further demonstrates that it produces the test suite in a symmetric fashion. During experimentation, we calculate the detection rate of each mutant and analyze which approach succeeds in detecting stubborn mutants [89]. The results (Table 10) reveal that, out of 2993 executable traditional mutants, approximately 2-3% of the artificial faults could never be killed by any approach, which further indicates that our approach performs well at recognizing traditional mutants. Meanwhile, for SDL and ODL faults, our approach detected approximately all of the 634 executable mutants. We evaluate efficiency and cost by recording the statistics for test case efficiency (TCEF, Equation (2)) and test suite size (TSS) (Table 11).
Test Case Efficiency (TCEF) = TSE / TSS    (2)

Here, TSE (test suite effectiveness) is the effectiveness of the complete test suite, i.e., its mutation score, and TSS (test suite size) is the number of test cases in the resultant test suite.
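As a worked illustration of Equation (2), the averages reported above (96.35% killed with 12 tests for Evosuite, 87.83% killed with 10 tests for tdgen_gamt) can be plugged in directly. The sketch below uses our own class and method names, not the tools'; it shows that the smaller suite is the more efficient one per test case.

```java
// Minimal sketch of Equation (2): test case efficiency is the
// mutation score of the suite divided by the number of test cases.
public class Tcef {
    // tse: test suite effectiveness (mutation score, in percent)
    // tss: test suite size (number of test cases)
    static double tcef(double tse, int tss) {
        if (tss <= 0) throw new IllegalArgumentException("suite must be non-empty");
        return tse / tss;
    }

    public static void main(String[] args) {
        // Averages reported above: a smaller suite with a slightly lower
        // mutation score is still more efficient per test case.
        System.out.println(tcef(96.35, 12)); // Evosuite: 12 tests, 96.35%
        System.out.println(tcef(87.83, 10)); // tdgen_gamt: 10 tests, 87.83%
    }
}
```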
Figure 8 demonstrates the improvement in efficiency and the reduction in test suite size for our proposed approach over Evosuite tests. We also analyze how much the initial tests are genetically improved by tdgen_gamt. From the results (Table 11, Figure 8), tdgen_gamt generates a test suite that is 13% and 205% more efficient, and 18% and 60% smaller, than Evosuite and Initial Random tests, respectively. Removing redundant test cases greatly reduces the resultant test suite size, which in turn increases TCEF (Equation (2)). However, for some subject programs, efficiency is not improved over Evosuite and varies in the range [−15, −2]. The reason may be that Evosuite generates test cases using some of the traditional mutants, while our approach uses only SDL and ODL mutants for test creation; the test suite from each approach can be expected to perform better on the mutants it uses for fitness evaluation.
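The elimination of redundant tests mentioned above can be sketched as a greedy subsumption check: a test is dropped if every mutant it kills is already killed by the tests retained so far. This is our illustrative formulation (class names and the size-first ordering heuristic are assumptions), not the exact algorithm of tdgen_gamt.

```java
import java.util.*;

// Sketch of redundant-test elimination: a test is redundant if the
// mutants it kills are already killed by the tests retained so far.
// (Assumed greedy formulation; the paper's exact procedure may differ.)
public class ReduceSuite {
    // kills maps each test id to the set of mutant ids it kills.
    static List<String> nonRedundant(Map<String, Set<Integer>> kills) {
        List<String> kept = new ArrayList<>();
        Set<Integer> covered = new HashSet<>();
        // Consider tests that kill more mutants first.
        List<String> order = new ArrayList<>(kills.keySet());
        order.sort((a, b) -> kills.get(b).size() - kills.get(a).size());
        for (String t : order) {
            if (!covered.containsAll(kills.get(t))) { // kills something new
                kept.add(t);
                covered.addAll(kills.get(t));
            }
        }
        return kept;
    }
}
```

Note that, as discussed above, the result depends on the mutant set used: a test judged redundant against SDL/ODL mutants may still kill traditional mutants that no retained test covers.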

Stability Analysis
Reproducibility and conclusion stability are two significant characteristics of any newly proposed approach. To validate the stability of our approach, we repeat the experiments 50 times and collect the average results (Table 12). When considering full mutation coverage, our tests perform best for five subject programs and worst only for S6, S12, S13, and S14. The reason may be that these programs require specific test cases; e.g., to identify an equilateral triangle, all inputs must be equal. On average, full coverage is missed in [36, 100]% of runs for six of the 14 programs. For every program except S12, the test data always revealed more than 90% of the mutants. Across the 50 runs, the standard deviation is very small and close to 0 (Table 9); therefore, the approach can be considered stable. According to the results, our algorithm can generate a test suite with high mutation coverage.
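The summary statistics used throughout this analysis (average ± standard deviation and median over 50 runs, as reported in Tables 9 and 11) can be computed as in the following sketch; the helper class is illustrative and not part of tdgen_gamt.

```java
import java.util.Arrays;

// Sketch of the per-program summary statistics reported over repeated
// runs: average, median, and (population) standard deviation of the
// recorded mutation scores.
public class RunStats {
    static double mean(double[] xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.length;
    }

    static double median(double[] xs) {
        double[] sorted = xs.clone();          // leave the input untouched
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2]
                          : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    static double stdDev(double[] xs) {
        double m = mean(xs);
        double ss = 0;
        for (double x : xs) ss += (x - m) * (x - m);
        return Math.sqrt(ss / xs.length);      // population standard deviation
    }
}
```

A standard deviation close to 0, as observed for tdgen_gamt, means the 50 recorded scores barely spread around their mean, which is exactly the stability property claimed above.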

Selection of Control Parameters
The performance of a GA primarily depends on its parameters, i.e., population size, number of iterations, reproduction operator probability and method, and convergence criteria. We investigate the impact of these parameters on different efficiency measures. The population size denotes the number of individuals present in each population. Each individual is expressed as an array of bits, i.e., the binary-coded values of its inputs. For example, in the case of the Power program, each individual is encoded with 8 bits (one byte) per input value. Our approach is replicated for different values of the control parameters (Table 13) and repeated 50 times owing to the stochastic nature of GA. The impact on several measures, i.e., test generation time and coverage, is recorded and illustrated in Figure 9. For each of the 14 subject programs (methods under test), we execute tdgen_gamt 50 times for 24 different parameter values (24 × 14 × 50 = 16,800 runs). For each subject program, the size of the population depends on the number of input variables. Performance is evaluated for four different population sizes, and we notice that increasing the population size has a remarkable impact on test generation time: handling a very large population is more time-consuming. For the minimal population size, tdgen_gamt keeps mutation coverage and test generation time in the ranges [79, 100] and [0.06, 4] seconds, respectively, while for larger populations there is little improvement in coverage but a drastic increase in cost (Figure 9). Therefore, with a population of 10 × (number of inputs), tdgen_gamt performs better when considering both measures, i.e., time and coverage.
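The bit-string encoding described above can be sketched as follows, assuming one unsigned byte (8 bits) per input as in the Power example; the tool's actual representation may differ.

```java
// Sketch of the chromosome encoding described above: each input value
// is represented by 8 bits, so an individual with k inputs is a bit
// array of length 8k. (Assumes unsigned byte-range inputs, as in the
// Power example; tdgen_gamt's exact encoding may differ.)
public class Encoding {
    static int[] encode(int[] inputs) {
        int[] bits = new int[inputs.length * 8];
        for (int i = 0; i < inputs.length; i++)
            for (int b = 0; b < 8; b++)
                bits[i * 8 + b] = (inputs[i] >> (7 - b)) & 1; // MSB first
        return bits;
    }

    static int[] decode(int[] bits) {
        int[] inputs = new int[bits.length / 8];
        for (int i = 0; i < inputs.length; i++)
            for (int b = 0; b < 8; b++)
                inputs[i] = (inputs[i] << 1) | bits[i * 8 + b];
        return inputs;
    }
}
```

Under this encoding, bit-level crossover and mutation operate uniformly on the array, and population size can be scaled with the number of input variables as described above.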
Considering three different iteration limits, i.e., 30, 50, and 100, we find a drastic improvement in test effectiveness (mutation score) only for subject program S12 (Figure 9). This indicates that within 30 iterations, our approach successfully kills most of the mutants. However, test generation time increases continuously for the S6, S12, S13, and S14 programs. Considering both measures, we recommend running the proposed approach for 30 generations.
To select how many parent test cases participate in reproduction, we experiment with the best 25% and 50% of test cases. The results (Figure 9) reveal no improvement in mutation score except for S5. Therefore, we state that the fittest 25% of test cases are sufficient for parent selection.
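Truncating the population to its fittest 25% can be sketched as below; this is an assumed formulation of the parent selection step, with illustrative names, rather than tdgen_gamt's exact mechanism.

```java
import java.util.*;

// Sketch of elitist parent selection: keep only the fittest 25% of the
// population as reproduction candidates. (Assumed truncation selection
// with illustrative names; the tool's exact mechanism may differ.)
public class ParentSelection {
    // fitness maps each test case id to its fitness (higher is better).
    static List<String> top25Percent(Map<String, Double> fitness) {
        List<String> ids = new ArrayList<>(fitness.keySet());
        ids.sort((a, b) -> Double.compare(fitness.get(b), fitness.get(a)));
        int k = Math.max(1, ids.size() / 4); // keep at least one parent
        return ids.subList(0, k);
    }
}
```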

Limitation of the Proposed Approach
The choice of mutation operators can significantly impact test suite size and effectiveness. A test case may be redundant for one set of mutants but not for another; removing redundant tests could therefore discard valuable test cases that might be good at detecting other types of faults. To establish the merit of the proposed approach, the test cases were also executed against traditional mutants. The size and input type of the subject programs are a limitation of our study. At present, this study considers fixed integer inputs; it would be relevant to extend it to other data types, including dynamic arrays and strings. Moreover, further experimentation with varying and larger program sizes can also be examined.

Conclusions
Test data generation is a time-consuming and critical process that can be optimized using search-based algorithms satisfying specific coverage measures. In the literature, mutation coverage is considered more powerful than other measures. However, taking mutation coverage as a stopping criterion may result in a large test suite. In this paper, a GA with the objective of low-cost mutation coverage (tdgen_gamt) is implemented for generating test data. To generate tests highly qualified for fault detection, a fitness function is proposed that maximizes the effectiveness and minimizes the complexity of each test case. Each test case in the solution set is non-redundant with respect to the others in killing faults. To preserve valuable test cases in each iteration, the concepts of 'elitism', 'intensification', and 'diversification' are employed, which speeds up convergence.
A large number of experiments are performed on 14 widely used Java programs to tune the control parameters and to mitigate the effects of random generation. We compared the results with Evosuite, an automatic tool popular in academia, and with Initial Random tests. The three techniques do not perform equally in identifying mutants with low-cost test data. The major findings of this experimental work are listed below.

• Empirically, the obtained test suites detect on average 87.83% (tdgen_gamt), 96.35% (Evosuite), and 79.6% (Initial Random tests) of the executable traditional faults irrespective of test suite size. This apparent advantage of Evosuite disappears once effectiveness is measured relative to test suite size (test case efficiency): the proposed approach detects the maximum number of mutants with fewer and less complex test cases.

• We also analyze the detection rate of each fault type for each approach. The results show that tdgen_gamt performs comparably at finding stubborn mutants. Only 0.3%, 1.1%, and 1.9% of SDL-ODL mutants are identified as stubborn by tdgen_gamt, Evosuite, and Initial Random tests, respectively, indicating that tdgen_gamt kills approximately all the mutants and may readily detect stubborn ones.

• The removal of redundant tests raises the efficiency of the approach. In particular, based on the conducted study, tdgen_gamt generates a test suite that is 13% and 205% more efficient, and 18% and 60% smaller, than Evosuite and Initial Random tests, respectively. Note that a set of test cases that is redundant for one set of mutants may not be redundant for another.

• During reproduction, the crossover operation is performed only once on each parent test case; this choice of reproduction operator also lowers the time complexity of tdgen_gamt.

• The use of elitism helps in fast convergence.

• The suggested fitness function appropriately guides the search process by finding highly effective and less complex fault-revealing test cases.

• Our approach passes the stability test, failing on average in only 5% of runs to identify more than 90% of the mutants.

• The use of low-cost mutation operators (producing 80% fewer mutants than traditional operators) makes the approach easy to adopt.

Figure 2. Illustration of some mutants on a sample program.

Figure 3. …, T2, C1, C2, MT1 and MT2 are the test cases, while M1-M10 are mutants. If a test case identifies a mutant, that mutant is called killed; otherwise, it is live.

Figure 7. Step-by-step procedure for creating the mutants.

Figure 8. Efficiency and cost comparison: (a) Evosuite compared to tdgen_gamt; (b) Initial Random tests compared to tdgen_gamt.

Figure 9. Impact of different parameters on mutation score and test generation time: (a,b) population size; (c,d) number of iterations; (e,f) parent selection criteria.

Table 1. Summary of related work in test data generation.

… illustrates a few examples of such mutants. (f) Killing a Mutant: A test t ∈ T (test suite) kills a mutant m ∈ M (set of mutants) if the execution of t can distinguish the behavior of the original program s from the mutant program m.

Table 2. A detailed example of how tdgen_gamt works. At the end of Iteration 2: non-redundant test suite (mutation and elitist test cases). Here, #M: number of non-equivalent mutants; #KM: number of killed mutants; R: redundant; N: non-redundant.

Table 3. Subject programs with their methods under consideration.

Table 4. Description of executable mutants for the method under test.

Table 5. Description of equivalent and exceptional mutants for the method under test.

Table 8. List of mutation operators implemented in Evosuite.

Table 9. Effectiveness of test suites. Note: for each program, the first row displays average ± standard deviation and the second row displays the median value over 50 runs.

Table 10. Subject-program-wise stubborn mutants for the method under test.

Table 11. Efficiency and cost measures. Note: for each program, the first row displays average ± standard deviation and the second row displays the median value over 50 runs.

Table 12. Number of times full, above-95%, and above-90% mutant coverage was not achieved (results recorded as percentage of failing runs).