4.1. Subject Programs and Test Suite
Eleven C programs, whose numbers of executable statements (eLOC) range from 135 to 9932 with an average of 3137.8 eLOC, were selected for our study. The eLOC of these programs was counted by SLOCCount 2.26 [29]. These programs were selected because they have been used in fault localization experiments performed by other researchers [25,30,31,32], and they were downloaded from the Software-artifact Infrastructure Repository (SIR) [33]. These 11 programs are (1) three UNIX utilities, namely, flex, grep, and sed; (2) one real-life application, space; and (3) seven small programs usually referred to as the Siemens suite. The following is a description of these 11 programs.
Flex is a UNIX utility that generates a lexical analyzer from a lex file containing definitions, rules, and user code. The generated analyzer transforms the input stream into a sequence of tokens.
Grep is a pattern matching engine. Given one or more patterns and some input files for searching, it outputs the lines that match one or more of the patterns.
Sed is a stream editor that performs operations on the input stream, such as replacement, deletion, and insertion, either on specific lines or on the entire text.
Space is an interpreter for an array definition language (ADL); it checks the ADL grammar and specific consistency rules. If the ADL file is correct, space outputs an array data file; otherwise, the program outputs error messages.
Print_tokens and print_tokens2 are two lexical parsers used to group input strings into tokens and identify the token categories. The main difference between these two programs is that print_tokens uses a hard-coded DFA, while print_tokens2 does not.
Replace is a program for regular expression matching and substitution. It replaces any substring matched by the input regular expression with a replacement string, and outputs a new file.
Schedule and schedule2 are priority schedulers that manage three job lists. Schedule is non-preemptive, while schedule2 is preemptive.
Tcas is used to avoid airborne collisions by detecting on-board conflicts through a radar system and providing a resolution advisory, such as climb, descend, or remain on the current trajectory.
Tot_info takes a set of tables as input and outputs Kullback's information measure, the degrees of freedom, and the probability density of a distribution for each table, together with a summary of the entire set.
Since previous fault localization investigations on the effectiveness of risk evaluation formulas assume that the faulty program contains one fault, e.g., [14], we adopted the same assumption in our empirical study. Among the faulty versions of these 11 programs on SIR, some contain multiple faults and, hence, were excluded from our experiments. For the three UNIX utilities, unfortunately, all of their faulty versions contain multiple faults. Hence, we manually generated five faulty versions for each of these UNIX utilities, each containing only a single fault. The
space program has 38 faulty versions with real faults. Only seven out of these 38 faulty versions contain single faults, and hence were selected for our experiments. The
Siemens suite includes seven small C programs with various seeded faults in their faulty versions. Some of these faulty versions were excluded from our experiments as they contain more than one fault. Most of the faulty statements are conditional statements and assignment statements. For example, the faulty statement of print_tokens2 v6 is "if (isdigit(*(str+i+1)))", which should be "if (isdigit(*(str+i)))". A faulty version that is selected or manually generated for our empirical study is referred to as a "selected faulty version" whenever it is clear from the context.
Table 2 summarizes the information of these programs for our empirical study. As discussed earlier, the eLOC column reports the number of executable statements counted by SLOCCount 2.26 [29]. The third column lists the number of faulty versions used in our study, that is, the mutants generated from the UNIX utilities and the selected faulty versions of space and the Siemens suite, together with the total number of faulty versions provided in SIR. The size of each test suite, obtained from the individual "universe" test plan, is listed in the second-to-last column. In summary, our empirical study used 15 mutants for the three UNIX utilities, seven faulty versions for the space program, and 26 faulty versions for the Siemens suite. The sizes of the corresponding test suites ranged from 441 to 13,550 test cases, with an average of 3782.9 test cases per selected faulty version.
4.2. The Empirical Process
We performed an empirical study to compare the effectiveness of five maximal formula groups, ER1, ER5, GP02, GP03, and GP19. It has been proven by Xie et al. [14] that all formulas in the same maximal formula group have the same performance. Therefore, it is sufficient to select only one representative formula from each group for study; in other words, if another formula were chosen, we would still observe the same results. We chose Naish1 (abbreviated as N1) from ER1 and Wong1 (abbreviated as W1) from ER5 because they are the simplest formulas in their respective groups. We chose GP02, GP03, and GP19 from the last three groups because each of these groups contains only one maximal formula.
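As a concrete reference, the five representatives can be written down as simple functions. The following is a sketch based on commonly cited forms of these formulas (the function names and the total_failed parameter are our own); readers should verify the GP expressions against their original definitions.

```python
import math

# Spectrum counts of one statement: a_ef / a_ep are the numbers of
# failed / passed test cases that execute it, and a_nf / a_np the
# numbers that do not. total_failed is the number of failing test
# cases in the whole test suite.

def naish1(a_ef, a_ep, a_nf, a_np, total_failed):
    # N1: a statement missed by at least one failing test gets the
    # lowest possible risk; otherwise its risk is a_np.
    return -1.0 if a_ef < total_failed else float(a_np)

def wong1(a_ef, a_ep, a_nf, a_np, total_failed):
    # W1: the risk is simply the number of failing tests covering
    # the statement.
    return float(a_ef)

def gp02(a_ef, a_ep, a_nf, a_np, total_failed):
    return 2.0 * (a_ef + math.sqrt(a_np)) + math.sqrt(a_ep)

def gp03(a_ef, a_ep, a_nf, a_np, total_failed):
    return math.sqrt(abs(a_ef ** 2 - math.sqrt(a_ep)))

def gp19(a_ef, a_ep, a_nf, a_np, total_failed):
    return a_ef * math.sqrt(abs(a_ep - a_ef + a_nf - a_np))
```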
The empirical study aimed to compare the performance of these formulas in a pairwise manner. As a result, we have 10 comparison pairs. For ease of reference, we use CP1–CP10 to denote these 10 pairs, in which CP1 denotes the comparison pair between N1 and W1, CP2 for N1 and GP02, CP3 for N1 and GP03, CP4 for N1 and GP19, CP5 for W1 and GP02, CP6 for W1 and GP03, CP7 for W1 and GP19, CP8 for GP02 and GP03, CP9 for GP02 and GP19, and finally, CP10 for GP03 and GP19.
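The ten pairs are simply the 2-element combinations of the five representatives; the following sketch (all identifiers are ours) reproduces the CP1-CP10 labelling:

```python
from itertools import combinations

# The five representative formulas, one per maximal group.
formulas = ["N1", "W1", "GP02", "GP03", "GP19"]

# C(5, 2) = 10 unordered pairs; iterating in lexicographic order of
# the list above reproduces exactly the CP1..CP10 labelling.
comparison_pairs = {
    f"CP{i}": pair
    for i, pair in enumerate(combinations(formulas, 2), start=1)
}
```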
For the selected 11 C programs, there were, altogether, 48 faulty versions selected for the empirical study. Each selected faulty version was executed with respect to its corresponding test suite. All executions of the faulty programs against their test suites were performed on a Fedora 20 64-bit virtual machine with one processor and 4 GB of memory. After the execution, all information related to the program spectrum was recorded, namely a_ef, a_ep, a_nf, and a_np (the numbers of failed and passed test cases that do and do not execute the statement), for every statement in the selected faulty version. Once all information had been collected, we applied the five maximal risk evaluation formulas, each from its own formula group, to calculate the risk value of each statement and obtain the corresponding ranking lists. As a reminder, statements with the same risk value were ordered according to their statement IDs in the faulty program. For each maximal formula R in {N1, W1, GP02, GP03, GP19}, we then divided its ranking list into three mutually exclusive subsets, namely S_B^R, S_F^R, and S_A^R, containing the statements whose risk values are higher than, equal to, and lower than that of the faulty statement, respectively. We then calculated the expected faulty location, E_R, of the risk evaluation formula R using the formula E_R = |S_B^R| + (|S_F^R| + 1)/2.
Once we had calculated the expected faulty locations of these five formulas, we performed our analysis on the pairwise comparisons between them. For each comparison pair of formulas R_i and R_j, we conclude that R_i performs better than R_j on that instance of the selected faulty version when E_{R_i} < E_{R_j}. When E_{R_i} = E_{R_j}, the two formulas have the same performance. When E_{R_i} > E_{R_j}, we conclude that R_j performs better than R_i.
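The per-version computation described above can be sketched as follows, assuming the expected faulty location E_R = |S_B^R| + (|S_F^R| + 1)/2; the helper names and data layout are our own illustration, not the study's actual tooling.

```python
def expected_faulty_location(risks, faulty_id):
    """E_R for one faulty version.

    risks maps statement ID -> risk value under formula R. S_B holds
    the statements with strictly higher risk than the faulty one, and
    S_F the statements tied with it (including itself); the faulty
    statement is expected in the middle of its tie group.
    """
    r_fault = risks[faulty_id]
    size_b = sum(1 for r in risks.values() if r > r_fault)
    size_f = sum(1 for r in risks.values() if r == r_fault)
    return size_b + (size_f + 1) / 2

def verdict(e_i, e_j, name_i, name_j):
    # A smaller expected faulty location means the fault is ranked
    # earlier, so that formula performs better on this version.
    if e_i < e_j:
        return name_i
    if e_i > e_j:
        return name_j
    return "same"
```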
4.3. Experimental Results and Analysis
Table 3 lists all the results of these 10 comparison pairs. For example, "N1" in the cell of row "tcas v1" and column "CP1" indicates that N1 performs better than W1. Some cells in Table 3 are marked "same", meaning that the two risk formulas in the comparison pair have the same performance.
Since the performance of the maximal formulas in each comparison pair varies with different programs, Table 4 summarizes the results as the percentages of "R_i performs better than R_j", "R_i performs the same as R_j", and "R_j performs better than R_i" for each comparison pair and each individual subject program. For example, in the row of "CP1" and the column of flex, out of the five mutant programs, there are four mutants for which N1 performs better than W1 and only one mutant for which W1 performs better than N1. Hence, we can conclude that, among these five mutants of flex, N1 performs better than W1 more often. For ease of reference, we use the term more-frequently-better to reflect this situation. In other words, we conclude that N1 performs more-frequently-better than W1. Formally, for two risk evaluation formulas R_i and R_j, we say that R_i performs more-frequently-better than R_j, denoted as R_i ⤏ R_j, if, among all selected faulty versions, the number of times that "R_i performs better than R_j" is higher than the number of times that "R_j performs better than R_i".
From Table 4, we have the following observations.
- (1)
N1 has a higher chance to perform better than W1, GP02, GP03, and GP19.
For CP1, we observe that N1 has a higher chance to perform better than W1 for all programs except grep, sed, and schedule2. The percentage values of "N1 better than W1" range from 33.3% to 100% with an average of 74.6%, whereas those of "W1 better than N1" range from 0% to 66.7% with an average of 24.3%. Hence, we can conclude that N1 performs more-frequently-better than W1.
Similarly, for CP2, CP3, and CP4, we can also conclude that N1 has a higher chance to perform better than GP02, GP03, and GP19, respectively. However, we have to point out an interesting observation for the N1 and GP03 pair. The percentages of "N1 better than GP03" range from 0% to 80% with an average of 16.0%, whereas those of "GP03 better than N1" range from 0% to 40% with an average of 8.1%. In fact, N1 and GP03 have the same performance in an average of 75.9% of the cases.
- (2)
GP03 has a higher chance to perform better than W1, GP02 and GP19.
For CP8, GP03 has a higher chance to perform better than GP02 for all programs except flex, replace, schedule, and tcas. In fact, for replace and schedule, the chances of "GP03 better than GP02" and "GP02 better than GP03" are the same: both are 0% for schedule and 33.3% for replace. The percentage values of "GP03 better than GP02" range from 0% to 80% with an average of 44.9%, whereas those of "GP02 better than GP03" range from 0% to 100% with an average of 27.1%. Hence, we can conclude that GP03 performs more-frequently-better than GP02.
Similarly, for CP6 and CP10, we can also conclude that GP03 performs more-frequently-better than W1 and GP19, since the average percentage values of "GP03 better than W1" and "GP03 better than GP19" are 64.0% and 63.9%, respectively.
- (3)
GP02 has a higher chance to perform better than W1 and GP19.
For CP5, we observe that GP02 performs more-frequently-better than W1 for all programs except grep, sed, and schedule2. The percentage values of "GP02 better than W1" range from 0% to 100% with an average of 66.8%, whereas those of "W1 better than GP02" range from 0% to 100% with an average of 31.4%. Hence, we can conclude that GP02 performs more-frequently-better than W1.
Similarly, for CP9, we can also conclude that GP02 performs more-frequently-better than GP19.
- (4)
W1 has a higher chance to perform better than GP19.
For CP7, W1 performs more-frequently-better than GP19 for all programs except flex, grep, sed, and schedule2. In fact, for flex and grep, the chances of "W1 better than GP19" and "GP19 better than W1" are the same: both are 40% for flex and 40% for grep. The percentage values of "W1 better than GP19" range from 33.3% to 100% with an average of 66.1%, whereas those of "GP19 better than W1" range from 0% to 66.7% with an average of 27.8%. Hence, we can conclude that W1 performs more-frequently-better than GP19.
In summary, it is very likely that N1 ⤏ GP03 ⤏ GP02 ⤏ W1 ⤏ GP19.
An interested reader may argue that the result of "N1 ⤏ GP03 ⤏ GP02 ⤏ W1 ⤏ GP19" is based on the percentage of the selected faulty versions per individual program, and that we compare them by taking the average of these percentages. This may be unfair because, for tcas, there is only one selected faulty version. Only tcas v1 was selected as the faulty version since the tcas subject program is very small and most of its faulty versions are of the same type. Hence, a 100% value of "N1 performs better" in tcas may actually add some advantage to N1 compared with the other individual programs. One may then ask what the result would be if we used the total number of selected faulty versions to compare these risk evaluation formulas. The answer is that we would have the same result, as discussed in the rest of this paragraph.
Table 5 shows, for each comparison pair of formulas R_i and R_j, the total number of selected faulty versions that fall into the results of "R_i better", "R_j better", and "R_i same as R_j". For example, in the row of CP3 (the pair of N1 and GP03), there are nine selected faulty versions that fall into "N1 better", five in "GP03 better", and 34 in "N1 same as GP03". Hence, N1 performs more-frequently-better than GP03. From Table 5, we have the following observations:
- (1)
For CP1–CP4, N1 has a higher chance to perform better than W1, GP02, GP03, and GP19.
- (2)
For CP8, CP6, and CP10, GP03 has a higher chance to perform better than GP02, W1, and GP19.
- (3)
For CP5 and CP9, GP02 has a higher chance to perform better than W1 and GP19.
- (4)
For CP7, W1 has a higher chance to perform better than GP19.
As a result, we have the same observation. That is, N1 ⤏ GP03 ⤏ GP02 ⤏ W1 ⤏ GP19.
Since N1 and W1 are the representative formulas of the maximal formula groups ER1 and ER5, respectively, we can conclude further that any maximal formula in ER1 performs more-frequently-better than GP03, which in turn performs more-frequently-better than GP02, which in turn performs more-frequently-better than any maximal formula in ER5, which in turn performs more-frequently-better than GP19. That is, we have ER1 ⤏ GP03 ⤏ GP02 ⤏ ER5 ⤏ GP19. Here, we extend the meaning of "⤏" to compare two maximal formula groups. In other words, if GpA and GpB are two maximal risk evaluation formula groups, GpA ⤏ GpB means R_A ⤏ R_B, where R_A is a risk evaluation formula in GpA and R_B is in GpB. Please be reminded that all risk evaluation formulas in the same maximal formula group have the same performance.
As discussed in Section 3.2, when comparing two risk evaluation formulas R_i and R_j, there are five scenarios that characterize all possible cases of "R_i performs better than R_j". Table 6 summarises the scenarios covered by each subject program. For example, for the case of "flex", among a total of 50 comparison results (10 comparison pairs × 5 mutants of flex), there are 22 comparison results of "R_i performs better than R_j" obtained from Scenario 1, six from Scenario 2, five from Scenario 3, six from Scenario 4, two from Scenario 5, and nine comparison results indicating "R_i performs the same as R_j".
As shown in Table 6, the cases of "R_i performs better than R_j" are distributed across Scenarios 1 to 5, with the per-scenario totals reported in the table. According to these observations, Scenario 1 covers the majority of the cases in which R_i performs better than R_j.