Performance Analysis of Maximal Risk Evaluation Formulas for Spectrum-Based Fault Localization

The effectiveness analysis of risk evaluation formulas has become a significant research area in spectrum-based fault localization (SBFL). A risk evaluation formula is designed and widely used to evaluate the likelihood of a program spectrum being faulty. Numerous empirical and theoretical studies have investigated and compared the performance of sixty risk evaluation formulas. According to previous research, these sixty risk evaluation formulas together form a partially ordered set. Among them, nine formulas are maximal. These nine formulas can further be grouped into five maximal risk evaluation formula groups so that formulas in the same group have the same performance. Moreover, previous research showed that we cannot theoretically compare formulas across these five maximal formula groups. However, experimental data “suggests” that a maximal formula in one group could outperform another one (from a different group) more frequently, though not always. This inspired us to further investigate the relative performance of any two maximal formulas from different maximal formula groups. Our approach involves two major steps. First, we propose a new condition for comparing two different maximal formulas. Based on this new condition, we present five different scenarios under which one formula performs better than another. This is different from the condition suggested by the previous theoretical study. Second, we performed an empirical study to compare different maximal formulas using our condition. Our results show that, for any two maximal risk evaluation formulas, it is feasible to identify the one that outperforms the other more frequently.


Introduction
Spectrum-based fault localization (SBFL) [1,2] is an important technique that aims to locate the statements most likely to be faulty during software testing and debugging, processes that are time consuming, resource intensive, and expensive due to the ever-increasing scale and complexity of software [3]. To determine the suspicious area in a faulty program, SBFL utilizes the concept of a program spectrum. Loosely speaking, a program spectrum is a "certain entity" of the program under debugging together with its execution information, such as testing results and coverage information, with respect to a test suite. The program entity can be of any granularity [1], ranging from a simple statement to a basic block. The purpose of SBFL is to identify which program spectrum (more precisely, the program entity contained in the spectrum) is more likely to have faults. For SBFL to be possible, all information related to program spectra has to be collected during the testing of the program. Based on this collected information, the debugger uses a risk evaluation formula to calculate the risk values of all program spectra and then ranks them in descending order of risk value, because the risk evaluation formula is designed so that the higher the risk value of a program spectrum, the more likely the spectrum contains faults. The debugger then inspects the ranking list from top to bottom, and the program spectra with high risk values are regarded as the likely faulty areas [4]. Many risk evaluation formulas have been proposed in the research literature, and different formulas may give rise to different risk values, which lead to different ranking orders and hence different "potential" faulty areas. It is therefore necessary to find the most effective formula to "best" locate the faulty program entity.
There is both empirical and theoretical research on the effectiveness of different risk evaluation formulas. For example, empirical investigations have been used to compare the performance of different risk evaluation formulas [2,4-10]. However, an empirical investigation can never be sufficient and fair enough to compare formulas, because different experimental setups and factors affect the results.
Theoretical approaches have been proposed to analyze the effectiveness of risk evaluation formulas [11-17]. Detailed definitions and discussions of this work are in Section 2.2. Theoretical analysis proves that (a) there are five groups of maximal formulas [14,17]; (b) formulas within the same group have the same performance [14]; and (c) no formula from one of these groups always comes out ahead of a formula from another group [18]. However, it is possible that a formula from one group more frequently (though not always) comes out ahead of a formula from another group. It is too difficult to show by theoretical analysis which formula more frequently comes out ahead of another, because there are many possible variations. Therefore, we adopted an empirical analysis to see which formula more frequently comes out ahead of another formula.
This led us to investigate the following research questions:
1. For any two maximal formulas from different maximal formula groups, which one performs better than the other more frequently?
2. Is there a maximal formula group that performs better than every other maximal formula group more frequently?
Previous research has shown that there are five maximal formula groups and that the maximal formulas from the same maximal group have the same performance. Hence, we only need to pick one maximal formula from each of these five groups and compare them in a pairwise manner, which gives 10 such comparisons. We performed an empirical study on 11 small to medium sized C programs ranging from 135 to 9932 source lines of code, with an average of 3137.8. We present our findings in this work.
The primary contributions of this paper are summarized as follows:
(1) We performed an empirical study to compare any two maximal risk evaluation formulas, each from a different maximal formula group.
(2) We propose a new condition for comparing two risk evaluation formulas. This condition is different from those used in other similar empirical and theoretical studies. We use the expected location of the "faulty" statement to compare the formulas, whereas previous studies used the exact location for comparison.
(3) We present and discuss five different scenarios that could lead to the conclusion that one maximal formula performs better than another maximal formula more frequently under our condition. When using the "exact location" condition as in the previous study, these scenarios could not be easily discovered or discussed.
The remainder of this paper is organized as follows. Section 2 introduces the background of SBFL and the maximal risk evaluation formula groups. Section 3 proposes a new condition for comparing two risk evaluation formulas. It also discusses the five scenarios that cover all possible cases in which one formula outperforms another. Section 4 discusses our empirical study, its results, and its threats to validity. Section 5 describes the related work on the effectiveness of SBFL techniques. Finally, Section 6 concludes the paper and discusses future work.

Spectrum-Based Fault Localization (SBFL)
Spectrum-based fault localization (SBFL) uses two significant pieces of information to help localize faults, if any, in a program during the software testing and debugging processes. The first piece of information is the testing result of the program with respect to a test suite. It indicates whether the program passes or fails on each individual test case. The second piece of information is the program spectrum, which contains the information about a program entity (e.g., statement, branch, or basic block) and its coverage information, such as whether it has been executed or not; how many passing test cases execute or do not execute the program entity; and how many failing test cases execute or do not execute the program entity [3]. The program spectrum information, together with the test results, provides a behavioral signature for program execution with respect to a test suite [1]. It also describes the characteristic information of the program entities obtained from runtime profiling when the program executes on the test suite. From a software testing perspective, a program entity could be a statement, branch, path, function, or basic block, among which the statement is the most widely used because of its simplicity of analysis [19]. The characteristic information of a program entity could be the number of times that the entity has been executed, entity coverage information, the program state before and after the execution, etc. [11]. Testers and debuggers can make use of the test results and program spectra to identify the program entities that are more likely to cause program failure [1,20]. An example is to combine the test results and statement coverage information [21,22].
Given a program $PG$ with $n$ statements $(s_1, s_2, \ldots, s_n)$ and a test suite $TS$ with $m$ test cases $(t_1, t_2, \ldots, t_m)$, Figure 1 depicts the relationships among all the essential information for SBFL [14]. The $n \times m$ matrix $MS$ records the coverage information for each statement in $PG$ with respect to each test case in $TS$. If statement $s_i$ is executed by test case $t_j$, the entry in the $i$-th row and $j$-th column of $MS$ is marked as "1". Otherwise, it is marked as "0". The $1 \times m$ matrix $RE$ represents the testing result of each individual test case in terms of pass ($p$) or fail ($f$). The $1 \times 4$ matrix $A$ represents four important quantities for each statement in the program. These four quantities are (1) the number of test cases that execute the statement and fail the program, denoted as $e_f$; (2) the number of test cases that execute the statement and pass the program, denoted as $e_p$; (3) the number of test cases that do not execute the statement and fail the program, denoted as $n_f$; and (4) the number of test cases that do not execute the statement and pass the program, denoted as $n_p$. Obviously, the sum of these four quantities is equal to the size of the test suite $TS$; that is, $e_f + e_p + n_f + n_p = m$ [14]. The sum of $e_f$ and $n_f$ is equal to the total number of failed test cases, denoted by $F$, and the sum of $e_p$ and $n_p$ is equal to the total number of passed test cases, denoted by $P$. That is, $e_f + n_f = F$ and $e_p + n_p = P$. Also, $0 \le e_f, n_f \le F$ and $0 \le e_p, n_p \le P$. Finally, the $n \times 4$ matrix $MA$ summarizes these quantities for all statements in program $PG$, which will be used to calculate the risk values according to different risk evaluation formulas.
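As a minimal sketch of how the matrix $MA$ could be derived from $MS$ and $RE$, the following Python fragment counts the four quantities for each statement. The function name `spectrum_counts` and the toy data are illustrative assumptions and are not taken from the study.

```python
from typing import List, Tuple


def spectrum_counts(ms: List[List[int]], results: List[str]) -> List[Tuple[int, int, int, int]]:
    """Return one (e_f, e_p, n_f, n_p) tuple per statement (the n x 4 matrix MA).

    ms[i][j] is 1 if statement i is executed by test case j, otherwise 0.
    results[j] is "p" if test case j passes and "f" if it fails.
    """
    ma = []
    for row in ms:
        e_f = sum(1 for hit, res in zip(row, results) if hit == 1 and res == "f")
        e_p = sum(1 for hit, res in zip(row, results) if hit == 1 and res == "p")
        n_f = sum(1 for hit, res in zip(row, results) if hit == 0 and res == "f")
        n_p = sum(1 for hit, res in zip(row, results) if hit == 0 and res == "p")
        ma.append((e_f, e_p, n_f, n_p))
    return ma


# Hypothetical toy data: 3 statements, 4 test cases (2 pass, 2 fail).
MS = [
    [1, 1, 1, 1],  # s1 is executed by every test case
    [1, 0, 1, 0],  # s2 is executed only by the passing test cases
    [0, 1, 0, 1],  # s3 is executed only by the failing test cases
]
RE = ["p", "f", "p", "f"]
MA = spectrum_counts(MS, RE)
```

For every row of `MA`, the four entries sum to the test suite size $m$, which mirrors the identity $e_f + e_p + n_f + n_p = m$.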

Risk Evaluation Formulas
After the program spectrum information has been constructed, a risk evaluation formula is used to compute a risk value that indicates the likelihood of a program entity being faulty. A program entity with a greater risk value has a higher chance of being faulty. Many classical formulas have been proposed and widely used, such as Tarantula [23], Jaccard [24], Ochiai [7], the Wong formulas [9,25,26], the Naish formulas [13], and genetic programming (GP) formulas [14,27]. The program entities used in these formulas are at the statement level. A program statement with a greater risk value is more likely to contain a fault. Debuggers then rank the program statements in descending order of their risk values. As a result, debuggers should first inspect the program statements appearing at the top of the ranking list. Therefore, it is fundamental to choose the most effective formula so that faulty statements rank as high as possible.
In order to analyze the effectiveness of various formulas, both empirical and theoretical studies have been conducted by many researchers. However, empirical analysis strongly depends on the experimental setup, in which the subject programs, fault types, and size of the test suite are the most influential threats to the experimental results. Therefore, to address the inaccuracy of empirical studies, theoretical analyses have been proposed to investigate the performance of risk evaluation formulas.
In [11,14,15], the investigation focuses on comparing two risk evaluation formulas and identifying the maximal risk evaluation formulas in a group of risk evaluation formulas. Recall that the risk evaluation formula is used by the debugger to generate a ranking list of program spectra. If a formula can "put" the faulty spectrum in a higher position in the ranking list than another formula does, the former is said to perform better than the latter. In the first aspect, the study of Xie et al. [14] divides the statements of the program under test into three mutually exclusive subsets: the statements with risk values greater than (denoted as $S_B^R$), equal to (denoted as $S_F^R$), and smaller than (denoted as $S_A^R$) that of the faulty statement, where $R$ is the risk evaluation formula and the program is assumed to contain only one faulty statement. Please note that the faulty statement must be in $S_F^R$ because this subset contains all statements that have the same risk value as the faulty statement. Loosely speaking, these three subsets divide the ranking list into three parts: the top part is $S_B^R$ because its statements appear "before" the faulty statement, since their risk values are greater than that of the faulty statement; the middle part is $S_F^R$; and the bottom part is $S_A^R$ because its statements appear "after" the faulty statement. Given any two formulas $R_1$ and $R_2$, a program with a faulty statement, and a test suite, $R_1$ is said to be equivalent to $R_2$ (denoted by $R_1 \leftrightarrow R_2$) if $S_B^{R_1} = S_B^{R_2}$ and $S_A^{R_1} = S_A^{R_2}$. In contrast, in [12,13], two equivalent risk evaluation formulas require strictly identical ranking lists. Furthermore, $R_1$ is said to be better than $R_2$ (denoted by $R_1 \rightarrow R_2$) if $S_B^{R_1} \subseteq S_B^{R_2}$ and $S_A^{R_2} \subseteq S_A^{R_1}$. This is because $R_1$ will place the faulty statement earlier in the ranking list than $R_2$. As a result, a debugger using the ranking list from $R_1$ will reach the faulty statement earlier than one using $R_2$. As can be seen from the definition of "better", the set of all risk evaluation formulas equipped with the "better" relation becomes a partially ordered set (poset). The maximal elements in this poset are referred to as "maximal risk evaluation formulas", or simply "maximal formulas" if it is clear from the context. Following the definition in [14], a formula $R_1$ is said to be a maximal formula in a set $S$ of formulas if, for any formula $R_2 \in S$, $R_2$ being better than $R_1$ implies that $R_2$ is equivalent to $R_1$; that is, no other formula can outperform a maximal formula.
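The three-way split of the ranking list can be sketched as follows. This is only an illustration under the single-fault assumption; the function name `partition` and the risk values are hypothetical.

```python
from typing import Dict, Set, Tuple


def partition(risk: Dict[str, float], faulty: str) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split statements into (S_B, S_F, S_A) relative to the faulty statement.

    S_B: statements with a risk value greater than that of the faulty statement.
    S_F: statements with the same risk value (always contains the faulty one).
    S_A: statements with a smaller risk value.
    """
    faulty_risk = risk[faulty]
    s_b = {s for s, value in risk.items() if value > faulty_risk}
    s_f = {s for s, value in risk.items() if value == faulty_risk}
    s_a = {s for s, value in risk.items() if value < faulty_risk}
    return s_b, s_f, s_a


# Hypothetical risk values produced by some formula R; "s2" is the faulty statement.
risk_values = {"s1": 0.9, "s2": 0.5, "s3": 0.5, "s4": 0.1}
S_B, S_F, S_A = partition(risk_values, "s2")
```

In this toy case `S_B` is `{"s1"}`, `S_F` is `{"s2", "s3"}`, and `S_A` is `{"s4"}`, so the three subsets are mutually exclusive and together cover all statements.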
The theoretical analysis in [11,14,15] shows that, among 30 risk evaluation formulas, there are five maximal formulas under the assumption that there is only one fault in the program. Moreover, these five formulas can be grouped into two groups in which all formulas in the same group have the same performance. These two maximal groups are ER1 and $ER_5$. The formulas in ER1 are Naish1 (abbreviated as N1) and Naish2, both from [13], whereas the formulas in $ER_5$ are Wong1 (abbreviated as W1) from [9], Russel & Rao from [28], and Binary from [13]. In the follow-up work [16,17], the researchers used genetic programming (GP) techniques to come up with 30 GP-evolved risk evaluation formulas. Among these 30 GP-evolved formulas, they further identified four more maximal formulas, namely GP02, GP03, GP13, and GP19 [27]. Moreover, GP13 was proven to be equivalent to the maximal formulas in the ER1 group [16]. As a result, the ER1 group is now denoted as $ER_1$, which also includes GP13. The other three GP-evolved formulas form three new groups of maximal formulas. In summary, among 60 risk evaluation formulas, there are nine maximal formulas, which can be grouped into five maximal formula groups. All five maximal formula groups are listed in Table 1.
Since no other formulas can outperform the formulas in these five maximal risk evaluation formula groups, it is intuitively appealing to compare the performance across these five maximal formula groups. Furthermore, Yoo et al. [18] showed that there never exists a greatest formula that outperforms all other formulas.
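For illustration, the sketch below implements one representative formula per maximal group and ranks statements by risk value. The formula definitions are written as they are commonly stated in the SBFL literature; since Table 1 is not reproduced here, they should be treated as assumptions and checked against it. The function names and the toy spectrum are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Spectrum = Tuple[int, int, int, int]  # (e_f, e_p, n_f, n_p)


def naish1(e_f: int, e_p: int, n_f: int, n_p: int) -> float:
    # Assumed definition: -1 if e_f < F, otherwise P - e_p.
    total_failed = e_f + n_f
    total_passed = e_p + n_p
    return -1.0 if e_f < total_failed else float(total_passed - e_p)


def wong1(e_f: int, e_p: int, n_f: int, n_p: int) -> float:
    # Assumed definition: e_f.
    return float(e_f)


def gp02(e_f: int, e_p: int, n_f: int, n_p: int) -> float:
    # Assumed definition: 2(e_f + sqrt(n_p)) + sqrt(e_p).
    return 2 * (e_f + math.sqrt(n_p)) + math.sqrt(e_p)


def gp03(e_f: int, e_p: int, n_f: int, n_p: int) -> float:
    # Assumed definition: sqrt(|e_f^2 - sqrt(e_p)|).
    return math.sqrt(abs(e_f ** 2 - math.sqrt(e_p)))


def gp19(e_f: int, e_p: int, n_f: int, n_p: int) -> float:
    # Assumed definition: e_f * sqrt(|e_p - e_f + n_f - n_p|).
    return e_f * math.sqrt(abs(e_p - e_f + n_f - n_p))


def rank(ma: List[Spectrum], formula: Callable[..., float]) -> List[int]:
    """Return statement indices in descending order of risk value.

    Ties are broken by statement index, as a simple consistent tie-breaking rule.
    """
    risks = [formula(*row) for row in ma]
    return sorted(range(len(ma)), key=lambda i: (-risks[i], i))


# Hypothetical spectrum for three statements with F = 2 and P = 2.
toy_ma = [(2, 2, 0, 0), (0, 2, 2, 0), (2, 0, 0, 2)]
ranking_n1 = rank(toy_ma, naish1)
ranking_w1 = rank(toy_ma, wong1)
```

Even on this tiny example the two formulas order the statements differently, which is why the choice of formula matters for the debugger.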

Comparing Two Risk Evaluation Formulas
In this section, we propose a new approach for comparing two risk evaluation formulas. We first review the approach used in previous work to judge whether one risk evaluation formula is better than another.
As mentioned previously, given a risk evaluation formula $R$, the ranking list can be divided into three mutually exclusive subsets, $S_B^R$, $S_F^R$, and $S_A^R$, in which statements are ranked by their suspiciousness. An empirical or theoretical comparison of two risk evaluation formulas $R_1$ and $R_2$ is then judged by the possible situations arising from six subsets: $S_B^{R_1}$, $S_F^{R_1}$, $S_A^{R_1}$, $S_B^{R_2}$, $S_F^{R_2}$, and $S_A^{R_2}$. In empirical studies, researchers compare the exact locations of the faulty statement in the ranking lists of $R_1$ and $R_2$. The faulty statement is in $S_F^{R_1}$ and in $S_F^{R_2}$. For $R_1$ to be better than $R_2$, the exact location of the faulty statement in $R_1$'s ranking list should be higher than that in $R_2$'s.
On the other hand, for a theoretical comparison of two risk evaluation formulas $R_1$ and $R_2$, $R_1$ is said to perform better than $R_2$ if "$S_B^{R_1} \subset S_B^{R_2}$ and $S_A^{R_2} \subset S_A^{R_1}$" [14]. Figure 2 depicts this situation. In fact, when $S_B^{R_1} = S_B^{R_2}$ and $S_A^{R_1} = S_A^{R_2}$, $R_1$ and $R_2$ have the same performance. Readers may argue that this is not very accurate. For example, after the ranking, if the "actual faulty statement" is located at the end of $S_F^{R_1}$ for $R_1$ and at the front of $S_F^{R_2}$ for $R_2$, then $R_2$ can perform better than $R_1$. A consistent tie-breaking scheme is therefore assumed in the theoretical study. This assumption of using consistent tie-breaking to compare the performance of two formulas seems reasonable without further information.
Based on this idea, we propose a new condition for comparing two risk evaluation formulas $R_1$ and $R_2$. We first define our notation. For a risk evaluation formula $R$, we use $n_B^R$, $n_F^R$, and $n_A^R$ to denote the number of statements in $S_B^R$, $S_F^R$, and $S_A^R$, respectively. Let $n$ denote the total number of statements in a program. Please observe that $n = n_B^R + n_F^R + n_A^R$. We now define the expected faulty location, EFL, of a risk evaluation formula $R$ by $\mathrm{EFL}(R) = n_B^R + n_F^R / 2$, where $n_B^R$ and $n_F^R$ are the numbers of statements in $S_B^R$ and $S_F^R$, respectively. This is, in fact, the expected location of the faulty statement in the ranking list generated by the risk evaluation formula $R$. Please observe that $\mathrm{EFL}(R) = (n + n_B^R - n_A^R) / 2$. By assuming that the faulty statement is at the middle of $S_F^R$, we can then calculate the ranking of the faulty statement. We now formally define our condition for comparing two risk evaluation formulas. For two risk evaluation formulas $R_1$ and $R_2$, we say that $R_1$ performs better than $R_2$ if the expected faulty location of $R_1$ is in front of that of $R_2$; that is, $\mathrm{EFL}(R_1) < \mathrm{EFL}(R_2)$. We have the following proposition.

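Proposition 1. For any two risk evaluation formulas $R_1$ and $R_2$, if $S_B^{R_1} \subset S_B^{R_2}$ and $S_A^{R_2} \subset S_A^{R_1}$, then $R_1$ performs better than $R_2$.
Proof. For any two risk evaluation formulas $R_1$ and $R_2$, the expected faulty locations of $R_1$ and $R_2$ are $\mathrm{EFL}(R_1) = (n + n_B^{R_1} - n_A^{R_1}) / 2$ and $\mathrm{EFL}(R_2) = (n + n_B^{R_2} - n_A^{R_2}) / 2$, respectively. When $S_B^{R_1} \subset S_B^{R_2}$ and $S_A^{R_2} \subset S_A^{R_1}$, we have $n_B^{R_1} < n_B^{R_2}$ and $n_A^{R_1} > n_A^{R_2}$, so the expected faulty location for $R_1$ is in front of that for $R_2$; that is, $\mathrm{EFL}(R_1) < \mathrm{EFL}(R_2)$. Hence, $R_1$ performs better than $R_2$.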
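The comparison by expected faulty location can be sketched in a few lines. The sketch assumes that the faulty statement sits in the middle of the tie group, so that EFL is the size of $S_B^R$ plus half the size of $S_F^R$; this exact form, the function names, and the subset sizes below are assumptions made for illustration.

```python
def expected_faulty_location(n_b: int, n_f: int) -> float:
    """EFL under the assumed form n_B + n_F / 2 (faulty statement in the middle of S_F)."""
    if n_f < 1:
        raise ValueError("S_F always contains at least the faulty statement")
    return n_b + n_f / 2


def compare(efl_r1: float, efl_r2: float) -> str:
    """Return which formula performs better under the EFL condition."""
    if efl_r1 < efl_r2:
        return "R1"
    if efl_r2 < efl_r1:
        return "R2"
    return "same"


# Hypothetical subset sizes for a program with n = 10 statements.
efl_1 = expected_faulty_location(n_b=2, n_f=3)  # n_A = 5
efl_2 = expected_faulty_location(n_b=4, n_f=2)  # n_A = 4
verdict = compare(efl_1, efl_2)
```

Here the first formula has fewer statements before the faulty one and more after it, so its expected faulty location is smaller and it is judged the better performer.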

Five Scenarios for One Formula Better Than Another
In a previous study [14], formula $R_1$ is said to be better than $R_2$ under the condition "$S_B^{R_1} \subset S_B^{R_2}$ and $S_A^{R_2} \subset S_A^{R_1}$", which means that the number of statements in $S_B^{R_1}$ is smaller than that in $S_B^{R_2}$, and the number of statements in $S_A^{R_1}$ is larger than that in $S_A^{R_2}$. However, our experimental results show that this is not the only scenario in which one formula is better than another. In the following, we present five scenarios in which a risk evaluation formula $R_1$ performs better than $R_2$. We denote them as B(<)A(>), B(<)A(=), B(<)A(<), B(=)A(>), and B(>)A(>), where (1) B indicates the comparison between the numbers of statements in the subsets $S_B^{R_1}$ and $S_B^{R_2}$; (2) A indicates the comparison between the numbers of statements in the subsets $S_A^{R_1}$ and $S_A^{R_2}$; and (3) <, =, and > respectively mean that the former number is smaller than, equal to, and larger than the latter one. Figure 3 depicts these five different scenarios, which cover all possible cases in which one risk evaluation formula performs better than another.
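The claim that these five scenarios cover all cases can be checked by brute force on a small program size. The sketch below labels each pair of subset sizes with its scenario and collects the labels for which the first formula has the smaller expected faulty location. It assumes that the expected faulty location is the size of the "before" subset plus half the size of the tie group; all names are hypothetical.

```python
from itertools import product
from typing import Set


def sign(a: int, b: int) -> str:
    """Return '<', '=' or '>' according to how a compares with b."""
    return "<" if a < b else (">" if a > b else "=")


def scenario(nb1: int, na1: int, nb2: int, na2: int) -> str:
    """Label a comparison, e.g. 'B(<)A(>)', from the sizes of S_B and S_A."""
    return f"B({sign(nb1, nb2)})A({sign(na1, na2)})"


def efl(n_b: int, n_a: int, n: int) -> float:
    """Assumed EFL form: n_B + n_F / 2 with n_F = n - n_B - n_A."""
    n_f = n - n_b - n_a
    return n_b + n_f / 2


def better_scenarios(n: int) -> Set[str]:
    """Enumerate all size combinations and keep labels where R1 beats R2."""
    # S_F holds at least the faulty statement, so n_B + n_A <= n - 1.
    sizes = [(b, a) for b in range(n) for a in range(n) if b + a <= n - 1]
    found = set()
    for (nb1, na1), (nb2, na2) in product(sizes, repeat=2):
        if efl(nb1, na1, n) < efl(nb2, na2, n):
            found.add(scenario(nb1, na1, nb2, na2))
    return found


labels = better_scenarios(8)
```

Under this assumption, `labels` contains exactly the five scenario names listed above, because the first formula wins precisely when its "before" count minus its "after" count is the smaller of the two.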

Subject Programs and Test Suite
Eleven C programs, whose numbers of executable statements (eLOC) range from 135 to 9932 with an average of 3137.8 eLOC, were selected for our study. The eLOCs of these programs were counted by SLOCCount 2.26 [29]. These programs were selected because they have been used in fault localization experiments performed by other researchers [25,30-32]. The programs were downloaded from the Software-artifact Infrastructure Repository (SIR) [33]. These 11 programs are (1) three UNIX utilities, namely flex, grep, and sed; (2) one real-life application, space; and (3) seven small programs usually referred to as the Siemens suite. The following is a description of these 11 programs.

• Flex is a UNIX utility that generates a lexical analyzer by scanning a lex file containing definitions, rules, and user code. The generated analyzer then transforms the input stream into a sequence of tokens.
• Grep is a pattern matching engine. Given one or more patterns and some input files to search, it outputs the lines that match one or more of the patterns.
• Sed is a stream editor that performs operations on the input stream, such as replacement, deletion, and insertion on a specific line or on the global text.
• Space is an interpreter for an array definition language (ADL) that checks the ADL grammar and specific consistency rules. If the ADL file is correct, space outputs an array data file; otherwise, the program outputs error messages.
• Print_tokens and print_tokens2 are two lexical parsers used to group input strings into tokens and identify the token categories. The main difference between these two programs is that print_tokens uses a hard-coded DFA, while print_tokens2 does not.
• Replace is a program for regular expression matching and substitution. It replaces any substring matched by the input regular expression with a replacement string, and outputs a new file.
• Schedule and schedule2 are used to schedule the priority in three job lists. Schedule is non-preemptive and schedule2 is preemptive.
• Tcas is used to avoid air accidents by detecting on-board conflicts through the radar system and providing a resolution advisory, such as climb, descend, or remain on the current trajectory.
• Tot_info takes a set of tables as input and outputs Kullback's information measure, the degrees of freedom, and the probability density of a $\chi^2$ distribution for each table, as well as a summary of the entire set.
Since previous fault localization investigations on the effectiveness of risk evaluation formulas assume that the faulty program contains one fault, e.g., [14], we used the same assumption in our empirical study. Some of the faulty versions of these 11 programs on SIR have multiple faults, and hence were excluded from our experiments. For the three UNIX utilities, all the faulty versions unfortunately contain multiple faults. Hence, we manually generated five faulty versions for each of these UNIX utilities, each containing only one single fault. The space program has 38 faulty versions with real faults. Only seven of these 38 faulty versions contain single faults, and hence were selected for our experiments. The Siemens suite includes seven small C programs with various seeded faults in their faulty versions. Some of these faulty versions were excluded from our experiments as they contain more than one fault. Most of the faulty statements are conditional statements and assignment statements. For example, the faulty statement of print_tokens2 v6 is "if (isdigit(*(str+i+1)))", which should be "if (isdigit(*(str+i)))". A faulty version that is selected or manually generated for our empirical study is referred to as a "selected faulty version" whenever it is clear from the context.
Table 2 summarizes the information about these programs for our empirical study. As discussed earlier, the eLOC column reports the number of executable statements counted by SLOCCount 2.26 [29]. The mutants generated from the UNIX utilities, the selected faulty versions of space and the Siemens suite, the number of manually mutated or selected faulty versions, and the total number of faulty versions provided in SIR are listed in the third column. The size of each test suite, obtained from the individual "universe" test plan, is listed in the second-to-last column. In summary, our empirical study had 15 mutants for the three UNIX utilities, seven faulty versions for the space program, and 26 faulty versions for the Siemens suite. The size of the corresponding test suite ranged from 441 to 13,550 test cases, with an average of 3782.9 test cases per selected faulty version.

The Empirical Process
We performed an empirical study to compare the effectiveness of the five maximal formula groups, $ER_1$, $ER_5$, GP02, GP03, and GP19. It has been proven by Xie et al. [14] that all formulas in the same maximal formula group have the same performance. Therefore, it is sufficient to select only one representative formula from each group for study. In other words, if another formula from the same group were chosen, we would still observe the same results. We chose Naish1 (abbreviated as N1) from $ER_1$ and Wong1 (abbreviated as W1) from $ER_5$ because they are the simplest formulas in their own groups. We chose GP02, GP03, and GP19 from the last three groups because each of these groups contains only one maximal formula.
The empirical study aimed to compare the performance of these formulas in a pairwise manner. As a result, we have 10 comparison pairs. For ease of reference, we use CP1-CP10 to denote these 10 pairs, in which CP1 denotes the comparison pair of N1 and W1, CP2 of N1 and GP02, CP3 of N1 and GP03, CP4 of N1 and GP19, CP5 of W1 and GP02, CP6 of W1 and GP03, CP7 of W1 and GP19, CP8 of GP02 and GP03, CP9 of GP02 and GP19, and finally, CP10 of GP03 and GP19.
For the selected 11 C programs, there were altogether 48 faulty versions selected for the empirical study. Each selected faulty version was executed with respect to its corresponding test suite. All executions of faulty programs with respect to their test suites were performed on a Fedora 20 64-bit virtual machine with one processor and 4 GB of memory. After the execution, all information related to the program spectrum was recorded, namely $e_f^i$, $e_p^i$, $n_f^i$, and $n_p^i$ for every statement $s_i$ in the selected faulty version. Once all information had been collected, we applied the five maximal risk evaluation formulas, each from its own formula group, to calculate the risk values for each statement and obtain the corresponding ranking lists. Statements with the same risk value were ordered according to their corresponding statement IDs in the faulty program. For each maximal formula $R$ in {N1, W1, GP02, GP03, GP19}, we then divided its ranking list into three mutually exclusive subsets, namely $S_B^R$, $S_F^R$, and $S_A^R$. We then calculated the expected faulty location, EFL, of the risk evaluation formula $R$ using the formula $\mathrm{EFL}(R) = n_B^R + n_F^R / 2$. Once we had calculated the expected faulty locations of these five formulas, we performed our analysis on the pairwise comparisons between these formulas. For each comparison pair of formulas $R_1$ and $R_2$, we conclude that $R_1$ performs better than $R_2$ on that selected faulty version when $\mathrm{EFL}(R_1) < \mathrm{EFL}(R_2)$. When $\mathrm{EFL}(R_1) = \mathrm{EFL}(R_2)$, the two formulas have the same performance. When $\mathrm{EFL}(R_2) < \mathrm{EFL}(R_1)$, we conclude that $R_2$ performs better than $R_1$.
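The pairwise comparison step for a single faulty version can be sketched as follows. The ordering of the formula list is chosen so that the generated pairs match CP1-CP10 as defined above; the function name and the EFL values are hypothetical and do not come from the experiments.

```python
from itertools import combinations
from typing import Dict

# Order matters: combinations() then yields the pairs in the order CP1..CP10.
FORMULAS = ["N1", "W1", "GP02", "GP03", "GP19"]


def compare_all(efl: Dict[str, float]) -> Dict[str, str]:
    """Map each comparison pair (CP1..CP10) to the better formula or 'same'."""
    outcome = {}
    for k, (r1, r2) in enumerate(combinations(FORMULAS, 2), start=1):
        if efl[r1] < efl[r2]:
            winner = r1
        elif efl[r2] < efl[r1]:
            winner = r2
        else:
            winner = "same"
        outcome[f"CP{k}"] = winner
    return outcome


# Hypothetical EFL values for one selected faulty version.
example_efl = {"N1": 3.5, "W1": 12.0, "GP02": 7.5, "GP03": 3.5, "GP19": 20.0}
result = compare_all(example_efl)
```

With these made-up values, `result["CP1"]` is `"N1"` and `result["CP3"]` is `"same"`, in the same style as the cells reported for each faulty version.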

Experimental Results and Analysis
Table 3 lists all the results of these 10 comparison pairs. For example, "N1" in the cell of row "tcas v1" and column "CP1" indicates that "N1 performs better than W1". Some cells in Table 3 contain "same", meaning that the two risk formulas in the comparison pair have the same performance.
Since the performance of maximal formulas in each comparison pair varies across programs, Table 4 summarizes the results as the percentages of "$R_1$ performs better than $R_2$", "$R_1$ performs the same as $R_2$", and "$R_2$ performs better than $R_1$" for each comparison pair. From Table 4, we have the following observations.
(1) N1 has a higher chance to perform better than W1, GP02, GP03, and GP19.
a. For CP1, we observe that N1 has a higher chance to perform better than W1 for all programs except grep, sed, and schedule2. The percentage values of N1 $\succ$ W1 (N1 performing better than W1) range from 33.3% to 100% with an average of 74.6%, whereas those of W1 $\succ$ N1 range from 0% to 66.7% with an average of 24.3%. Hence, we can conclude that N1 performs more-frequently-better than W1.
b. Similarly, for CP2, CP3, and CP4, we can also conclude that N1 has a higher chance to perform better than GP02, GP03, and GP19, respectively. However, we have to point out an interesting observation about the N1 and GP03 pair. For N1 and GP03, the percentages of N1 $\succ$ GP03 range from 0% to 80% with an average of 16.0%, whereas those of GP03 $\succ$ N1 range from 0% to 40% with an average of 8.1%. In fact, N1 and GP03 have the same performance in 75.9% of the cases on average.
(2) GP03 has a higher chance to perform better than W1, GP02, and GP19.
a. For CP8, GP03 has a higher chance to perform better than GP02 for all programs except flex, replace, schedule, and tcas. In fact, for replace and schedule, the chances of GP03 $\succ$ GP02 and of GP02 $\succ$ GP03 are the same; both are 0% for schedule and 33.3% for replace. The percentage values of GP03 $\succ$ GP02 range from 0% to 80% with an average of 44.9%, whereas those of GP02 $\succ$ GP03 range from 0% to 100% with an average of 27.1%. Hence, we can conclude that GP03 performs more-frequently-better than GP02.
b. Similarly, for CP6 and CP10, we can also conclude that GP03 performs more-frequently-better than W1 and GP19, since the average percentage values of GP03 $\succ$ W1 and GP03 $\succ$ GP19 are 64.0% and 63.9%, respectively.
(3) GP02 has a higher chance to perform better than W1 and GP19.
a. For CP5, we observe that GP02 performs more-frequently-better than W1 for all programs except grep, sed, and schedule2. The percentage values of GP02 $\succ$ W1 range from 0% to 100% with an average of 66.8%, whereas those of W1 $\succ$ GP02 range from 0% to 100% with an average of 31.4%. Hence, we can conclude that GP02 performs more-frequently-better than W1.
b. Similarly, for CP9, we can also conclude that GP02 performs more-frequently-better than GP19.
(4) W1 has a higher chance to perform better than GP19.
a. For CP7, W1 performs more-frequently-better than GP19 for all programs except flex, grep, sed, and schedule2. In fact, for flex and grep, the chances of W1 $\succ$ GP19 and of GP19 $\succ$ W1 are the same; both are 40% for flex and 40% for grep. The percentage values of W1 $\succ$ GP19 range from 33.3% to 100% with an average of 66.1%, whereas those of GP19 $\succ$ W1 range from 0% to 66.7% with an average of 27.8%. Hence, we can conclude that W1 performs more-frequently-better than GP19.
In summary, the order very likely to be observed is N1, GP03, GP02, W1, GP19, where each formula performs more-frequently-better than those listed after it.
An interested reader may argue that this order is based on the percentage of the selected faulty versions per individual program, and that we compare the formulas by taking the average of these percentages. This may be unfair because, for tcas, there is only one selected faulty version. Only tcas v1 was selected because the tcas subject program is too small and most of its faulty versions are of the same type. Hence, a 100% result for "N1 better than W1" in tcas may give N1 a larger advantage than the same result in other individual programs. One may then ask what the result would be if we used the total number of selected faulty versions to compare these risk evaluation formulas. The answer is that we would have the same result, as discussed in the rest of this paragraph.
Table 5 shows, for each comparison pair R1 and R2, the total number of selected faulty versions that fall into the results of "R1 better", "R2 better", and "R1 same as R2". For example, in the row of CP3 (the pair of N1 and GP03), there are nine selected faulty versions that fall into "N1 better", five in "GP03 better", and 34 in "N1 same as GP03". Hence, N1 performs more-frequently-better than GP03. From Table 5, we have the following observations:
(1) For CP1-CP4, N1 has a higher chance to perform better than W1, GP02, GP03, and GP19.
(3) For CP5 and CP9, GP02 has a higher chance to perform better than W1 and GP19.
(4) For CP7, W1 has a higher chance to perform better than GP19.
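The two ways of aggregating the comparison results discussed above (averaging per-program percentages versus pooling the selected faulty versions) can be sketched as follows. This is a minimal illustration, not the authors' tooling: the program names and counts are hypothetical and are not taken from Table 4 or Table 5, and the decision rule (the formula with the larger score is taken to be more-frequently-better) is an assumption inferred from the CP3 example.

```python
from typing import Dict, Tuple

# Hypothetical counts for one comparison pair (R1, R2), per subject program:
# (versions where R1 is better, versions where R2 is better, versions with a tie).
# Illustrative numbers only; NOT taken from Table 4 or Table 5.
COUNTS: Dict[str, Tuple[int, int, int]] = {
    "prog_a": (1, 0, 0),  # a program with a single selected faulty version
    "prog_b": (2, 3, 5),
    "prog_c": (4, 1, 5),
}


def per_program_average(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float]:
    """Average the per-program percentages; every program weighs equally."""
    r1 = [100.0 * a / (a + b + t) for a, b, t in counts.values()]
    r2 = [100.0 * b / (a + b + t) for a, b, t in counts.values()]
    return sum(r1) / len(r1), sum(r2) / len(r2)


def pooled_totals(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[int, int, int]:
    """Sum the counts over all programs; every faulty version weighs equally."""
    r1 = sum(a for a, _, _ in counts.values())
    r2 = sum(b for _, b, _ in counts.values())
    ties = sum(t for _, _, t in counts.values())
    return r1, r2, ties


def more_frequently_better(r1_score: float, r2_score: float) -> str:
    """Assumed decision rule: the formula with the larger score wins."""
    if r1_score > r2_score:
        return "R1"
    if r2_score > r1_score:
        return "R2"
    return "tie"


avg_r1, avg_r2 = per_program_average(COUNTS)
tot_r1, tot_r2, tot_ties = pooled_totals(COUNTS)
print("per-program average:", more_frequently_better(avg_r1, avg_r2))
print("pooled totals:", more_frequently_better(tot_r1, tot_r2))
```

In this made-up data set the single-version program contributes a full 100% to the per-program average but only one version to the pooled totals, which is the concern raised about tcas; reporting both views shows whether the conclusion depends on the weighting.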
As a result, we have the same observation: the order is N1, GP03, GP02, W1, GP19. Since N1 and W1 are representative formulas from their original maximal formula groups ER1 and ER5, respectively, we can further conclude that any maximal formula in the ER1 group performs more-frequently-better than GP03, which in turn performs more-frequently-better than GP02, which in turn performs more-frequently-better than any maximal formula in the ER5 group, which in turn performs more-frequently-better than GP19. That is, the order is ER1, GP03, GP02, ER5, GP19. Here, we extend the more-frequently-better relation to compare two maximal formula groups: if GpA and GpB are two maximal risk evaluation formula groups, GpA performs more-frequently-better than GpB when Ra performs more-frequently-better than Rb, where Ra is a risk evaluation formula in GpA and Rb is one in GpB. Recall that all risk evaluation formulas in the same maximal formula group have the same performance.
As discussed in Section 3.2, when comparing two risk evaluation formulas R1 and R2, there are five scenarios that characterize all possible cases in which R1 performs better than R2. Table 6 summarises the scenarios covered by each subject program. For example, for flex, among a total of 50 comparison results (10 comparison pairs × 5 mutants of flex), 22 results in which R1 is better than R2 come from Scenario B(<)A(>), six from Scenario B(<)A(=), five from Scenario B(<)A(<), six from Scenario B(=)A(>), and two from Scenario B(>)A(>), while nine comparison results indicate R1 = R2. As shown in Table 6, 51.7% of the cases in which R1 is better than R2 fall in Scenario B(<)A(>), 3.8% in Scenario B(<)A(=), 11.5% in Scenario B(<)A(<), 13.9% in Scenario B(=)A(>), and 19.1% in Scenario B(>)A(>). According to this observation, Scenario B(<)A(>) covers the majority of the cases in which R1 is better than R2.
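The flex tally quoted above can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text for flex; the choice of denominator (the comparison results in which one formula is better, excluding ties) is an assumption, made because the overall percentages quoted from Table 6 sum to 100% across the five scenarios.

```python
# Scenario counts for flex, as quoted in the text.
FLEX_SCENARIOS = {
    "B(<)A(>)": 22,
    "B(<)A(=)": 6,
    "B(<)A(<)": 5,
    "B(=)A(>)": 6,
    "B(>)A(>)": 2,
}
FLEX_TIES = 9  # comparison results indicating R1 = R2

better = sum(FLEX_SCENARIOS.values())  # results where one formula is better
total = better + FLEX_TIES             # all comparison results for flex

# Share of each scenario among the "better" results (assumed denominator).
shares = {name: 100.0 * n / better for name, n in FLEX_SCENARIOS.items()}
dominant = max(shares, key=shares.get)

print(f"total comparison results: {total}")
for name, share in shares.items():
    print(f"Scenario {name}: {share:.1f}%")
print(f"dominant scenario for flex: {dominant}")
```

The counts add up to the 50 comparison results stated for flex (10 comparison pairs × 5 mutants), and the dominant scenario for this one program is the same one that dominates across all subject programs.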

Discussion
Yoo et al. [17] reached a conclusion that differs to some extent from that of our experiment. They concluded that ER1 is the best performer, ER5 is the worst in most cases, and the other three formula groups GP02, GP03, and GP19 perform similarly, with GP03 being better than GP02 and GP19. The difference between their study and ours probably arises because different metrics and experimental setups were used. Yoo et al. used the expense metric to measure the number of statements that should be examined before the faulty statement is found. In this paper, we utilized the expected faulty location to measure the expected location of the faulty statement in the faulty program. Additionally, Yoo et al. conducted an empirical study with five subject programs from SIR: flex, grep, gzip, sed, and space. In our experiment, we performed the empirical study with 11 subject programs, among which flex, grep, sed, and space were used as well. Nevertheless, their conclusion is only slightly different from our ordering of ER1, GP03, GP02, ER5, GP19. Most importantly, both Yoo et al. and we observed that ER1 is the best performer, followed by GP03; that is, we have the same recommendation of using ER1.
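To make the difference between the two kinds of metric concrete, the sketch below computes a rank-based expected position and a percentage-based expense for a toy ranking. Both definitions are assumptions chosen for illustration (ties are placed at their average position, and expense is the rank expressed as a percentage of all statements); the exact definitions used by Yoo et al. and in this paper are given elsewhere and may differ. The risk values are hypothetical.

```python
from typing import Sequence


def expected_rank(scores: Sequence[float], faulty: int) -> float:
    """Expected 1-based position of the faulty statement in a descending ranking.

    Assumption: statements tied with the faulty one are examined in random
    order, so the faulty statement sits at the middle of its tie group.
    """
    higher = sum(1 for s in scores if s > scores[faulty])
    tied = sum(1 for s in scores if s == scores[faulty])  # includes the faulty one
    return higher + (tied + 1) / 2


def expense_percent(rank: float, n_statements: int) -> float:
    """Assumed expense: share of statements examined up to the given rank."""
    return 100.0 * rank / n_statements


# Hypothetical risk values for five statements; statement 2 is the faulty one.
scores = [0.9, 0.5, 0.5, 0.5, 0.1]
rank = expected_rank(scores, faulty=2)
print(rank, expense_percent(rank, len(scores)))
```

Under these assumptions the two measures carry the same ordering information for a single program, but an absolute position and a percentage of program size aggregate differently across programs of different sizes, which is one way the choice of metric can affect a cross-program comparison.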

Test Suite
There are two threats related to the test suite used in our empirical study. The first is the size of the test suite. In our empirical study, we treated all test cases as one single test suite for fault localization purposes, as in previous empirical studies. This has been referred to as the "universe" plan in benchmarks. As mentioned in Naish et al. [13], the performance of a risk evaluation formula "may" depend on the actual number of test cases in the test suite. Hence, readers should not over-generalize our results without further research. The second is the composition of test cases in a test suite. Intuitively speaking, different test cases may have different fault detection capabilities. Hence, the performance of the same risk evaluation formula may differ between two test suites that have the same number of test cases. In future work, it would be interesting to investigate how the diversity of test cases in a test suite affects the detection of different types of faults.

Fault Type
To fit our experimental setting, in which the faulty statement must be localized, we excluded those faulty versions that delete or insert statements relative to the original programs. Also, we performed our experiments only on programs with single faults. In future work, we can extend our experimental study to programs with multiple faults.

Related Work
The effectiveness of different SBFL techniques strongly depends on the input test suite and the corresponding execution results. In contrast to the assumption in conventional SBFL techniques that test oracles exist, Xie et al. [34,35] presented an approach that alleviates the test oracle problem by using metamorphic slices, and Zhang et al. [31] used unlabelled test cases. Additionally, test case prioritization techniques have been presented to improve the effectiveness of the fault localization process and reduce the testing cost [36][37][38]. FLINT has been shown in [39] to outperform similar localization techniques in 52% of the cases. Yu et al. [40] also investigated different test suite reduction strategies for increasing SBFL effectiveness.
More empirical comparisons of the performance of various formulas were also reported in [10,17]. Pearson et al. [10] compared the performance of seven different formulas on artificial faults and on real faults from Defects4J, and indicated that the artificial faults were not as useful as the real faults for predicting the best formula. Xu et al. [32] found that labeling perturbations significantly influence the robustness of risk evaluation formulas, especially when passed test cases are mislabeled as failed.

Conclusions and Future Work
Various empirical and theoretical studies on SBFL risk evaluation formulas have been conducted to compare the performance of different formulas. Five maximal formula groups (ER1, ER5, GP02, GP03, and GP19) have been proven to be maximal, and it was also proven that there does not exist a greatest risk evaluation formula in terms of fault localization effectiveness. From experimental observation, we notice that some maximal formulas can perform better than others more frequently. Hence, in this paper, we propose the notion of "more-frequently-better" to compare two maximal formulas R1 and R2.
To verify our proposition that one maximal formula can perform more-frequently-better than another, we conducted an empirical study on 11 C programs with real and seeded faults. According to the experimental results, we conclude that the order is ER1, GP03, GP02, ER5, GP19, where each group performs more-frequently-better than those listed after it. In particular, ER1 performs more-frequently-better than any of the other four. Therefore, we have provided a way to compare the performance of any two maximal formulas using the notion of more-frequently-better, and illustrated the feasibility of our approach with an empirical study.
Since we make the assumption that there is a single fault in the faulty program, more experimental studies using programs with multiple faults should be conducted to investigate the effectiveness of our approach.

R 1 B
and S R 2 B ; (2) A represents the comparison pair between the number of statements in the subsets S R 1

Figure 3. Five scenarios in which R1 performs better than R2.

Table 1. Maximal risk evaluation formula groups.

Table 2. Subject programs and test suite.

Table 4. Comparison results as percentages of one formula being better than another (%).

Table 5. Comparison results as the number of selected faulty versions for which one formula is better than another.

Table 6. Scenarios covered by individual subject program.