Bug prediction aims to identify source code elements in a software system that are likely to contain defects. Being aware of the most error-prone parts of the program, one can efficiently allocate the limited testing and code review resources. Therefore, bug prediction can support software maintenance and evolution to a great extent. However, the practical adoption of such prediction models always depends on their real-world performance and the amount of disturbing misclassifications (i.e., false-positive hits) they produce. Despite the relative maturity of the bug prediction research area, practical utilization of the state-of-the-art models is still very low due to the reasons mentioned above.
Bug prediction models can use a diverse set of features. The most common types of such features are static source code metrics [1], process metrics [5], natural language features [8], and their combination [10]. All these metrics proved to be useful in different contexts, but the performance of the models may vary based on, for example, the language of the project, the composition of the project team, or the domain of the software product. Further studies are needed to better understand how and when these models work best. Additionally, source code metrics can be refined by using static and dynamic analysis in combination, which has a yet unknown impact on the performance of bug prediction models.
In this paper, we study function-level bug prediction on the BugsJS bug dataset (https://github.com/BugsJS). We extended the classic static source code metrics with hybrid invocation metrics derived from a combined static and dynamic analysis, which can capture calls that are impossible to capture statically (e.g., a call made through the eval() function). We analyzed the impact of these additional hybrid source code metrics on the function-level bug prediction models trained on this dataset.
We found that using invocation metrics calculated by a hybrid code analysis as bug prediction features consistently improves the performance of the ML prediction models. Depending on the ML algorithm, the applied hyper-parameters, and the target measure we consider, hybrid invocation metrics bring a 2–10% increase in model performance (i.e., precision, recall, F-measure). Interestingly, even though replacing the static NOI and NII metrics with their hybrid counterparts HNOI and HNII in itself improves model performance, keeping them all together yields the best results. This implies that hybrid call metrics indeed add complementary information to bug prediction.
The rest of the paper is structured as follows. In Section 2, we give an overview of the related work. Section 3 describes the steps of our approach in detail. Section 4 contains the results of comparing bug prediction models using only static, only hybrid, or both static and hybrid metrics as features for machine learning models. We list the possible threats to our work in Section 5 and conclude the paper in Section 6.
2. Related Work
Using call-graphs for source code and program analysis is a well-established and mature technique; the first papers dealing with call-graphs date back to the 1970s [14]. Call-graphs can be divided into two subgroups based on the method used to construct them: dynamic [17] and static [18].
Dynamic call-graphs can be obtained from an actual run of the program: during the run, various pieces of runtime information are collected about the interprocedural flow [19]. Techniques such as instrumenting the source code can be used for dynamic call-graph creation [16].
In contrast, there is no need to run the program in the case of static call-graphs, as they are produced by a static analyzer that processes the source code of the software without actually running it [16]. On the other hand, static call-graphs might include false edges (calls), since a static analyzer identifies many possible calls between functions that are not feasible in an actual run of the program; they might also miss real edges. Nevertheless, static call-graphs can be constructed in almost every case from the source code, even if the code itself is not runnable.
Different analysis techniques are often combined to obtain a hybrid solution, which guarantees a more precise call-graph and thus a more precise analysis [21]. However, constructing precise static call-graphs for dynamic scripting languages is a very hard task that is not fully solved yet [23]. Constructs such as eval() and bind() are especially problematic for a static analyzer; the most reliable method is to use dynamic approaches to detect such call edges. We decided to use both dynamic and static analysis to ensure better precision, even though this increases the analysis time and requires the code to be in a runnable state due to the dynamic analysis.
A closely related hybrid approach applies dynamic analysis to collect information even for those situations that are hard to analyze statically. The dynamic results (execution traces) are propagated to a static infrastructure, which embeds a call-graph builder as well, and this call-graph builder module makes use of the dynamically identified calls. In the case of pure static analysis, its authors wrapped the WALA tool (https://github.com/wala/WALA) to construct a static call-graph. As previously said, our approach works similarly, and it also supports the inclusion of additional call-graph builder tools in the flow of the analysis.
Feldthaus et al. presented an approximation method for call-graph construction [22]. Madsen et al. focused on the problems induced by libraries used in a project [29]; they used pointer analysis and a novel “use analysis” to enhance scalability and precision.
There are also works aiming to create a framework for comparing call-graph construction algorithms [30]. However, these target algorithms written in Java and C. Call-graphs are often used in preliminary analyses to determine whether an optimization can be applied to the code. Unfortunately, as they are specific to Java and C, we could not use these frameworks in our work.
Clustering call-graphs can have advantages in malware classification [32]; call-graphs can also help in localizing software faults [33], not to mention their usefulness in debugging [34].
Musco et al. [35] used four types of call-graphs to predict the software elements that are likely to be impacted by a change in the software. However, they used mutation testing to assess the impact of a change in the source code. The same methodology could be applied with a slight change: instead of an arbitrary change, one could consider a vulnerability-introducing or vulnerability-mitigating change.
Munaiah and Meneely [36] introduced two novel attack surface metrics, “Proximity” and “Risky Walks”, both of which are defined on the call-graph representation of the program. Their empirical study showed that building a prediction model on these metrics helps predict more accurately, as the metrics are statistically significantly associated with vulnerable functions.
Nguyen et al. [37] proposed a model to predict vulnerable components based on a metric set generated from the component dependency graph of a software system.
Cheng et al. [38] presented VGDetector, a new approach for detecting control-flow-related vulnerabilities. They applied a recent graph convolutional network to embed code fragments in a compact representation that still preserves high-level control-flow information.
Neuhaus et al. [39] presented a fully automatic way to map vulnerabilities to software components, along with a tool called Vulture that can automatically build predictors for vulnerabilities in new components. They found that imports and function calls influence whether a component is vulnerable. Their evaluation on Mozilla’s codebase showed that the approach is accurate.
Lee et al. [40] proposed a new approach to generate semantic signatures from programs to detect malware. They extracted the call-graph of the API call sequence that malware would generate, called a code graph, which serves as the semantic signature. Semantic signatures can detect malware even when it is obfuscated or slightly differs from its previous versions (the main reasons why commercial anti-virus tools miss such variants).
As these previous studies show, call-graphs have proven useful in predicting vulnerabilities in software systems. We did not narrow down the type of defects, however; our approach is generally applicable to arbitrary bug prediction.
The most similar to our study is possibly the work of Punia et al. [41].
3. Approach
Our approach consists of numerous steps, which we present in detail in this section. Figure 1 shows the steps required to produce the input for the machine learning algorithms.
3.1. BugsJS Dataset
BugsJS is a bug dataset inspired by Defects4J [42]; it contains 453 manually validated bugs mined from popular JavaScript (Node.js) projects.
BugsJS stores the forks of the original repositories and extends them by adding tags for their custom commits in the form of:
Bug-X: The parent commit of the revision in which the bug was fixed (i.e., the buggy revision)
Bug-X-fix: A revision (commit) containing only the production code changes (test code and documentation changes were excluded) introduced in order to fix the bug
where X denotes the number associated with a given bug. Out of the total of 453 bugs, ESLint (https://github.com/eslint/eslint) alone contains 333, so we chose this project as the subject of our study.
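Given this tagging convention, the buggy and fixed revisions can be scripted against directly. As a minimal sketch (the helper function and repository path are our illustrative assumptions, not part of BugsJS):

import subprocess

def checkout_bug(repo_dir, bug_id, fixed=False):
    # Check out the buggy (Bug-X) or fixed (Bug-X-fix) revision of a bug,
    # relying only on the BugsJS tag convention described above.
    tag = f"Bug-{bug_id}-fix" if fixed else f"Bug-{bug_id}"
    subprocess.run(["git", "checkout", tag], cwd=repo_dir, check=True)

# For example, the production-code change fixing bug 42 could be listed with:
# subprocess.run(["git", "diff", "Bug-42", "Bug-42-fix"], cwd="eslint")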
3.2. Hybrid Invocation Metrics Calculation
As a first step, we have to produce the so-called hybrid call-graphs from which we can calculate the hybrid invocation metrics (i.e., HNII and HNOI). In order to understand what a hybrid call-graph is, let us consider Figure 2, which shows the details of the node “hcg-js-framework” presented earlier in Figure 1.
After this step, we augment the merged JSON with confidence levels for the edges. The confidence levels are calculated based on a manual evaluation of 600 out of the 82,791 call edges found in 12 real-world Node.js modules: we calculated the True Positive Rate for each tool intersection and estimate the confidence of a call edge with these rates. For instance, if a call edge was found by tools A and B only, and in the manually evaluated sample there were ten edges found by exactly these tools, out of which five turned out to be valid call edges, we assign a confidence of 0.5 to all such edges.
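As an illustration, this confidence estimation can be sketched as follows (the data layout is our assumption; the paper only fixes the idea that an edge’s confidence is the True Positive Rate of its tool intersection in the manually evaluated sample):

from collections import defaultdict

def confidence_table(labeled_edges):
    # labeled_edges: (tools, is_valid) pairs, where `tools` is the set of
    # call-graph tools that reported the edge in the manual sample
    found = defaultdict(int)
    valid = defaultdict(int)
    for tools, is_valid in labeled_edges:
        key = frozenset(tools)
        found[key] += 1
        valid[key] += is_valid
    # confidence of a tool intersection = its True Positive Rate in the sample
    return {key: valid[key] / found[key] for key in found}

sample = [({"A", "B"}, True), ({"A", "B"}, False)]
print(confidence_table(sample)[frozenset({"A", "B"})])  # 0.5, as in the example above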
Figure 3 shows a Venn diagram of the call edges found in the 12 Node.js modules. We have an evaluation ratio for each intersection, which the framework uses to estimate the edge confidence levels.
To sum up, a hybrid call-graph is a call-graph (produced by combining the results of static and dynamic analysis) that associates a confidence factor with each call edge, expressing how likely the edge is valid (higher confidence means higher validity).
This hybrid call-graph is the input of the HNII, HNOI Counter, which is responsible for calculating the actual numbers of incoming and outgoing invocations (i.e., NII, NOI). At this point, we have to specify a threshold value that defines the lower limit from which we consider a call edge valid, and thus contributing to the number of incoming and outgoing invocations. We considered four threshold values: 0.00, 0.05, 0.20, and 0.30. With the first one, all edges are considered possibly valid call edges, while the last one includes only edges with a high confidence factor. We name the two new metrics HNII (Hybrid Number of Incoming Invocations) and HNOI (Hybrid Number of Outgoing Invocations) to differentiate them from the original static NII and NOI metrics.
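As an illustration of this counting step, a minimal sketch over a hypothetical JSON edge list with per-edge confidences (this is our simplified format, not the actual HNII, HNOI Counter implementation) could look like this:

import json
from collections import Counter

def count_invocations(edges_json, threshold=0.0):
    # Only edges whose confidence reaches the threshold count as valid calls.
    hnoi, hnii = Counter(), Counter()
    for edge in json.loads(edges_json):
        if edge["confidence"] >= threshold:
            hnoi[edge["from"]] += 1  # outgoing invocation of the caller
            hnii[edge["to"]] += 1    # incoming invocation of the callee
    return hnoi, hnii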
Listing 1. Sample output from the HNII, HNOI Counter.
3.3. SourceMeter and Patch Extraction
Besides the computed HNII and HNOI metrics, a standard set of metrics is provided by a static source code analyzer named SourceMeter (https://www.sourcemeter.com/).
Similarly, the Patch Extractor component extracts the patches for these 333 bug-fixing commits.
3.4. Composing Buggy Entries
To calculate the HNOI and HNII metrics, one needs to apply a threshold to the call edges (to decide which edges to consider valid) in the underlying hybrid (also called fuzzy) call-graph produced by the hcg-js-framework (see Section 3.2). We calculated the metric values (all the data used in this study is available online [43]) with four different thresholds: 0, 0.05, 0.2, and 0.3. Table 1 shows the descriptive statistics of these metrics on our ESLint dataset.
As can be seen, thresholds of 0.20 and above significantly reduce the number of edges considered in the HNII and HNOI calculation. We wanted to use as many of the extracted call edges as possible, so we chose the 0.00 threshold for the rest of the study (i.e., we considered each edge in the fuzzy call-graph whose weight/confidence is greater than or equal to zero).
We trained several models on the dataset with three different configurations for the features (a small sketch of assembling these feature sets follows the list):
Purely static metrics (S): the dataset contains only the pure static source code metrics (i.e., the original versions of NOI and NII plus all the metrics provided by SourceMeter, see Section 3.3);
Static metrics with only hybrid NOI and NII versions (H): the dataset contains all the static metrics except NOI and NII, which are replaced by their hybrid counterparts (HNOI and HNII) calculated on the output of the hcg-js-framework;
Both static and hybrid metrics (S + H): the dataset contains all the static metrics plus the hybrid counterparts of NOI and NII (HNOI and HNII) calculated on the output of the hcg-js-framework.
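A small sketch of how these three feature sets could be assembled (the file name and column layout are illustrative assumptions, not the actual dataset format):

import pandas as pd

df = pd.read_csv("eslint_functions.csv")  # one row per function
label = df["bug"]
hybrid_cols = ["HNOI", "HNII"]
static_cols = [c for c in df.columns if c not in ["bug"] + hybrid_cols]

S = df[static_cols]                    # purely static metrics
H = df[[c for c in static_cols if c not in ("NOI", "NII")] + hybrid_cols]
SH = df[static_cols + hybrid_cols]     # both static and hybrid metrics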
To have a robust understanding of the hybrid metrics’ impact, we trained nine different machine learning models:
Logistic Regression Classifier—Logistic regression is a statistical model that uses a logistic function to model a binary dependent variable (implemented by sklearn.linear_model.LogisticRegression);
Naive Bayes Classifier—Naive Bayes classifier is a simple “probabilistic classifier” based on applying Bayes’ theorem with strong (naïve) independence assumptions between the features (implemented by sklearn.naive_bayes.GaussianNB);
Decision Tree Classifier—Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression, where the goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features (implemented by sklearn.tree.DecisionTreeClassifier, an optimized version of the CART algorithm);
Linear Regression Classifier—Linear regression is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables, also known as the dependent and independent variables (implemented by sklearn.linear_model.LinearRegression);
Standard DNN Classifier—A deep neural network (DNN) is an artificial neural network (ANN) with multiple layers between the input and output layers (implemented using TensorFlow's tf.layers.dense);
Customized DNN Classifier—A custom version of the standard DNN implementing an early stopping mechanism, where we do not train the models for a fixed number of epochs but stop when there is no further reduction in the loss function (implemented using TensorFlow's tf.layers.dense);
Support Vector Machine Classifier—Support-vector machine (SVM) is a supervised learning model, which is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible (implemented by sklearn.svm.SVC);
K Nearest Neighbors Classifier—The k-nearest neighbors algorithm (k-NN) is a non-parametric method for classification and regression, where the input consists of the k closest training examples in the feature space (implemented by sklearn.neighbors.KNeighborsClassifier);
Random Forest Classifier—Random forest is an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean/average prediction (regression) of the individual trees (implemented by sklearn.ensemble.RandomForestClassifier).
With the various hyper-parameters, this added up to a total of 36 configurations. We executed all 36 training tasks on all three feature sets, thus creating 108 different ML models for comparison. To cope with the highly imbalanced nature of the dataset (i.e., there are significantly more non-buggy functions than buggy ones), we applied a 50% oversampling of the minority class. We also standardized all the metric values to bring them to the same scale. For model training and evaluation, we used our open-source DeepWater Framework (https://github.com/sed-inf-u-szeged/DeepWaterFramework), which contains the implementation of all the above algorithms.
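As a rough sketch of this evaluation setup (assuming 0/1 bug labels; the hold-out split, the interpretation of the 50% oversampling, and the hyper-parameters are illustrative, not the exact DeepWater Framework configuration):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

def train_and_eval(X, y, seed=42):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    y_tr = np.asarray(y_tr)

    # Standardize all metric values to bring them to the same scale
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

    # Grow the buggy (minority) class by 50% through random oversampling
    minority = X_tr[y_tr == 1]
    extra = resample(minority, n_samples=len(minority) // 2, random_state=seed)
    X_tr = np.vstack([X_tr, extra])
    y_tr = np.concatenate([y_tr, np.ones(len(extra), dtype=int)])

    # Any of the nine classifiers above could be plugged in here
    clf = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    p, r, f, _ = precision_recall_fscore_support(
        y_te, clf.predict(X_te), average="binary")
    return p, r, f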
To ensure that the results are robust against the chosen threshold, we also trained the same 108 models with a threshold value of 0.30 for the HNII and HNOI calculation. We found that the differences among the S, H, and S + H feature sets became smaller, but the general tendency that H and especially S + H features achieve better results remained. Therefore, in the rest of the paper, we use the HNII and HNOI metrics calculated with the 0.00 threshold without loss of generality. In the following, we present our findings.
4. Results
4.1. The Best Performing Algorithms
Figure 4 displays a heat-mapped table of the top 10 model results based on their recall values. We ranked all 108 models, meaning that all three feature sets appear on the same list. We can measure recall with the following formula:
$recall = \frac{TP}{TP + FN},$
where TP means True Positive samples, while FN means False Negatives. As we can see, the DNN (0.642) and KNN (0.635) models achieve the best recall values on the S + H feature set. The same models produce almost as high recall values (0.631) using only the H feature set. The best performing model on the S feature set is KNN, with a significantly lower recall value (0.619). This shows that hybrid invocation metrics do increase the performance of the ML models in terms of recall. The best values are achieved by keeping both the original NOI and NII metrics and adding their hybrid counterparts HNOI and HNII, but using only the latter as substitutes for the static metrics still improves recall values.
To visualize the differences in the various performance measures, we plotted bar-charts (Figure 5 and Figure 6) with the best DNN configurations (i.e., applying the set of hyper-parameters with which the model achieves the best performance) for all three feature sets. Blue marks the results using the S + H feature set, cyan the H feature set, while yellow the S feature set. In general, the S + H results are superior, while the H results are better than the S results except for the numbers of False Positive and True Negative instances. The chart shows a constant 3–4% improvement in all aspects of the DNN model results when we add the hybrid metrics to the feature set.
Figure 7 displays a heat-mapped table of the top 10 model results based on their precision values. We ranked all 108 models, meaning that all three feature sets appear on the same list. We can measure precision with the following formula:
$precision = \frac{TP}{TP + FP},$
where TP means True Positive samples, while FP means False Positives. As we can see, the SVM model (0.829) achieves the best precision value on the H feature set. Interestingly, SVM produces an almost as high precision value (0.827) using only the S feature set as well. On the S + H feature set, SVM achieves a precision value of 0.824. This shows that hybrid invocation metrics do increase the performance of the ML models in terms of precision as well, though not as significantly as in the case of recall. Nonetheless, for algorithms other than SVM, the increase is more pronounced.
To visualize the differences in the various performance measures, we plotted bar-charts (Figure 8 and Figure 9) with the best SVM configurations for all three feature sets. Blue marks the results using the S + H feature set, cyan the H feature set, while yellow the S feature set. In general, the S + H results are superior, while the H results are still better than the S results for all measures. The chart shows a constant 1–2% improvement in all aspects of the SVM model results when we add the hybrid metrics to the feature set.
Figure 10 displays a heat-mapped table of the top 10 model results based on their F-measure values. We ranked all 108 models, meaning that all three feature sets appear on the same list. We can calculate F-measure with the following formula:
$F\text{-}measure = \frac{2 \cdot precision \cdot recall}{precision + recall}.$
As we can see, the Random Forest (0.648) and KNN (0.641) models achieve the best F-measures on the S + H feature set. The same Random Forest models produce almost as high F-measures (0.647) using only the H feature set. The best performing model on the S feature set is not even in the top 10. This shows that hybrid invocation metrics do increase the performance of the ML models in terms of F-measure, i.e., they improve the models’ overall performance. The best values are achieved by keeping both the original NOI and NII metrics and adding their hybrid counterparts HNOI and HNII, but using only the latter as substitutes for the static metrics still improves the F-measure significantly.
To visualize the differences in the various performance measures, we plotted bar-charts (Figure 11 and Figure 12) with the best Random Forest configurations for all three feature sets. Blue marks the results using the S + H feature set, cyan the H feature set, while yellow the S feature set. The S + H and H results are better than the S results for all measures except for recall, where the difference is only marginal. The chart shows a constant 1–2% improvement in all aspects of the Random Forest model results, while precision is higher by approximately 10% when we add the hybrid metrics to the feature set.
4.2. The Most Balanced Algorithm
K-nearest neighbors models stand out in that they produce the most balanced performance measures. As can be seen in Figure 13 and Figure 14, both precision and recall values are above 0.6, therefore the F-measure is above 0.6 as well. For this model, the H feature set brings a 1–2% improvement over the S feature set, while the S + H feature set results in a 2–5% increase in performance.
4.3. Significance Analysis of the Performance Measures
Despite a seemingly consistent increase in every model performance measure caused by adding the hybrid source code metrics to the features, we cannot be sure that this improvement is statistically significant. Therefore, we performed a Wilcoxon signed-rank test [45] on the model F-measure values between each pair of feature sets (S vs. H, S vs. S + H, H vs. S + H). The detailed results (T statistics and p-values) are shown in Table 2.
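For reference, such a paired test can be reproduced with SciPy along the following lines (the F-measure values below are placeholders, not our measurements):

from scipy.stats import wilcoxon

# Paired F-measures of the same model configurations on two feature sets
f_S = [0.55, 0.58, 0.60, 0.52, 0.61, 0.57]
f_SH = [0.57, 0.60, 0.63, 0.55, 0.62, 0.60]

t_statistic, p_value = wilcoxon(f_S, f_SH)
print(t_statistic, p_value)  # the improvement is significant if p < 0.05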
4.4. Results Overview and Discussion
In the previous sections, we analyzed the best performing algorithms with a focus on the improvements brought by the hybrid source code metrics. However, to give a complete picture of the results, we summarize the performances of all nine machine learning algorithms here. Table 3 shows the best prediction performances (i.e., the models with the best performing hyper-parameters and feature set) of all nine algorithms according to their F-measures.
5. Threats to Validity
There are several threats to the validity of the presented empirical study. As a training set, we used 333 bugs from only one system. Therefore, the results might be specific to this system and might not generalize well. However, ESLint is a large and diverse program containing a representative set of issues. Additionally, the bugs are manually filtered, thus they do not introduce noise into the prediction models. As a result, we believe that our study is meaningful, though a replication with more subject systems would be beneficial.
The threshold value chosen for calculating the hybrid call edges might affect the ML model performances. We selected a threshold of 0 in our case study (i.e., counted every edge with a weight greater than or equal to zero); however, we carried out a sensitivity analysis with different thresholds as well. Even though the calculated HNOI and HNII values changed based on the applied threshold, the model improvements using these values proved to be consistent with the ones presented in the study. Therefore, we believe that the essence of the results is independent of the choice of the particular threshold value.
Finally, the derived confidence weights might be inaccurate, as we derived them from a manual evaluation of a small sample of real call edge candidates. To mitigate the risk of human error, two senior researchers evaluated all the edges and had to agree on each call label. For sampling, we applied a stratified selection strategy: we evaluated more call samples from subsets of tools that find more edges in general, thus increasing the confidence in the derived weights.
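A minimal sketch of such a stratified selection (the data layout and sample sizes are illustrative assumptions, not our exact procedure):

import random
from collections import defaultdict

def stratified_edge_sample(edges, sample_sizes, seed=42):
    # edges: (tool_combination, edge) pairs; sample_sizes: how many edges to
    # draw per tool combination (larger strata get larger samples)
    rng = random.Random(seed)
    strata = defaultdict(list)
    for tools, edge in edges:
        strata[frozenset(tools)].append(edge)
    return {key: rng.sample(members, min(sample_sizes.get(key, 0), len(members)))
            for key, members in strata.items()}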
6. Conclusions
In this paper, we created three versions of a training dataset from the functions of the ESLint project. We used the BugsJS public dataset to find, extract, and map buggy functions in ESLint. We ended up with a dataset containing 824 buggy and 1943 non-buggy functions with three sets of features: static metrics only (S); static metrics where the invocation metrics (NOI and NII) are replaced by their hybrid counterparts (HNOI and HNII) (H); and static metrics with the addition of the hybrid metrics (S + H).
We trained nine different models in 108 configurations and compared their results. We found that using invocation metrics calculated by a hybrid code analysis as bug prediction features consistently improves the performance of the ML prediction models. Depending on the ML algorithm, the applied hyper-parameters, and the target measure we consider, hybrid invocation metrics bring a 2–10% increase in model performance (i.e., precision, recall, F-measure). Interestingly, even though replacing the static NOI and NII metrics with their hybrid counterparts HNOI and HNII in itself improves model performance, most of the time keeping them both yields the best results, which means that they hold somewhat complementary information. To achieve the highest accuracy and precision, one should choose the Random Forest Classifier, but if recall is also important and one wants results that are as balanced as possible, the K Nearest Neighbors Classifier is the best option.