Are Source Code Metrics “Good Enough” in Predicting Security Vulnerabilities?

Abstract: Modern systems produce and handle a large volume of sensitive enterprise data. Therefore, security vulnerabilities in software systems must be identified and resolved early to prevent security breaches and failures. Predicting security vulnerabilities is an alternative to identifying them as developers write code. In this study, we examined the ability of several machine learning algorithms to predict security vulnerabilities. We created two datasets containing security vulnerability information from two open-source systems: (1) Apache Tomcat (seven versions, v4.x to v10.x) and (2) Apache Struts2-core (eleven versions: six 2.3.x and five 2.5.x minor versions). We also computed source code metrics for these versions of both systems. We examined four classifiers, namely Naive Bayes, Decision Tree, XGBoost, and Logistic Regression, to assess their ability to predict security vulnerabilities. Moreover, an ensemble learner was introduced using a stacking classifier to see whether the prediction performance could be improved. We performed cross-version and cross-project predictions to assess the effectiveness of the best-performing model. Our results showed that the XGBoost classifier performed best compared to the other learners, with an average accuracy of 97% on both datasets. The stacking classifier achieved an average accuracy of 92% in Struts and 71% in Tomcat. Our best-performing model, XGBoost, predicted with an average accuracy of 87% in Tomcat and 99% in Struts in the cross-version setup.


Introduction
Security vulnerability identification is essential to implementing and operating secure software systems. As such, the security of a system is also subject to the quality of its underlying source code. Poor code quality may make the system vulnerable to third-party access, i.e., introduce security threats. A previous study showed that an attack on the Equifax system exposed 143 million user accounts by exploiting a vulnerability in the open-source Apache Struts framework [1].
Previously, identification and prevention of security vulnerabilities or software bugs was performed through manual reviews, which is impossible today for systems with large codebases. Thus, automatic code inspection tools (e.g., JSLint [2] or SonarQube [3]) were developed to identify vulnerabilities in software based on known code patterns or structures. However, the challenge of automatically identifying security vulnerabilities as early as possible in the software development lifecycle remains. In the literature, various software testing methods for identifying vulnerabilities, e.g., fuzz testing (fuzzing), have been shown to be useful, albeit at later stages of the software development lifecycle [4].
NordVPN reported that, without security patches installed, attackers may exploit vulnerabilities [5]. This study aims to assist developers and security analysts by identifying security vulnerabilities in software systems. To that end, we studied security vulnerabilities in two systems: a Jakarta servlet engine (Apache Tomcat, Forest Hill, MD, USA) and an MVC framework for creating elegant, modern Java web applications (Apache Struts, Forest Hill, MD, USA). We chose Apache Tomcat and Struts2-core because they are open-source systems with publicly available security logs.
The presence of security vulnerabilities in a system significantly increases the cost of quality assurance. In particular, modern large systems have a large volume of codebase, which makes it hard and time-consuming for developers to trace security vulnerabilities in the source code. Security vulnerabilities in a system usually go unmarked until the system is delivered and deployed. Therefore, in this study, intending to help developers and security analysts, we construct several machine learning (ML) models to predict security vulnerabilities in the source code, relying on static source code metrics that are easy to gather from the source code. We rely on publicly available vulnerability information to build two vulnerability datasets.
In this study, we answer the following four research questions: • RQ1: How do the individual learners perform while predicting the security vulnerabilities? • RQ2: Can we employ ensemble learning techniques to achieve a high predictive performance? • RQ3: How does the best performing model perform in cross-version prediction? • RQ4: How does the best performing model perform in cross-project prediction?
Our main contributions include: • Two datasets comprising vulnerability information and static source code metrics from two open-source systems: Apache Tomcat (seven versions, v4.x to v10.x) and Apache Struts2-core (eleven versions: six 2.3.x and five 2.5.x minor versions); • The best feature set comprising static source code metrics to predict security vulnerabilities; • A set of ML models to predict security vulnerabilities (type, severity, and title); and • Evaluation of the best-performing predictive model in the within-project, cross-version, and cross-project setups.
The rest of the article is organized as follows: Section 2 presents background information relevant to the study, while Section 3 describes our research method. Section 4 presents the results of the experiments, and Section 5 provides the related work which inspired our study. We discuss results and additional experiments in Section 6. Finally, Section 7 concludes the article and outlines future plans.

Background
This section introduces security vulnerabilities in the two open-source systems (Section 2.1), the list of source code metrics we employed for building our prediction models (Section 2.2), and the machine learning (ML) algorithms used in this study (Section 2.3).

Security Vulnerabilities
Apache Tomcat has 22 distinct security vulnerabilities listed on the Apache Tomcat security page (https://tomcat.apache.org/security.html, accessed on 14 July 2022). The distribution of vulnerabilities in Apache Tomcat is shown in Figure 1. Common vulnerabilities in Apache Tomcat include Denial of Service [6], Remote Code Execution [7], and Information Disclosure [8].
In the versions of Apache Struts we studied (https://struts.apache.org/security/, accessed on 14 July 2022), there are eight unique security vulnerabilities. The distribution of vulnerabilities in Apache Struts is shown in Figure 2. Below, we discuss common security vulnerabilities.
A resource is made inaccessible in the Denial of Service attack. When a service is loaded with requests, authorized users cannot make requests [6]. In Remote Code Execution attack, an automated script is written to provide remote access to a potentially compromised system with an administrative privilege [7]. The Information Disclosure vulnerability occurs when a service or website inadvertently reveals sensitive information (e.g., passwords or financial details) to the users [8].
AJP Request Injection and Potential Remote Code Execution vulnerability occurs if AJP (Apache Jserv Protocol) connections are trusted prior to validation [9]. The Arbitrary File Deletion vulnerability occurs as a result of a Directory Traversal attack where an attacker, using malformed input, removes files on the web application. The HTTP/2 request mix-up vulnerability occurs when clients crowd the server with connections and the user requests made in order become mixed up. As a result, clients see responses for the previous requests. In the cross-site request forgery attack, a malicious operation is created to execute during the session of an authenticated user [10].
Finally, a Request Smuggling vulnerability occurs with non-validated incoming HTTP requests: an attacker sends malformed requests, resulting in, for example, an information disclosure vulnerability or a directory traversal attack [11].

Source Code Metrics
Table 1 provides definitions for the source code metrics considered (as features for training the models) in this study.

ML Algorithms
This section describes the ML algorithms employed in this study.

Naive Bayes Classifier (NB)
The Naive Bayes classifier is based on Bayesian probability. The model assumes that all features are independent of each other given the class variable [13]. Suppose a class has a CBO (Coupling Between Objects) of 13, a logStatementsQty of 42, and a TCC (Tight Class Cohesion) of 0.0298, and that this class is considered vulnerable. The NB classifier then treats each of these features as contributing independently to the probability that the class is vulnerable, regardless of any correlations among the CBO, logStatementsQty, and TCC features. This would, in turn, help us identify which individual feature affects the occurrence of a security vulnerability. Additionally, these classifiers are particularly useful on large datasets such as ours. The Naive Bayes formula is defined as follows.

P(c|x) = P(x|c)P(c)/P(x)
Naive Bayes is a conditional probability model. Given a classification problem instance represented by a vector x = (x1, . . . , xn) of n features (i.e., independent variables), it assigns to this instance the probabilities P(Ck | x1, . . . , xn) for each of the K possible outcomes or classes.
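The independence assumption above can be illustrated with a small, self-contained sketch. The priors and per-feature likelihoods below are made-up toy values (not from our datasets), and the feature names `high_cbo` and `many_logs` are hypothetical binarized metrics:

```python
# Toy illustration of the naive independence assumption: the posterior for
# each class is proportional to the prior times the product of per-feature
# likelihoods, normalized by the evidence P(x).
def naive_bayes_posterior(priors, likelihoods, observed):
    """priors: {class: P(c)}; likelihoods: {class: {feature: {value: P(x|c)}}}."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for feature, value in observed.items():
            score *= likelihoods[c][feature][value]  # independence assumption
        scores[c] = score
    total = sum(scores.values())  # P(x), the normalizing evidence
    return {c: s / total for c, s in scores.items()}

priors = {"vulnerable": 0.15, "clean": 0.85}
likelihoods = {
    "vulnerable": {"high_cbo": {True: 0.7, False: 0.3},
                   "many_logs": {True: 0.6, False: 0.4}},
    "clean":      {"high_cbo": {True: 0.2, False: 0.8},
                   "many_logs": {True: 0.3, False: 0.7}},
}
posterior = naive_bayes_posterior(priors, likelihoods,
                                  {"high_cbo": True, "many_logs": True})
```

Even with a low prior for the vulnerable class, two features whose likelihoods favor vulnerability push the posterior above one half in this toy setup.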

Logistic Regression (LR)
Contrary to its nomenclature, Logistic Regression (LR) is mainly used to classify target labels. Moreover, this model uses the Sigmoid function (logistic function) to predict or classify the targets. Figure 3 displays the curve traced by the Sigmoid function.
As explained in [14], Logistic Regression is essentially linear regression with a Sigmoid activation function. Linear regression works based on the formula for the slope of a line, y = mx + b. In Logistic Regression, the y value obtained from linear regression is passed as a parameter to the Sigmoid function, which maps it into the interval (0, 1); thresholding this value yields the binary output (0 or 1). As the linear score goes toward positive infinity, the output approaches 1; as it goes toward negative infinity, the output approaches 0. The entire concept of this model is based on probability, so it provides confidence in predicting the targets. Additionally, this model can be used for binary and multi-class classification [14].
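A minimal sketch of this mechanism, using only the standard library (the 0.5 threshold is the conventional default, not something specific to our study):

```python
import math

def sigmoid(z):
    """Logistic function: maps the linear score z = m*x + b into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z, threshold=0.5):
    """Binary decision: large positive scores map to 1, large negative to 0."""
    return 1 if sigmoid(z) >= threshold else 0
```

For example, a score of exactly zero gives a probability of 0.5, while strongly positive or negative scores saturate toward 1 and 0, respectively.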

Decision Tree Classifier (DT)
A decision tree model follows a tree-like structure with the root at its top node. The tree branches are called edges, and each branch's end is a leaf. The decision trees can be used for both classification and regression purposes. However, since we have a classification problem at hand, this section will focus only on decision trees as classifiers.
In theory, the decision tree splits each feature into a binary decision, and the result of each decision is passed along to the next decision, influencing the final decision. Data come in records of the form (x, y) = (x1, x2, . . . , xn, y). The dependent variable y is the target variable we are trying to classify or generalize, and the vector x is composed of the features x1, x2, x3, and so on, that drive the prediction.
A common problem in DT that needs to be addressed is overfitting. Overfitting occurs when an ML model fits the training data too closely and memorizes its specifics instead of generalizing from learned patterns [15]. In the case of DT, overfitting is handled by pruning the tree. In layman's terms, this translates to intentionally snipping certain leaves of the DT so that the model fits the training data less tightly and generalizes better [16].
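As an illustration of pruning, the sketch below compares an unpruned tree with one pruned via scikit-learn's cost-complexity parameter `ccp_alpha`; the synthetic data and the alpha value are illustrative choices, not the configuration used in our experiments:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# An unpruned tree is free to grow until it memorizes the training data.
unpruned = DecisionTreeClassifier(random_state=42).fit(X, y)

# Cost-complexity pruning (ccp_alpha > 0) snips leaves that add little
# predictive value, trading training-set fit for generalization.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42).fit(X, y)

print(unpruned.tree_.node_count, pruned.tree_.node_count)
```

The pruned tree ends up with fewer nodes, while the unpruned tree typically scores perfectly on its own training data, a classic symptom of overfitting.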

XGBoost Classifier (XGB)
XGBoost employs the Gradient Boosting technique to acquire more precise predictions. Usually, in ML predictions, errors are categorized into two kinds: bias and variance. Data bias in ML is an error in which certain dataset elements are more heavily weighted or represented than others. On the other hand, variance is an error from sensitivity to small fluctuations in the training set. A high variance may result from an algorithm modeling the random noise in the training data.
The bias-variance dilemma (or bias-variance problem) is the conflict in simultaneously minimizing these two sources of error, which prevents supervised ML algorithms from generalizing beyond their training set [17]. Gradient boosting is a technique to mitigate bias errors in an ML model. In the simplest case, the base estimator of such a model is a one-level decision tree, or decision stump [18]. The XGBoost classifier is an optimized, regularized implementation of gradient boosting, related to AdaBoost and other gradient boosting algorithms.
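To illustrate boosting with decision stumps, the sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in (the `xgboost` package's `XGBClassifier` exposes a similar fit/predict interface); the synthetic data and hyperparameters are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Each shallow tree (max_depth=1, i.e., decision stumps) is fitted to the
# residual errors of the ensemble built so far, reducing bias step by step.
model = GradientBoostingClassifier(n_estimators=200, max_depth=1,
                                   learning_rate=0.1, random_state=0)
model.fit(X_tr, y_tr)
accuracy = model.score(X_te, y_te)
```

Despite each stump being a weak learner on its own, the boosted ensemble reaches solid held-out accuracy on this separable toy problem.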

Stacking Classifier (STK)
Stacking is an ensemble learning technique using multiple predictions and processing levels. A simple stacking classifier has a level 0 of multiple base classifiers. The predictions from level 0 are then passed to a meta-classifier (a level 1 classifier), which gives the final prediction. The stacking classifier should be a last resort in most cases due to its runtime and resource demands [19].

Research Method
Figure 4 illustrates our research method with five phases. In the following, we explain each of the phases:

Phase 1: Data Collection
The first phase concerns gathering the security vulnerability information from the Apache Tomcat and Struts2-core security logs. This is accomplished by downloading the HTML files for each version and using the Python package BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/bs4/download/4.0/, accessed on 14 July 2022) to extract the GitHub links that point to the files impacted by a vulnerability. For each security vulnerability, we collect the type of vulnerability (22 vulnerability titles), the severity of the vulnerability (four severity levels: High, Important, Moderate, and Low), and the Java classes in Apache Tomcat affected by the vulnerability. This way, a CSV (comma-separated values) file is generated with all the affected classes and the vulnerabilities affecting them for both systems. Similarly, for the Struts2-core system, a total of 45 vulnerable classes were obtained, with eight unique vulnerability titles.
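The link-extraction step can be sketched without third-party dependencies using the standard library's `html.parser` (our pipeline used BeautifulSoup; the HTML fragment below is made up for illustration):

```python
from html.parser import HTMLParser

class GitHubLinkExtractor(HTMLParser):
    """Collects href attributes that point to GitHub (e.g., fix commits)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and "github.com" in value:
                    self.links.append(value)

# A made-up fragment standing in for a downloaded security-log page.
page = """
<p>Fixed in <a href="https://github.com/apache/tomcat/commit/abc123">abc123</a>,
see also <a href="https://tomcat.apache.org/security.html">the log</a>.</p>
"""
parser = GitHubLinkExtractor()
parser.feed(page)
```

Only the GitHub commit link is kept; links back to the security log itself are ignored.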
Afterwards, we gathered the Java source code of each relevant Apache Tomcat and Struts version. Using this source code, we computed class-level static source code metrics for all Java classes in each version of the two systems using the CK tool [12]. This step resulted in 43 source code metrics (listed in Table 1), stored in one CSV per system. Finally, for each of the two systems, the two CSVs (i.e., vulnerability information and source code metrics data) were merged using a right join, resulting in 12,214 rows in the raw dataset of Apache Tomcat and 19,372 rows for Apache Struts.
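The merge step can be sketched with pandas; the column names (`class_name`, `cbo`, `title`) and the tiny frames are hypothetical stand-ins for the real CSVs:

```python
import pandas as pd

# Hypothetical per-class metrics (one row per Java class)...
metrics = pd.DataFrame({
    "class_name": ["A.java", "B.java", "C.java"],
    "cbo": [13, 2, 5],
})
# ...and vulnerability records covering only the affected classes.
vulns = pd.DataFrame({
    "class_name": ["A.java"],
    "title": ["Denial of Service"],
})

# A right join keeps every metrics row; unaffected classes get NaN titles,
# which are later labeled as the negative (non-vulnerable) class.
merged = vulns.merge(metrics, on="class_name", how="right")
```

The right join is what lets unaffected classes survive into the dataset as negative samples instead of being dropped.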

Phase 2: Data Clean-Up
The second phase concerns dataset cleanup, i.e., handling nonessential information that might affect the performance of our predictive models. Aside from the data cleaning strategies, we also faced a class imbalance issue, as shown in Figure 5. To mitigate this issue, oversampling techniques were introduced: the dataset underwent random oversampling before being put to prediction. Undersampling techniques were also considered for the experiment; however, since undersampling did not reach the prediction accuracy of oversampling, its results are discussed in Section 6.
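Random oversampling simply duplicates randomly chosen minority-class rows until the classes balance. A minimal pure-Python sketch (in practice, imbalanced-learn's `RandomOverSampler` provides this; the rows and labels below are toy values):

```python
import random

def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen minority-class rows until classes balance."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(v) for v in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        resampled = members + [rng.choice(members)
                               for _ in range(target - len(members))]
        for row in resampled:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

rows = [[13, 42], [2, 0], [5, 7], [1, 1], [9, 3]]
labels = [1, 0, 0, 0, 0]          # 1 = vulnerable (minority class)
X_bal, y_bal = random_oversample(rows, labels)
```

After oversampling, both classes contribute the same number of rows, which is what makes imbalance-sensitive learners trainable on such data.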

Phase 3: Feature Selection
The tool in [12] can compute 43 source code metrics at the class level. Individual metrics are regarded as features. A subset of features with a higher impact on the predictor variable is selected by applying feature selection techniques. In this study, we applied two feature selection techniques: • Sequential Forward Selection (SFS), which starts with an empty model, fits the model with one additional feature at a time, and selects the feature set with the minimum p-value [20]; • Recursive Feature Elimination (RFE), which, in contrast, relies on greedy optimization and eliminates the least relevant features one by one [21].
Thus, we experiment with two opposite feature selection methods and explore their vulnerability prediction ability.
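The RFE side of this comparison can be sketched with scikit-learn; the synthetic data stands in for the 43 class-level metrics, and the estimator and feature count are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for the class-level source code metrics.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=1)

# RFE repeatedly fits the estimator and drops the weakest feature(s)
# until only the requested number remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X, y)
selected = [i for i, keep in enumerate(selector.support_) if keep]
```

scikit-learn's `SequentialFeatureSelector` offers the complementary forward-selection strategy with the same `fit`/`support_` interface.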

Phase 4: Building ML Models
The study focuses on a classification problem, where the target variables are the security vulnerabilities. The dataset was observed to be imbalanced; hence, accuracy proved to be a poor evaluation metric. Thus, precision and recall were chosen as the evaluation metrics.
We build the classifiers using three supervised ML algorithms: Naive Bayes (NB), Logistic Regression (LR), and Decision Tree (DT). The decision rule in NB is based on the Bayes' theorem [22]. The LR, an extension of linear regression, works better for discrete or categorical outcomes, where probabilities accompany predicted values. In DT, observations about an item are followed by conclusions about its target feature using a choice/selection tree. In the tree, the leaves serve as category labels, and the branches produce category labels according to the available alternatives or values of the features [23].
We make sure that the ML models are optimized independently. For the DT, we prune the model to compensate for overfitting [15]. Thus, our pruned DT could predict with higher accuracy on unseen data instead of producing a hyper-accurate fit to the training data, which helped to predict the vulnerability of a class, i.e., whether a class had a security vulnerability or not. It is worth mentioning that the DT-based model was not sufficient to predict the severity (level of the vulnerability) or the title (name of the vulnerability), i.e., the DT-based classifier cannot handle multi-class prediction well.
Moreover, none of the three models yielded decent results when predicting the other two target variables. Hence, a fourth ML model (the XGBoost classifier) was introduced. The introduction of this model proved fruitful, as it outperformed the other models.

RQ1 (Performance of Individual Learner)
For RQ1, both feature selection algorithms were put to use, each selecting two sets of features (one per dataset). This experiment compares the ML models to find the optimal combination of classifier and feature set. It is conducted by fitting both datasets to the ML models and letting them predict all the target variables.

RQ2 (Performance of an Ensemble Learner)
RQ2 explores whether we could improve prediction performance using ensemble learning techniques. For this and further RQs, only the better-performing feature set is selected. The experiments are conducted, and the performance metrics are recorded.

RQ3 (Cross-version Prediction)
RQ3 explores the performance of the best-performing model (among the four models) from RQ1 as it attempts to predict vulnerabilities within a dataset, with different versions or combinations of versions as the training and test datasets. The data separation takes place as follows:
1. The total number of versions (n) is extracted from the dataset; for example, n is 7 for Tomcat and 11 for Struts;
2. One version v1 and another version v2 (out of the n versions) are selected as the training and test sets for the first iteration. For instance, in the first iteration for Tomcat, version 4 is used for training, and the models are tested on version 5;
3. For the next iteration, v1 + v2 and v3 are selected as the training and test sets, respectively. Thus, the model is trained on versions 4 and 5 and tested on version 6. This process continues until the final version, with the training set containing one or more versions and the models being tested on the subsequent version.
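The incremental split above can be sketched as a small generator (the version numbers are Tomcat's, as listed in our dataset description):

```python
def cross_version_splits(versions):
    """Yield (training_versions, test_version) pairs: train on all versions
    seen so far, test on the next one."""
    for i in range(1, len(versions)):
        yield versions[:i], versions[i]

# With Tomcat's seven major versions this produces six train/test groups.
tomcat_versions = [4, 5, 6, 7, 8, 9, 10]
splits = list(cross_version_splits(tomcat_versions))
```

The first split trains on version 4 and tests on version 5; the last trains on versions 4 through 9 and tests on version 10.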

RQ4 (Cross-project Prediction)
RQ4 explores whether the data points of one system (Tomcat) can be used to train the models and be tested on the other system (Struts) to observe how well the best-performing model (among the four models) performs in a different system. This experiment is performed with Tomcat as the training dataset and Struts as the test dataset and vice versa. The better-performing feature set recorded from RQ1 is used here.

Cross Validation
For cross-validation, in this study, we applied the stratified k-fold technique. As a resampling procedure, stratified k-fold can be employed to evaluate ML models on limited-sized data. The technique needs a parameter, k, that determines the number of folds; e.g., k = 10 leads to 10-fold cross-validation. Thus, we divided the dataset into k groups, in turn considered each group as a hold-out set and the remaining k − 1 groups as the training set, and repeated this process k times. It is important to mention that we did not apply stratified k-fold cross-validation for ensemble learning because it would be highly resource-demanding in terms of time and computational power.
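A brief scikit-learn sketch of stratified folding on imbalanced toy data (the 80/20 class split below is illustrative, not our datasets' actual ratio):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy data: roughly 80% negative, 20% positive.
X, y = make_classification(n_samples=100, weights=[0.8], random_state=0)

# Each fold preserves the overall class ratio, which matters for
# imbalanced vulnerability data.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_sizes = [len(test_idx) for _, test_idx in skf.split(X, y)]
```

Every sample appears in exactly one hold-out fold, and each fold contains both classes in roughly the original proportion.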

Phase 5: Testing and Evaluation
Precision is the ratio of true positives among all predicted positives. Recall is the ratio of true positives our models correctly predict among all actual positives. Accuracy is the proportion of all predictions that are correct. The formulas of precision, recall, and accuracy are shown in Equations (1)-(3):

Precision = TP / (TP + FP) (1)
Recall = TP / (TP + FN) (2)
Accuracy = (TP + TN) / (TP + TN + FP + FN) (3)

With precision and recall, we tried to evaluate the performance of the metrics by tuning the 'average' method. For multi-class classification, we observed the weighted average and macro and micro performances, to be discussed in Sections 4 and 6. We aim to produce results that have lower false positives.
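As a quick sketch, the three metrics can be computed directly from confusion-matrix counts (the counts below are made-up illustrative values):

```python
def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that were correctly predicted."""
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion counts for one classifier run.
tp, tn, fp, fn = 90, 850, 10, 50
```

With these counts, precision is high (few false positives) while recall is noticeably lower (many vulnerable classes missed), illustrating why both metrics are needed.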
In the above equations, TP, TN, FP, and FN refer to the counts of true positive, true negative, false positive, and false negative, respectively. Therefore, TP ∈ W, TN ∈ W, FP ∈ W, and FN ∈ W. The following section presents results obtained from model evaluations and answers our four research questions.

Results
Java classes that are affected by security vulnerabilities are treated as positive classes (labeled as 1) in both datasets. The unaffected classes are the negative classes (labeled as 0). Once we collected all the vulnerability information from the security logs, the Apache Tomcat dataset had 10,084 rows (8485 negative and 1719 positive) and Apache Struts2-core had 19,384 rows (19,328 negative and 44 positive) of relevant data. The raw datasets and the model implementation are available online (https://github.com/palmafr/MDPIData2022, accessed on 14 July 2022).
In the first experiment, we predicted whether the individual Java classes were associated with security vulnerabilities (yes or no). Then, in the following experiments, we attempted to predict the types and severity of the vulnerabilities.

Prerequisites: Model Tuning
We performed some ML model tuning. As written by the CFI team, overfitting is a common issue in Decision Trees [15], especially with datasets of over 10,000 rows. Thus, the decision trees were pruned (i.e., some of their leaves were snipped) to prevent the model from becoming too specific to the training data. The solvers in Logistic Regression were changed to obtain better performance. No changes were made to the Naive Bayes algorithm. Finally, for the XGBoost classifier, the number of estimators was set to 500, and the objective was set to 'binary:logistic' for the vulnerable prediction and to 'multi:softmax' for the severity and title predictions. After these prerequisites were in place, the research questions were answered.

Performance of Individual Learners (RQ1)
For RQ1, the feature sets generated by the algorithms are shown in Table 2. Furthermore, the features selected by RFE are displayed in rank order in Figure 6. The binary averaged metrics for predicting the vulnerable status of a class (0 or 1) are displayed in Table 3 for Tomcat and Struts, respectively. A stratified k-fold cross-validation technique with 10 folds was applied in this experiment for more accurate results.

The performance of the models is presented in Table 4. Since we handle the dataset imbalance issue by oversampling, precision is calculated using a micro average. Finally, the performance observed while predicting the title is presented in Table 5. The confusion matrices for the learners predicting the vulnerable status using the RFE-generated feature set are presented in Figures 7 and 8, and those for the SFS-generated feature set in Figures 9 and 10.

In answering RQ1, we can observe from Tables 3-5 that the XGBoost classifier outperforms every other model. The precision ratings for the XGBoost classifier on the best feature set are 92% in Tomcat and 28.5% in Struts for predicting vulnerable status (0 or 1). When predicting severity, precision reaches 74.9% in Tomcat and 53% in Struts, the latter being better than Struts' performance on the vulnerable target.
Finally, in predicting the title of the vulnerabilities, the model produced 55.5% precision in Tomcat and 24.9% precision in Struts. The lower performance in Struts is attributable to the smaller amount of vulnerability data in its security logs. The class imbalance is presented in Figure 5, which shows the positive and negative classes present in both datasets. Despite the class imbalance issue, the XGBoost classifier is the best among the individual learners, thus answering RQ1. From the confusion matrices mentioned above, it is observed that the number of false positives is low for the XGBoost classifier (34 and 29, respectively), while maintaining a high number of true positives.
Summary on RQ1: From the results, it can be observed that the XGBoost classifier performs better than other learners in predicting the relevant target variables in both datasets. Thus, XGBoost is regarded as the best classifier.

Performance of an Ensemble Learner (RQ2)
In RQ1, we observed the performances of the individual learners. In RQ2, we further explored whether we could perform better using a Stacking classifier (a multi-level ensemble learning classifier). Precision was calculated using micro average methods.

Feature Set
For this experiment, we choose the feature set that results in better model performance when fitted. From the tables in Section 4.2, both feature sets show similar performance; however, RFE-fitted features tend to perform better than SFS-fitted features in some cases. Hence, we select the RFE feature set for this experiment.

Stacking Classifier
Ensemble learning is achieved in this experiment using a Stacking classifier. The Stacking classifier has two levels: Decision Tree, Logistic Regression, and Naive Bayes at level 0 and the XGBoost classifier at level 1. The performance metrics of the stacking classifier are shown in Tables 6 and 7 for Tomcat and Struts, respectively. Additionally, the confusion matrix for the stacking classifier in predicting the vulnerable status in both datasets is presented in Figure 17. Figure 18 presents the confusion matrices for both datasets when predicting vulnerability severity. Finally, the AUC-ROC curves produced by the experiment are shown in Figure 19.

In RQ2, we employed the stacking ensemble learning technique to explore whether predictive performance can improve, using the RQ1 results as a benchmark. We observe that ensemble learning came close to the best-performing individual learner (XGBoost). This performance, however, is not promising: the confusion matrix presented in Figure 17 makes clear that the stacking classifier predicts a high number of false positives. In contrast, the confusion matrices of the XGBoost classifier, as seen in Figures 7, 8, 13, and 14, show almost negligible false positives and more precise predictions. It is important to note that, due to the lack of resources, a proper fine-tuning of the Stacking classifier was not performed; thus, the Stacking classifier may provide better results under different conditions.

Summary on RQ2: We explored whether an ensemble learning technique can provide better and more precise predictions than individual learners. The results show that the Stacking classifier does not outperform the XGBoost classifier (the best learner) in predictive precision.

Cross-Version Prediction (RQ3)
This experiment explores the performance of the best performing model from Section 4.2 (i.e., XGBoost Classifier) as it attempts to predict vulnerabilities for one version with the remaining versions as the training set.
Once we split the dataset, Apache Tomcat had six unique train and test dataset groups for each target variable, while Apache Struts2-core had ten. Similar to the experiments for RQ1 (as explained in Section 4.2), stratified 10-fold cross-validation was employed with micro averaging for precision.
The results for Tomcat are presented in Table 8 and for Struts in Table 9. The confusion matrices generated for the RQ3 experiments are presented in Figures 20-22 for Tomcat and Figures 23-27 for Struts, with vulnerable status and severity as the target variables to be predicted. The confusion matrices for the title are presented in the GitHub repository, as the large number of classes would make them unreadable here. Finally, the AUC-ROC curves are presented in Figures 28-30 for Tomcat (all target variables) and Figures 31-33 for Struts (all target variables).

RQ3 is about observing the models' performance when subject to cross-version prediction. From Table 8 for Apache Tomcat, we can observe that the first experiment for each target variable has an accuracy, precision, and recall of 100%. This is due to the lack of an actual vulnerable class in either the train or test dataset; thus, the ML model predicts everything to be 0 (i.e., not vulnerable). Beyond that, we see an uneven distribution of performance across the various train and test datasets. A similar argument applies for Struts (whose performances are displayed in Table 9), although in this case the test sets do contain positive classes: without enough positive classes in the training set, the model yields 0% precision and recall for most experiments. As mentioned earlier, Struts has fewer positive classes than Tomcat, making prediction with this dataset challenging. Since severity and title are multi-class targets, their precision was averaged using the micro average, which suits an imbalanced dataset such as ours.
Summary on RQ3: The results show the instability of prediction across different versions due to the dataset imbalance. The dataset imbalance renders some splits with no positive samples, which is less desirable for our experiments.

Cross-Project Prediction (RQ4)
RQ4 aims to explore the possibility of cross-project prediction while maintaining optimal model performance. In the current context, the two projects are Tomcat and Struts. This experiment also makes use of the feature set acquired from RFE, and cross-validation is performed using stratified 10-fold cross-validation. The results of this experiment are presented in Table 10. The confusion matrices produced in this experiment are presented in Figures 34 and 35, and the AUC-ROC curve for this experiment is shown in Figure 36.
The best learner (XGBoost) was used with the best feature set for this experiment. The results presented in Table 10 show that the best learner does not perform well in cross-project predictions, although performance is slightly better when predicting the vulnerable status. As discussed earlier in this section, this is mainly due to the lack of a data standard between the two systems: the two datasets use different vulnerability titles and severity labels. This proves to be a problem for cross-project prediction; however, it could be mitigated using a data normalization procedure before prediction.
Summary on RQ4: We can infer that the best learner (the XGBoost classifier) performs slightly better when trained on Struts and tested on Tomcat than vice versa. Cross-project prediction is, however, not practical without a data normalization procedure during data preparation.

Related Work
Over the past years, several studies have been conducted to identify vulnerabilities or bugs in software systems. This section introduces studies with similar scopes, such as vulnerability identification and prediction and identifying metrics useful in vulnerability prediction.
Harer et al. [24] predicted vulnerabilities using control-flow graphs, employing both simple and deep learning models. The scope of their experiments was to detect vulnerabilities in systems written in C/C++, with features extracted from the source code [24]. Also using deep learning, Pang et al. [25] predicted vulnerabilities in Java-based Android applications. Their experiment is closer to our current domain, although we did not intend to use deep learning for predictions; another similarity is their use of gradient descent for error handling. Feature selection in their experiments was performed using a statistical algorithm, and their models were ultimately able to predict the vulnerable status of a class using deep learning [25].
A study by Livshits et al. [26] helped to reduce security vulnerabilities using a non-ML technique. This study did not predict vulnerabilities but instead detected them using source code metrics from Java classes. Using static analysis techniques, the authors identified some of the most common security vulnerabilities, such as SQL injection and cross-site scripting, caused mainly by human error in the source code. They identified 29 vulnerabilities in widely used Java packages and libraries [26].
Slightly deviating from vulnerability prediction, Hammouri et al. [27] predicted bugs in the source code. Their 2018 study used three supervised ML algorithms, namely Decision Tree (DT), Naive Bayes (NB), and Artificial Neural Networks (ANNs), to predict future defects from historical information on three unspecified datasets. The study concluded that the decision tree classifier outperformed the other ML models [27]. To analyze the impact of source code metrics on vulnerability occurrence, Gupta and Saxena [28] proposed a study in 2018, which indicated that some object-oriented metrics, such as CBO and LOC, are more impactful than others, such as DIT [28].
A similar study by Goyal et al. [29] tried to establish a connection between object-oriented metrics and the occurrence of bugs in a system. The authors tested the effectiveness of their bug prediction using a KNN classifier and a linear regression model, concluding that linear regression performs better than KNN. Accuracy was the performance evaluation metric in this study, and the highest recorded accuracy across both the individual and the combined datasets was 76.27% [29].
Shar and Tan [30] proposed to detect cross-site scripting and SQL injection, two common vulnerabilities caused by the absence of input validation in a web application. The authors used static code attributes directly related to input validation, together with code patterns, to classify instances as vulnerable or not [30].
Yosifova [31] explored the use of ensemble learning techniques in predicting vulnerabilities, using a Random Forest classifier and a Gradient Boost classifier, and concluded that the Gradient Boost classifier performed better [31]. Bilgin et al. [32] conducted a study to predict the vulnerable status of function-level source code using Abstract Syntax Trees (ASTs). Their experiment used the source code itself instead of source code metrics: the text of the source code was tokenized and then fed to ML models to predict vulnerable parts [32].

Discussion
This study aims to explore ways to predict security vulnerabilities using static source code metrics and to see whether there is any direct relation between code metrics and the occurrence of a vulnerability. The systems in focus are two open-source systems, Apache Tomcat and Apache Struts2-core, chosen because their relevant security log data were accessible. The experiments conducted proved to be stepping stones toward identifying vulnerabilities in the source code.
However, there was a discrepancy in the Tomcat data: the titles of the vulnerabilities were redundant. For example, 'Remote Code Execution' appears three times in the dataset with minor variations, turning it into three classes instead of one. We did not handle this in our cleanup phase because we wanted to experiment with the titles exactly as they appear in the security logs. Better results might be obtained if the titles were grouped into vulnerability categories instead of being treated as unique vulnerabilities; this would also improve cross-project prediction, as discussed later in this section.
Aside from the results provided above, two additional experiments were conducted: one to explore whether under-sampling could outperform over-sampling of the dataset, and one to explore the viability of weighted performance metrics as a relevant standard for evaluation.
Confusion matrices and AUC-ROC curves are provided only for the answered RQs; for the under-sampling and weighted-performance experiments, the performance observations are the only deliverable.

Performance of the Classifiers Using Under-Sampling
The results of the RQ1 experiments, repeated on an under-sampled dataset, are provided in Tables 11-13 for vulnerable, severity, and title, respectively. The results for the Stacking classifier on the under-sampled dataset are presented in Tables 14 and 15 for Tomcat and Struts, respectively, the results of the RQ3 variant on the under-sampled dataset in Tables 16 and 17 for Tomcat and Struts, respectively, and the results of the under-sampled cross-project prediction experiment in Table 18.
From the tables mentioned above, it is observable that the under-sampled dataset performs worse than the over-sampled one. This is likely because over-sampling equalizes the numbers of positive and negative instances by duplicating and randomizing samples, whereas under-sampling removes samples, leaving less data to train on than before. The best learner from RQ1 (see Section 4.2) has very low precision in Struts due to the lack of training samples. Below, we discuss the findings for the under-sampled dataset. When the ML models are trained almost exclusively on negative samples and asked to predict positive instances, they will evidently perform poorly. For Tomcat, the models are trained with few positive samples, while in Struts that count is near zero. Table 11 shows a severe precision drop in both datasets.
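The contrast between the two balancing strategies can be made concrete with a stdlib-only sketch. The `random_oversample` and `random_undersample` helpers below are illustrative implementations written for this discussion, not the study's actual code, and the 95:5 class split is a toy example:

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=42):
    """Duplicate minority-class samples at random until classes are equal."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_s, out_l = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        out_s += [rng.choice(pool) for _ in range(target - n)]
        out_l += [cls] * (target - n)
    return out_s, out_l

def random_undersample(samples, labels, seed=42):
    """Discard majority-class samples at random until classes are equal."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_s, out_l = [], []
    for cls in counts:
        pool = [s for s, l in zip(samples, labels) if l == cls]
        out_s += rng.sample(pool, target)
        out_l += [cls] * target
    return out_s, out_l

# Toy imbalance: 95 non-vulnerable (0) vs. 5 vulnerable (1) classes.
X = list(range(100))
y = [0] * 95 + [1] * 5

_, y_over = random_oversample(X, y)
_, y_under = random_undersample(X, y)
assert Counter(y_over) == {0: 95, 1: 95}   # dataset grows by duplication
assert Counter(y_under) == {0: 5, 1: 5}    # 90% of the training data is lost
```

The final two assertions show why under-sampling hurts here: balance is achieved by shrinking the training set from 100 to 10 samples, whereas over-sampling keeps every original sample.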
For multi-class target variables such as severity, Table 12 shows that accuracy, precision, and recall are identical. This is because the precision and recall values are micro-averaged while the data imbalance is handled poorly; consequently, the models did not perform better in predicting the severity of the vulnerabilities. As for the title of the vulnerabilities, Table 13 shows that performance is again severely affected, which is more expected in this case, as there are twenty-three and seven unique titles for Tomcat and Struts, respectively. The table also shows that performance in Struts is poorer than in Tomcat, for the same reason: training on negative samples with (almost) no positive samples to predict positive samples. Table 15 shows that vulnerable has a precision of 100%, which is the case where the test dataset contains only negative classes; the classifier predicted all samples to be negative. The classifier's performance in predicting the other target variables is the same as before.
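The "identical accuracy, precision, and recall" effect is a general property of micro-averaging in single-label multi-class prediction: every misclassification is simultaneously one false positive (for the predicted class) and one false negative (for the true class), so the three scores collapse to the same value. A stdlib-only sketch with illustrative severity labels (not our data):

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall, and accuracy for single-label data."""
    classes = set(y_true) | set(y_pred)
    tp = sum(t == p for t, p in zip(y_true, y_pred))
    # Summed over classes, each misclassification counts once as an FP
    # (predicted class) and once as an FN (true class).
    fp = sum(p == c and t != c for c in classes for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for c in classes for t, p in zip(y_true, y_pred))
    return tp / (tp + fp), tp / (tp + fn), tp / len(y_true)

y_true = ["High", "Low", "Moderate", "Low", "High", "Important"]
y_pred = ["High", "Moderate", "Moderate", "Low", "Low", "Important"]
p, r, a = micro_scores(y_true, y_pred)
# Micro precision == micro recall == accuracy for single-label tasks.
assert p == r == a == 4 / 6
```

This is why Table 12 adds no information beyond accuracy: the micro-averaged columns are mathematically forced to match it.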

Cross-Version Prediction
When predicting cross-version on the under-sampled dataset, the precision is zero or near zero most of the time due to the lack of sufficient training data to make precise predictions. This can be observed from Tables 16 and 17 for Tomcat and Struts, respectively. The case is similar for severity and title, as they are micro-averaged.
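The cross-version split itself can be sketched as follows; the `rows` structure and version labels are hypothetical stand-ins for the per-version metric data, and the point of the final assertion is how few positive labels survive into the training side:

```python
# Each row: (version index, feature vector, vulnerable label); toy data only.
rows = [
    (1, [10, 2], 0), (1, [40, 5], 1),
    (2, [12, 3], 0), (2, [44, 6], 0),
    (3, [11, 2], 0), (3, [50, 7], 1),
]

def cross_version_split(rows, test_version):
    """Train on all versions before `test_version`, test on `test_version`."""
    train = [(x, y) for v, x, y in rows if v < test_version]
    test = [(x, y) for v, x, y in rows if v == test_version]
    return train, test

train, test = cross_version_split(rows, test_version=3)
train_positives = sum(y for _, y in train)
# After under-sampling, splits like this can be left with zero (here: one)
# positive training label, which is why precision collapses to near zero.
assert len(train) == 4 and len(test) == 2 and train_positives == 1
```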

Cross-Project Prediction
Finally, the most relevant prediction between the two datasets is the vulnerable status in the cross-project setup. Table 18 shows that training on Struts and testing on Tomcat yields slightly better results than the reverse experiment. As discussed earlier, severity and title yield poorer cross-project performance due to the lack of a data normalization phase during data manipulation.
Thus, under-sampling proved ineffective in our experiments and is therefore confined to this discussion.

Weighted-Performance of the Classifiers
In addition to the RQs and the experiments on the under-sampled dataset, another set of experiments was conducted with the weighted average as the averaging method for precision and recall. The results of these experiments do not contribute to the research questions but serve as optimization experiments. The results relevant to RQ1 are provided in Tables 19-21 for predicting vulnerable, severity, and title, respectively. The results relevant to RQ2 with the weighted average are shown in Tables 22 and 23 for Tomcat and Struts, respectively. We have also recorded the results of cross-version experiments conducted using the weighted average in Tables 24 and 25 for Tomcat and Struts, respectively. Finally, the results for cross-project prediction are presented in Table 26.
This method assigns a weight to each class present in a target variable. For precision, the weighted average is computed by weighting each class's precision by its support, i.e., the number of true instances of that class. In the present datasets, the number of negative instances is very high compared to positive instances, as seen earlier in Figure 5, so the large number of correct negative predictions (true negatives) boosts the average into what looks like good precision/recall. Supporting this, the precision and recall values in these experiments are high, averaging around 90-95%, but they do not reflect the false positives predicted by the model: the average is high because one class is predicted far more precisely than the others, which inflates the overall score.
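A stdlib-only sketch of this effect, with an illustrative 95:5 class split: a model that flags vulnerabilities with only 20% precision still receives a weighted-average precision above 90%, because the majority class dominates the weights.

```python
from collections import Counter

def weighted_precision(y_true, y_pred):
    """Per-class precision weighted by class support (true instance count)."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        predicted = sum(p == cls for p in y_pred)
        correct = sum(p == cls and t == cls for t, p in zip(y_true, y_pred))
        cls_precision = correct / predicted if predicted else 0.0
        score += (n / total) * cls_precision
    return score

# Toy imbalance: 95 non-vulnerable (0) vs. 5 vulnerable (1) samples.
y_true = [0] * 95 + [1] * 5
# The model flags 10 samples as vulnerable: 2 real and 8 false positives,
# so the positive-class precision is only 2/10 = 0.2.
y_pred = [1] * 8 + [0] * 87 + [1] * 2 + [0] * 3

# Weighted average: 0.95 * (87/90) + 0.05 * 0.2 ≈ 0.93 despite the 8 FPs.
assert round(weighted_precision(y_true, y_pred), 2) == 0.93
```

The final assertion illustrates the text above: the 8 false positives barely dent the weighted score, which is why we did not adopt it as an evaluation standard.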
Thus, the weighted average did not prove to be the right metric to evaluate the performance of our models.

Threats to Validity
This study focused on predicting security vulnerabilities in two open-source projects. To minimize threats to external validity, we considered seven versions of Apache Tomcat and eleven versions of Apache Struts2-core; however, other systems need to be analyzed to generalize our findings further. To minimize threats to internal validity, we experimented with several ML models and two feature selection methods; the results reported in this study could nonetheless be improved with more sophisticated ML models and feature selection techniques. Additionally, vulnerability predictions are made at the class level, whereas finer-grained method- or line-level predictions would be more useful for developers. To minimize threats to reliability and repeatability, the dataset and model implementations are made available online at https://github.com/palmafr/MDPIData2022 (accessed on 14 July 2022).

Conclusions
In summary, we can answer the initial question, "Are source code metrics good enough to predict software security vulnerabilities?", in the affirmative: static source code metrics are good enough to predict security vulnerabilities in software systems.
We determined the best feature set by employing two feature selection techniques and examining the performance results. To find the best-performing model, experiments were carried out using four supervised ML models. Our results suggested that the XGBoost Classifier (XGB) delivered better results in predicting security vulnerabilities than Decision Trees (DT), Naive Bayes (NB), and Logistic Regression (LR). The presence of a vulnerability is a binary value (yes/no), whereas severity has four classes (High, Important, Moderate, and Low). Based on its performance, the XGBoost classifier was selected as the best-performing model among the four, which answers RQ1 on choosing the best classifier. RQ2 employed ensemble learning in the form of a Stacking classifier in the hope of better performance than the individual learner; however, this was unsuccessful based on the presented results.
Nevertheless, the Stacking classifier might perform better with more fine-tuned parameters; this was a challenge in the current experiment because limited resources restricted computing performance and training time. RQ3 explored the performance of the best learner in cross-version prediction, which resulted in variable prediction performance. However, we observed that cross-version prediction was viable for the last iteration of experiments, i.e., training the model on versions 1 through n-1 and testing on version n. Finally, RQ4 explored the possibilities of cross-project prediction using the best learner. As discussed in Section 6, the model would be more accurate if the data were normalized between the datasets.
The models and feature sets used in this experiment can be enriched for better performance: more sophisticated feature selection methods can be applied, and fine-tuned ensemble methods, simple multi-layer perceptrons, and neural networks could yield more accurate predictions. We also discussed normalizing vulnerability titles between different datasets to an OWASP (Open Web Application Security Project) standard, which would increase predictive performance in cross-project predictions. Moreover, the results obtained in this experiment apply only to Apache Tomcat and Apache Struts2-core, and more experiments are required with different systems. Furthermore, the dataset imbalance could be handled with a different over-sampling technique instead of random over-sampling. There is also further scope to conduct experiments with closed-source systems, with the required permissions, to compare the quality of the datasets.