Interpretable Software Defect Prediction from Project Effort and Static Code Metrics

Abstract: Software defect prediction models enable test managers to predict defect-prone modules and assist with delivering quality products. A test manager needs to identify the attributes that can influence defect prediction and should be able to trust the model outcomes. The objective of this research is to create software defect prediction models with a focus on interpretability. Additionally, it aims to investigate the impact of size, complexity, and other source code metrics on the prediction of software defects. This research also assesses the reliability of cross-project defect prediction. Well-known machine learning techniques, such as support vector machines, k-nearest neighbors, random forest classifiers


Introduction
Software testing demands a significant allocation of time, budget, and resources, and it becomes expensive when the source code contains multiple faults that necessitate additional retesting. Additionally, test managers face the challenge of determining which modules to test, as allocating the same level of effort to all modules may not be practical [1]. Identifying defective modules therefore becomes a significant task during test planning. This has led to the development of automated software defect prediction (SDP) processes that utilize various metrics derived from historical information. Consequently, defect prediction using machine learning techniques has emerged as a popular research area, aiming to automate the manual efforts involved in identifying different types of defects in software applications [2,3].
It is a daunting task to identify which attributes are good predictors of defects, and many research studies have sought to identify the metrics to be used in SDP models along with an efficient feature selection process [4,5]. The performance of defect classification models can also be hindered by features that are redundant, correlated, or irrelevant. To overcome this problem, researchers have often used feature selection techniques that either transform the features or select a subset of them to improve the effectiveness of the classification models [6].
Test managers may find it difficult to trust the results generated by predictive models because they have a limited understanding of the internal workings of these systems. Before allocating testing resources to modules identified as error-prone [7], test managers need to comprehend the rationale behind the predictions. The development of an interpretable defect prediction approach, coupled with the ability to understand the static code metrics that contribute to the identification of defect-prone modules, will allow test managers to provide sufficient data for training the model. By choosing relevant static code metrics obtained from previous projects of a similar nature, test managers can effectively facilitate testing arrangements for new applications.
Considering the aforementioned points, our research aims to contribute to the research community by addressing the following questions:
RQ1: Can software defect prediction models generated from different projects with varied sample sizes yield consistent results?
RQ2: Does it significantly impact the results if highly correlated independent features are removed from the set of independent variables when developing software defect prediction models?
RQ3: Can we rely on prediction models developed using cross-project metrics compared to models built from individual projects?
RQ4: Can we consistently interpret software defect prediction models after applying SMOTE techniques to balance unbalanced data?
The main contribution of this paper is the development of software defect prediction models that emphasize interpretability and the impact of various source code metrics, including size and complexity. One objective was to verify how the defect prediction model performs when trained on individual projects compared to training with cross-project information. Another major contribution of this work is an analysis of the performance and interpretability of the SDP models when highly correlated independent features are removed compared to when they are retained. This work also compares the performance of SDP models built on the original imbalanced dataset with models built on balanced data obtained by oversampling the minority class. Finally, the best-performing model was interpreted using the LIME and SHAP techniques.
This paper is organized into several sections. A literature survey on defect prediction using machine learning techniques is presented in Section 2. This is followed by the methodology used in this paper in Section 3. The developed SDP models and results are presented in Section 4. Section 5 summarizes the results analysis, presents a discussion of these, and considers threats to the validity of our work. Finally, the conclusions and future work are described in Section 6.

Literature Review
A considerable amount of literature has been published on the software defect prediction problem over the last few decades. When the NASA defect data repository became publicly available, this dataset became a popular reference for many researchers [8][9][10].
To build these SDP models, researchers have utilized various regression [11] and classification techniques, such as support vector machine [12], the k-nearest neighbor algorithm [13], random forest algorithms [14], deep learning methods incorporating artificial neural networks [15], recurrent neural networks [16] and convolutional neural networks [17], ensemble methods [18], and the transfer learning framework [19]. Like these studies, our SDP models will be built using existing machine learning algorithms to support our research questions.
Several studies have devoted time to cleaning the data due to low and inconsistent data quality in SDP research [2], and one of the criteria for cleaning data was handling outliers. One of the common techniques applied was removing outliers based on the interquartile range (IQR) [20,21]. Outliers are data points that are significantly different from the remaining data, and the IQR refers to the values that reside in the middle 50% of the scores [22].
The main challenge often lies in identifying the modules that are prone to defects rather than the non-defective modules, since defective modules are usually far fewer than non-defective ones. Various techniques have been analyzed in research studies to address this issue, including random undersampling (RUS), random oversampling (ROS), and the synthetic minority oversampling technique (SMOTE) [23,24]. These techniques aim to balance the data by giving equal weight to the minority class (defective modules) and the majority class (non-defective modules). In this research, the SMOTE technique was applied to balance the dataset because the maximum defect ratio in this dataset was 28%, which tends to yield high accuracy from predicting non-defective modules without identifying the defective ones. SMOTE over-samples the minority class by creating "synthetic" examples rather than by oversampling with replacement. Since its efficacy lies in this capability to create new instances instead of merely duplicating existing ones, SMOTE has been applied in various defect prediction studies [25][26][27], and it has demonstrated considerable success in the literature in overcoming the difficulties associated with imbalanced classes.
Aleem et al. [8] explored various machine learning techniques for software bug detection and conducted a comparative performance analysis to assess how effectively these algorithms detect software bugs. Once the machine learning techniques are identified, it becomes crucial to determine which features should be selected for developing the SDP models. In this regard, Balogun et al. [5] developed aggregation-based multi-filter feature selection methods specifically for defect prediction.
One challenge when implementing a new software solution is obtaining the necessary historical information from the software repository of similar projects. In some cases, the available historical data may not be sufficient for training purposes, particularly when dealing with similar types of projects. To address this issue, cross-project defect prediction (CPDP) has emerged as a topic of interest in recent SDP research [28]. CPDP involves leveraging data from different projects to predict defects in a new project. Challenges with CPDP include variations in project size, programming language, and development process.
It is often observed that multicollinearity can impact the performance of machine learning models. To overcome this limitation, various techniques, such as principal component analysis (PCA) and ridge regression, have been applied in SDP studies [29,30]. Yang and Wen [30] employed lasso regression and ridge regression in their study on developing SDP models, which improved performance. In this study, we conduct a comparative study in which classification models built after removing the highly correlated features are evaluated against models that retain them.
In recent years, the demand for explainable machine learning models has increased with the growing interest in explainable artificial intelligence (XAI). Gezici and Tarhan [31] developed an interpretable SDP model using the model-agnostic techniques of SHAP, LIME, and ELI5, which they applied to a gradient boosting classifier. Our research aims to develop interpretable SDP models using four different classifiers instead of a single one so that explainability can be compared across ML algorithms.
Jiarpakdee et al. [32] suggested that research practitioners need to invest more effort in investigating ways to enhance the comprehension of defect prediction models and their predictions.

Methodology
The implementation of the SDP model in this study is depicted in Figure 1. This process involves multiple steps, including data collection, feature selection, data preprocessing, model development using selected ML algorithms, and the application of model-agnostic techniques. As it is important to interpret the model, the application of model-agnostic techniques is highlighted in yellow. A detailed description of the SDP model methodology used in this study is given below.

Data Collection
We selected five different files, namely, jm1, pc1, kc1, kc2, and cm1, from the PROMISE repository [33]. These datasets incorporate McCabe and Halstead static code metrics. The projects list modules from various programs written in the C or C++ programming languages. Each of these selected files contains 21 independent variables, referred to as features in this study, and one target variable. The target variable indicates whether the selected module is defective or not. Some of the common measures include the total lines of code in the program, McCabe's cyclomatic complexity, and Halstead's effort. The statistics of these selected datasets are presented in Table 1, followed by descriptions of each feature in Table 2.

Feature Selection and Preprocessing Steps
The process began with loading the stored data from the PROMISE repository. Subsequently, an exploratory data analysis process was conducted, and the features were cleaned and selected using the techniques described in Table 3. The categorical values of the target variable "defects" were mapped to 0 or 1, representing non-defective and defective modules, respectively. Instances that violated referential integrity constraints were removed. An instance was removed when any of the following conditions held:
(a) The total line of code is not an integer number.
(b) The program's cyclomatic complexity is greater than the total operators plus 1 [34].
(c) Halstead's sum of total operators and operands is 0.

Table 3. Data cleaning and feature selection techniques.
Outlier removal: Outliers are values that lie far below the 1st quartile (Q1) or far above the 3rd quartile (Q3). Records with values below (Q1 − 1.5 × IQR) or above (Q3 + 1.5 × IQR) were dropped.
Removal of duplicates: Duplicated observations were removed from the dataset.
Removal of highly correlated features, except for module size and effort metrics: The correlation between independent features was calculated, and the attributes with a correlation of more than 70% were removed in the first approach demonstrated in this study.
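The three integrity conditions (a) to (c) can be expressed as a row filter. The following is a minimal sketch using pandas, not the authors' code; the column names `total_Op` and `total_Opnd` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def drop_inconsistent_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows that violate any of the integrity conditions (a)-(c)."""
    non_integer_loc = df["loc"] % 1 != 0                       # (a) loc is not an integer
    complexity_too_high = df["v(g)"] > df["total_Op"] + 1      # (b) v(g) > total operators + 1
    no_tokens = (df["total_Op"] + df["total_Opnd"]) == 0       # (c) operators + operands == 0
    return df[~(non_integer_loc | complexity_too_high | no_tokens)]

# Toy modules (hypothetical values): only the first row satisfies all conditions.
modules = pd.DataFrame({
    "loc":        [10.0, 12.5, 8.0, 5.0],
    "v(g)":       [2.0, 1.0, 9.0, 1.0],
    "total_Op":   [15.0, 20.0, 3.0, 0.0],
    "total_Opnd": [9.0, 11.0, 2.0, 0.0],
})
clean = drop_inconsistent_rows(modules)
print(clean.index.tolist())
```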
To validate the reliability of the cross-project defect prediction model, a separate dataset was created by merging all the selected files and stored in a Python dataframe named CP. A new field called "project" was introduced to identify whether the project information can influence the prediction. The projects were assigned numbers from 1 to 5 for identification purposes.
The feature selection process involves selecting a subset of the original dataset by removing irrelevant or redundant features from the original feature space. Feature selection algorithms have been used in various fields, including defect prediction [35][36][37][38]. The selection of feature subsets significantly affects the complexity and performance of classification algorithms. The challenge with the feature selection process is that too many features may increase the computational cost of the classifier, while too few features may reduce the model performance [39]. Feature selection methods have two major benefits for classification tasks: reducing the data dimensionality and maintaining or improving the performance of the classifier.
Two different approaches were employed during the feature selection process in this study. Both approaches cleaned the dataset by removing records with missing variables, duplicated entries, outliers, and implausible and conflicting feature values. The outliers were removed so that the models were trained without extreme values, reducing noise for better performance. The outlier removal technique kept only the values between Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, as shown in Table 3.
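As an illustration of the IQR rule, the sketch below keeps only the rows whose values fall inside the 1.5 × IQR fences. It is a minimal example on toy data; applying the rule to every listed column at once is an assumption, since the text does not state which features were screened.

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every column."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    inside = ((df[columns] >= lower) & (df[columns] <= upper)).all(axis=1)
    return df[inside]

# Toy data (hypothetical): the module with loc = 400 lies far above the upper fence.
data = pd.DataFrame({
    "loc": [10, 12, 11, 13, 12, 400],
    "e":   [5.0, 6.0, 5.5, 6.5, 6.0, 7.0],
})
filtered = remove_iqr_outliers(data, ["loc", "e"])
print(len(data), "->", len(filtered))
```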
In addition to the above cleaning criteria, the first approach involved removing highly correlated features from the datasets. Researchers have attempted to select features from the original feature space that have a high dependency on the class labels and low redundancy with other features [35]. To ensure low redundancy among the independent features, every feature except 'effort' that had a correlation higher than 70% with another feature was dropped; the output variable (defective versus non-defective) was excluded from this comparison. Figure 2 shows the pairwise correlation matrix of the independent variables, or features, of project pc1. Only the independent features were considered in this correlation matrix because the objective was not to remove features that are highly correlated with the output variable but to remove duplicated information carried by two different variables. Removing highly inter-correlated features ensures that the features do not contain the same information, so that the impact of each feature on the model prediction is interpretable and prominent. From this correlation matrix, only the features whose absolute correlation with the other features did not exceed 70% were retained for further comparison under this feature selection approach. The same technique was applied to the other projects. The alternative approach followed the same cleaning steps as the first approach but retained the highly correlated features during SDP model development, to identify whether keeping them impacts the model performance.
The common features among all datasets were e, loc, l, ev(g), and lOComment when the highly correlated features were removed from each dataset. The selected features, along with the project names, are shown in Table 4. It can be observed that the maximum number of nine features, compared to the original 20 features, was available in the cross-project dataset. At the same time, kc2 and jm1 had the minimum number of selected features.
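A correlation filter of this kind can be sketched as a greedy pass over the features. This is one plausible reading of the procedure rather than the authors' implementation: the order in which features are visited, the choice of Pearson correlation, and the protected set `("loc", "e")` for module size and effort are assumptions.

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.70,
                    protected=("loc", "e")) -> pd.DataFrame:
    """Greedily keep features whose |correlation| with every kept feature is <= threshold."""
    corr = df.corr().abs()
    kept = [c for c in df.columns if c in protected]   # protected features are always kept
    for col in df.columns:
        if col in protected:
            continue
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    return df[[c for c in df.columns if c in kept]]    # preserve original column order

# Synthetic features: "n" is nearly a copy of "loc"; "e" is correlated but protected.
rng = np.random.default_rng(0)
loc = rng.normal(50, 10, 200)
frame = pd.DataFrame({
    "loc": loc,
    "n": loc * 3 + rng.normal(0, 1, 200),
    "e": loc * 2 + rng.normal(0, 1, 200),
    "ev(g)": rng.normal(3, 1, 200),
})
reduced = drop_correlated(frame)
print(list(reduced.columns))
```

Only the independent features are passed in, so the target variable never takes part in the comparison.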

Table 4. Selected features by file name.
File Name: Selected Features
kc2: loc, ev(g), l, e, lOComment
cm1: loc, ev(g), l, e, lOCode, lOComment, lOBlank
pc1: loc, v(g), ev(g), l, e, lOComment, lOBlank
kc1: loc, ev(g), l, e, lOComment, lOBlank
jm1: loc, l, i, e, lOComment
Cross Project: loc, v(g), ev(g), iv(g), l, i, e, lOComment, project

To develop SDP models, the data need to be divided between training and testing such that enough samples are available for training and validating the model. Some of the common approaches are an 80:20 or 70:30 split, and cross-validation. In several existing defect prediction studies [40][41][42], 70:30 splits were considered. In this work, each dataset was split into 70% for training and 30% for testing, which gives a balance between sufficient data for training and a reasonable number of records for evaluation. Finally, since this approach has been used in earlier research, a 70:30 split makes it easier to compare and reproduce results across different studies.
To standardize the features by removing the mean and scaling to unit variance, StandardScaler from the Python scikit-learn library was used [43]. Furthermore, since the dataset was imbalanced, it was crucial to employ techniques to prevent the SDP model from being biased towards predicting only the majority class (non-defective, with a value of 0). Consequently, in the training dataset, the synthetic minority oversampling technique (SMOTE) was applied to match the number of instances in the minority class (defective) to the number of instances in the majority class (non-defective), while the testing dataset remained unchanged for validation purposes.
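The split and standardization steps can be sketched with scikit-learn as follows. The data are synthetic stand-ins for a cleaned dataset of 198 modules, and fitting the scaler on the training portion only, as well as the `random_state` value, are assumptions; the text does not state how the scaler was fitted.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a cleaned dataset: 198 modules, 5 features.
rng = np.random.default_rng(42)
X = rng.normal(loc=20.0, scale=5.0, size=(198, 5))
y = (rng.random(198) < 0.2).astype(int)

# 70:30 split; with 198 rows this yields 138 training and 60 testing rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Assumption: learn mean and variance from the training rows only,
# then apply the same transformation to the held-out rows.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
print(X_train_std.shape, X_test_std.shape)
```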
The distribution of samples with defective versus non-defective modules in the finalized training dataset is presented in Table 5. For example, the kc2 dataset initially had 21 independent features with a total of 522 records or instances. After cleaning the data, the number of available instances in the same dataset was reduced to 198. These records were then split between training and testing with a 70:30 split, allocating 138 instances to training with the remaining 60 instances kept for testing. In the training dataset, 30 of these 138 instances were classified as defective and the remaining 108 were marked as non-defective. Subsequently, SMOTE gave the data an equal distribution of 108 records for both defective and non-defective modules, so the oversampled training dataset contained 216 instances in total. SMOTE was not applied to the testing datasets because the goal is to test the overall performance on the original data without any manipulation.
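To make the oversampling step concrete, the following is a simplified NumPy sketch of the SMOTE idea: each synthetic point is an interpolation between a minority sample and one of its nearest minority neighbours. It is not the implementation used in the study; the neighbour count `k=5`, the binary-label assumption, and the synthetic feature values are illustrative choices.

```python
import numpy as np

def smote_oversample(X, y, minority=1, k=5, seed=0):
    """Add synthetic minority rows until both classes of a binary problem are equal in size."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    n_needed = int((y != minority).sum() - len(X_min))
    if n_needed <= 0:
        return X, y
    # Pairwise distances among minority samples; a point is not its own neighbour.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a random base sample and one of its k neighbours for each synthetic row.
    base = rng.integers(0, len(X_min), size=n_needed)
    chosen = neighbours[base, rng.integers(0, neighbours.shape[1], size=n_needed)]
    # Interpolate at a random position on the segment between the two points.
    gap = rng.random((n_needed, 1))
    synthetic = X_min[base] + gap * (X_min[chosen] - X_min[base])
    X_bal = np.vstack([X, synthetic])
    y_bal = np.concatenate([y, np.full(n_needed, minority, dtype=y.dtype)])
    return X_bal, y_bal

# Class counts mirror the kc2 training split: 30 defective, 108 non-defective.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(138, 4))
y_train = np.array([1] * 30 + [0] * 108)
X_bal, y_bal = smote_oversample(X_train, y_train)
print(np.bincount(y_bal))
```

The original rows are left untouched and only the training data are passed to the function, so the test set keeps its natural class ratio.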

Applied Machine Learning Algorithms
After the data preprocessing steps, we selected widely used machine learning (ML) algorithms that have been applied in previous SDP studies on classification problems [8], namely, support vector machine (SVM), k-nearest-neighbors (KNN), random forest classifiers (RF), and artificial neural networks (ANNs). To improve the performance of each of these models, hyperparameter tuning [44,45] was performed using the grid search method [43] with 3-fold cross-validation on the training dataset. The parameters that were tuned depend on the algorithm applied.
For the SVM model, the tuned parameters were 'C', with values of 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, and 14, 'gamma', with values of 1, 0.1, 0.01, and 0.001, and 'kernel', with values of 'linear', 'poly', 'rbf', and 'sigmoid'.
The KNN algorithm depends on the parameter n_neighbors, which was optimized over the values 1, 2, 3, 4, 5, 10, 15, and 20.
The random forest algorithm has several parameters that can affect its performance. We tuned 'n_estimators', with values of 3, 10, and 12, 'max_depth', with values of 10, 20, 30, 40, and 100, 'min_samples_leaf', with values of 1, 2, 4, and 8, and 'criterion', with values of 'gini', 'entropy', and 'log_loss'.
Finally, for the ANN algorithm, the epochs, batch size, and hidden layers were tuned to obtain the best performance.
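The grid search described above can be sketched with scikit-learn's `GridSearchCV`. The parameter grids are transcribed from the text; the synthetic data, the `accuracy` scoring choice, and running only the KNN search are assumptions made to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Search spaces transcribed from the text (the ANN settings are not enumerated there).
param_grids = {
    "svm": {
        "C": list(range(4, 15)),
        "gamma": [1, 0.1, 0.01, 0.001],
        "kernel": ["linear", "poly", "rbf", "sigmoid"],
    },
    "knn": {"n_neighbors": [1, 2, 3, 4, 5, 10, 15, 20]},
    "rf": {
        "n_estimators": [3, 10, 12],
        "max_depth": [10, 20, 30, 40, 100],
        "min_samples_leaf": [1, 2, 4, 8],
        "criterion": ["gini", "entropy", "log_loss"],
    },
}

# Synthetic, imbalanced stand-in for a training set (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# 3-fold cross-validated grid search over n_neighbors.
search = GridSearchCV(
    KNeighborsClassifier(), param_grids["knn"], cv=3, scoring="accuracy"
)
search.fit(X, y)
best_k = search.best_params_["n_neighbors"]
print("best n_neighbors:", best_k)
```

The same pattern applies to the other two grids by swapping in `SVC` or `RandomForestClassifier` with the corresponding dictionary.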
Next, the paper attempts to analyze and compare various methodologies to tune the defect predictors. Once the models were developed, LIME was applied for local prediction on the best model for each of the projects and SHAP was applied for global explanations on the same classifiers. Finally, all the projects were combined into a single dataset to examine the impact of cross-project data and the obtained results were validated.
Support vector machine (SVM) is a supervised ML algorithm that defines a hyperplane to separate the data most optimally, ensuring a wide margin between the hyperplane and the observations [8,46].
K-nearest-neighbors (KNN) is a supervised ML algorithm that uses proximity to classify or predict the grouping of individual data points and is valued for its efficiency [47,48].
Random forest classifier (RF) is a supervised ensemble ML algorithm that incorporates a combination of tree predictors, where each tree depends on the values of a random vector sampled independently with the same distribution for all trees in the forest [49,50].
Artificial neural networks (ANNs) [15] are collections of neurons arranged in layers that are stacked one after another. An ANN model consists of an input layer and hidden layers, and the prediction is produced in the output layer.

Model Interpretability
LIME and SHAP techniques were employed as both of these techniques have been popular in the field of machine learning explainability.Details of these techniques are explained below.

LIME (Local Interpretable Model-Agnostic Explanations)
The LIME [51] technique proposed by Ribeiro et al. constructs a surrogate sparse linear model around each prediction to elucidate the workings of the black box model within that local context [52]. The LIME model can work for tabular data, text, and images. The software defect prediction models utilized in this study included tabular data, and LIME was applied to explain the tabular data on the best-performing model. The LIME equation [51] for interpretability can be expressed as follows:

$$\xi(x) = \arg\min_{g \in G} \; L(f, g, \pi_x) + \Omega(g) \qquad (1)$$

In Equation (1), the term $\xi(x)$ represents the explanation for the prediction made by the SDP model for the instance $x$. It aims to select an interpretable explanation model $g$ for software defect prediction from the class $G$, minimizing a combination of fidelity to the original SDP model $f$ and complexity. The fidelity function $L(f, g, \pi_x)$ measures how well $g$ approximates $f$ in the local neighborhood of $x$ defined by $\pi_x$, while $\Omega(g)$ quantifies the complexity of $g$. By minimizing this combined objective, LIME ensures the explanation $\xi(x)$ faithfully represents $f$ while being interpretable [53].
To illustrate the LIME procedure, consider an SDP model based on random forest as the black box model. To explain a specific record or instance, LIME first generates random instances by perturbing the data surrounding the instance of interest [54]. Next, LIME uses the black box model, here the random forest SDP model, to generate predictions for these random instances. Afterward, LIME constructs a local regression model from the generated instances and their black box predictions. Finally, the coefficients of the regression model indicate the contribution of each metric to the prediction of defective or non-defective modules.
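The four steps can be sketched in plain NumPy. This is a simplified local surrogate in the spirit of LIME, not the LIME library itself: it omits the discretization and sparse feature selection that LIME applies to tabular data, and the Gaussian perturbation, the exponential proximity kernel, and the toy black box `defect_probability` are assumptions.

```python
import numpy as np

def local_linear_explanation(predict, x, n_samples=2000, scale=1.0,
                             kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear surrogate to a black-box model around x."""
    rng = np.random.default_rng(seed)
    # 1. Perturb the instance of interest.
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.size))
    # 2. Query the black-box model on the perturbed instances.
    targets = predict(Z)
    # 3. Weight each perturbed instance by its proximity to x.
    dist = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 4. Weighted least squares: intercept plus one coefficient per feature.
    A = np.hstack([np.ones((n_samples, 1)), Z - x])
    sw = np.sqrt(weights)[:, None]
    coef, *_ = np.linalg.lstsq(A * sw, targets * sw[:, 0], rcond=None)
    return coef[0], coef[1:]

def defect_probability(Z):
    """Hypothetical black box: rises with feature 0, falls with feature 1, ignores feature 2."""
    return 1.0 / (1.0 + np.exp(-(0.8 * Z[:, 0] - 0.5 * Z[:, 1])))

x = np.array([0.5, -0.2, 1.0])
intercept, contributions = local_linear_explanation(defect_probability, x)
print(np.round(contributions, 3))
```

The sign of each coefficient indicates whether the feature pushes the local prediction towards or away from the defective class.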

SHAP (SHapley Additive exPlanation)
Developed by Lundberg et al. [55], SHAP builds on the idea of Shapley values for scoring the influence of model features. The Shapley value corresponds to the average marginal contribution of a feature value over all possible coalitions [51,56]. SHAP can quantitatively explain the prediction of a machine learning model. The Shapley value originates in a mathematical theory used to determine the contribution of game participants to the game results. Here, it is a method to calculate the contribution of each feature value to the predicted value, which is expressed by the formula:

$$f(z) = f(z') + \sum_{j=1}^{M} \theta_j \,(z_j - z'_j) \qquad (2)$$

Equation (2) represents the prediction made by the SDP model, where $f(z)$ is the output based on the input features $z$, and $f(z')$ is the output of a simpler linear model. The term $\sum_{j=1}^{M} \theta_j (z_j - z'_j)$ captures the difference between the predictions of the SDP model and the simpler linear model, with $\theta_j$ representing the Shapley value of each feature and $z_j - z'_j$ representing the deviation of the feature values from a reference point.

$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n - |S| - 1)!}{n!} \,\bigl(v(S \cup \{i\}) - v(S)\bigr) \qquad (3)$$

Equation (3) calculates the Shapley value $\phi_i(v)$ for a specific feature $i$, considering all possible subsets $S$ of the remaining features, where $N$ denotes the set of all features and $n = |N|$. It quantifies the marginal contribution of feature $i$ to the difference in predictions made by the model $v$ for different subsets of features.
Together, these equations illustrate how the Shapley values are used to explain the contribution of individual features to the predictions of the SDP model, shedding light on its working principle and factors influencing the output results.
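For a small number of features, the Shapley value in Equation (3) can be computed exactly by enumerating every subset. The sketch below does this for a hypothetical three-feature value function; the payoffs in `toy_value` are made up for illustration, and practical SHAP implementations rely on approximations because the enumeration grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n_features):
    """Exact Shapley values by enumerating all subsets S that exclude feature i."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Weight |S|! (n - |S| - 1)! / n! for subsets of this size.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_value(subset):
    """Hypothetical model output when only the features in `subset` are known."""
    total = 0.0
    if 0 in subset:
        total += 0.1
    if 1 in subset:
        total += 0.3
    if 0 in subset and 2 in subset:
        total += 0.2          # interaction between features 0 and 2
    return total

phi = shapley_values(toy_value, 3)
print([round(p, 3) for p in phi])
```

The interaction payoff of 0.2 is split equally between features 0 and 2, and the values sum to the difference between the output with all features and the output with none.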

Evaluation Strategy
This work verified the performance of the SDP models using evaluation measures that are widely used for classification models in the literature [8]: precision, recall, the F1-score, accuracy, and the AUC score. For the final measurement before applying model-agnostic explanations, the best-performing SDP models were selected based on accuracy and the AUC score.
Accuracy is the percentage of correctly classified results [8]. Accuracy can be calculated using Equation (4), where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (4)$$
Precision measures the accuracy of positive predictions made by the model. It is the ratio of true positives (correctly predicted positive instances) to the sum of true positives and false positives (incorrectly predicted positive instances). Recall measures the ability of the model to correctly identify all positive instances. It is the ratio of true positives to the sum of true positives and false negatives (positive instances incorrectly classified as negative). The F1-score is the harmonic mean of precision and recall. It provides a single score that balances both precision and recall.
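These definitions translate directly into code. The sketch below computes the four measures from confusion-matrix counts; the counts are hypothetical and chosen only to sum to a 60-module test set, and returning 0.0 for an undefined ratio is a convention assumed here.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0   # no positive predictions
    recall = tp / (tp + fn) if tp + fn else 0.0      # no positive instances
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for a test set of 60 modules, 12 of them defective.
scores = classification_metrics(tp=6, fp=4, tn=44, fn=6)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this example the accuracy of about 0.83 looks strong even though half of the defective modules are missed, which is why recall and the F1-score are reported alongside accuracy for imbalanced data.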
The area under the receiver operating characteristic (ROC) curve (AUC score) [46] is a measure of how well a classifier can distinguish between two classes (defective/non-defective). The "true positive rate" (TPR) is the proportion of instances labeled as defective that were correctly predicted, and the "false positive rate" (FPR) is the proportion of instances labeled as non-defective that are incorrectly predicted as defective. The higher the AUC score, the better the model's performance at distinguishing between the positive and negative classes; the objective is to obtain an AUC score of greater than 0.50, since a score of 0.50 corresponds to random guessing.
Finally, to compare the performance of defect prediction using cross-project metrics with individual project metrics, the average accuracy and AUC scores were calculated. The differences between the average scores and the cross-project defect prediction models were examined for each classifier.
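The AUC score also has a direct probabilistic reading: it equals the probability that a randomly chosen defective module receives a higher score than a randomly chosen non-defective one, with ties counted as one half. A minimal sketch of that pairwise computation follows; the labels and scores are made-up values.

```python
def auc_score(y_true, y_score):
    """AUC as the fraction of (defective, non-defective) pairs ranked correctly."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Four modules: two non-defective (0) and two defective (1).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_score(labels, scores)
print(auc)  # 0.75: three of the four defective/non-defective pairs are ordered correctly
```

A model that assigns the same score to every module obtains 0.5 under this definition, which matches the chance-level baseline.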

Results
This section presents the results of the software defect prediction models, beginning with the original datasets, for which models retaining highly correlated features are compared with models from which those features were removed, without any adjustment for class imbalance.

Performance of Machine Learning Algorithms on Original Dataset
Table 6 presents the results obtained from the original dataset while keeping the majority of the features, and Table 7 shows the performance of the SDP models on the original dataset with the reduced feature set obtained by removing highly correlated features.
Both tables show the project name, the machine learning techniques applied, and the evaluation metrics of accuracy, precision, recall, F1-score, and AUC, respectively. The numbers in bold show the highest value of each evaluation metric for each of the selected projects.
When all the features were retained in the datasets, the accuracy of the models ranged between 67% and 96%. The maximum precision was 67% when the KNN algorithm was applied in the jm1 project. The jm1 project had 4574 instances, which made this dataset larger than the other datasets except the cross-project (CP) dataset used in this study. The CP project had a precision value ranging from 21% to 34% depending on the algorithm applied. The precision score among all the projects had a wide range, from zero to a maximum of 67%. The highest recall value found across the selected projects was 40%, achieved when the SVM algorithm was applied in the CP project. The maximum F1-score was 28%, with zero or very low scores found from the majority of the algorithms. The highest AUC score was 60%. The ANN and RF algorithms performed relatively better than the other algorithms for the selected datasets. The AUC score from most of the algorithms was around 50%, apart from a few exceptions with the random forest or ANN algorithm. This does not give us confidence that the models are fully reliable, as the majority of the predictions are probably biased toward predicting the non-defective class.
Applying feature selection to the datasets by removing the highly correlated features, as shown in Table 7, did not demonstrate a significant difference from the results in Table 6, where all the features were retained. The accuracy score ranged from 72% to 95% with reduced features, compared with 67% to 96% when the majority of the features were kept, so the lower bound of accuracy was higher with reduced features.
The AUC score was slightly better for the pc1 project with reduced features when the random forest algorithm was applied, which showed an AUC score of 63% and an accuracy score of 95%. In terms of precision, except for the random forest model, which was able to detect all the positive instances on the cm1 project, the score varied for other projects, ranging from zero to 50%. The maximum recall value of the reduced features datasets was 29% compared to 40% when all the features were retained.
The maximum F1-score was 31% when the random forest algorithm was applied in the pc1 project with reduced features. On the other hand, the highest F1-score was 38% when the random forest algorithm was applied in the kc2 project with all features, but had a score of zero on the pc1 project when the random forest algorithm was applied. On the reduced features datasets, the random forest algorithm worked better compared to the other algorithms, with relatively higher accuracy and AUC score in all projects. To see the interpretation of this random forest model on the original dataset with reduced features, the LIME technique was applied on a single instance of the pc1 project to illustrate the local explanation, whereas SHAP was applied on the same project when the random forest algorithm was applied. The numbers in bold represent the highest score in each of the evaluation metrics by projects.
Figure 3 demonstrates the local and global interpretation of the pc1 project when the random forest algorithm was applied to the dataset with reduced features. Figure 3a demonstrates the local interpretation by the LIME technique for the selected instance. The LIME model predicted the outcome of this instance as 0, with a 77% probability of this instance not being defective and a 23% probability of it being defective. The orange color shows the probability of being defective, and the blue represents the probability of being non-defective. The right-hand side of the picture shows that the values of loc (lines of code), l (length), LOComment, and IOBlank contribute towards the module being defective, whereas the cyclomatic complexity, essential complexity, and effort values contribute towards predicting this instance as non-defective. Since the total feature contribution towards the non-defective class is higher, this instance is classified as 0, or non-defective.
Figure 3b shows the global explanation of the same project when SHAP was applied. This graph shows that IOBlank makes the largest contribution to the model, followed by loc, effort, length, and the other features. The blue and red color distribution shows that there was no bias towards predicting defective versus non-defective modules.
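A global SHAP ranking of this kind is commonly obtained by averaging the absolute SHAP value of each feature over all explained instances. The sketch below illustrates only that aggregation step; the SHAP values are hypothetical numbers chosen for illustration, not values from Figure 3b, and the helper name `global_importance` is an assumption.

```python
def global_importance(shap_values, feature_names):
    """Rank features by mean absolute SHAP value across instances."""
    n_instances = len(shap_values)
    ranking = []
    for j, name in enumerate(feature_names):
        mean_abs = sum(abs(row[j]) for row in shap_values) / n_instances
        ranking.append((name, mean_abs))
    # Largest mean |SHAP| first, as in a SHAP summary bar plot.
    return sorted(ranking, key=lambda item: item[1], reverse=True)


feature_names = ["IOBlank", "loc", "e"]
# Hypothetical per-instance SHAP values (one row per module).
# The sign gives the direction of the push; the magnitude gives its strength.
shap_values = [
    [0.20, -0.10, 0.05],
    [-0.30, 0.10, -0.05],
    [0.10, -0.20, 0.05],
]

ranking = global_importance(shap_values, feature_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the absolute value is taken before averaging, a feature that pushes some modules towards "defective" and others towards "non-defective" still ranks as important.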

Performance of Machine Learning Algorithms on Balanced Dataset
Table 8 presents the results obtained from the dataset after the oversampling technique SMOTE was applied while keeping the majority of the features, and Table 9 shows the performance of the SDP models on the balanced dataset where highly correlated features were removed.
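For readers unfamiliar with SMOTE, the core idea is to create each synthetic minority sample by interpolating between a minority sample and one of its nearest minority-class neighbors. The following is a simplified sketch of that idea, not the implementation used in this study; the function name `smote_sketch`, the neighbor count, and the sample points are all assumptions for illustration.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself),
        # ranked by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical defective-class samples with two metrics each.
minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [2.5, 2.5]]
synthetic = smote_sketch(minority, n_new=6)
print(synthetic)
```

Oversampling of this kind is applied to the training data only, so that the evaluation metrics are still computed on real modules.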
Both tables list the project name, the machine learning technique applied, and the evaluation metrics of accuracy, precision, recall, F1-score, and AUC, in the same layout as the tables for the original datasets. When all the features were retained in the datasets, the accuracy of the models ranged from 61% to 96%. The maximum precision was 47%, obtained when the KNN algorithm was applied to the kc2 project. The maximum recall was 71%, which is much higher than the recall found in the original dataset. The F1-score ranged from 16% to 53%, whereas in the original dataset, the maximum F1-score was 28%, with zero or very low scores for the majority of the algorithms. The highest AUC score in the balanced dataset was 77% for the pc1 project when the ANN algorithm was applied, whereas the maximum AUC score on the original dataset with all features was 60%. The performance of the algorithms varied across projects with different numbers of instances. The numbers in bold represent the highest score in each of the evaluation metrics by projects.
Applying feature selection algorithms in the datasets by reducing the highly correlated features, as shown in Table 9, did not demonstrate a significant difference from the results obtained in Table 8, where all the features were retained. The accuracy score ranged from 56% to 92%. However, when the majority of the features were kept, the accuracy ranged from 61% to 96%.
The minimum AUC score was 53% and the highest score for this metric was 71%. The minimum precision score was 18% and the maximum precision score was 38%. For the original dataset, several algorithms showed zero values for precision, which means the accuracy of predicting defective instances of the model was not satisfactory in the original dataset.
The recall has all positive values, ranging from 15% to 71%. The F1-score also shows all positive values, unlike the original dataset, where a couple of the algorithms returned zero scores on these projects.
Table 10 demonstrates the final evaluation by summarizing the content obtained from Tables 8 and 9. We considered accuracy and the AUC score before selecting a model for applying the model-agnostic technique. Although a high AUC score does not guarantee high values for precision, recall, and the F1-score, we note from this table that when the AUC was high, the accuracy, precision, recall, and F1-score were in an acceptable range in the balanced dataset, with non-zero values. The numbers in bold represent the highest score in each of the evaluation metrics by projects.
In Table 10, the projects are listed in order of dataset size, where the number of instances available in the cleaned dataset is as shown in Table 5. The numbers in bold show the highest accuracy score and AUC score for each project, whereas the fields highlighted in yellow mark the model with the highest AUC score for the selected project. Since this was an imbalanced dataset, rather than prioritizing the accuracy score, we considered the classifier with the highest AUC score as the best-performing model for the project in question. For instance, when the cm1 project with all features was considered, a higher accuracy score was observed in the ANN classifier, with a value of 89%, compared to an accuracy value of 64% in the KNN classifier. However, we considered the KNN classifier the best-performing model, as its AUC score was the highest among all the classifiers, with a value of 68%, and our primary goal was to identify the defective modules rather than to detect non-defective modules only.
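The preference for AUC over accuracy can be illustrated with a small worked example. AUC equals the probability that a randomly chosen defective module receives a higher score than a randomly chosen non-defective one, so it can be computed by comparing all such pairs. The labels and scores below are hypothetical and are not taken from Table 10; `model_a` mimics a classifier that scores every module the same, and `model_b` one that ranks defective modules higher at the cost of some false alarms.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (defective, non-defective) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def accuracy(y_true, scores, threshold=0.5):
    """Accuracy after thresholding the scores into 0/1 predictions."""
    correct = sum(1 for t, s in zip(y_true, scores) if int(s >= threshold) == t)
    return correct / len(y_true)


# Hypothetical test set: 8 non-defective modules followed by 2 defective ones.
y_true = [0] * 8 + [1] * 2
model_a = [0.1] * 10
model_b = [0.2, 0.3, 0.6, 0.1, 0.7, 0.55, 0.2, 0.3, 0.8, 0.65]

for name, scores in (("A", model_a), ("B", model_b)):
    print(name, "accuracy:", accuracy(y_true, scores),
          "AUC:", roc_auc(y_true, scores))
```

In this toy case model A has the higher accuracy but an AUC of 0.5, while model B has lower accuracy and a much higher AUC, which mirrors the reasoning behind choosing KNN over ANN for cm1.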
With both all features and reduced features, the KNN algorithm performed well for the relatively smaller projects, namely kc2 and cm1. It is also worth noting that the KNN model achieved an accuracy score of 75% in the cross-project model, surpassing the average accuracy of 74% across the independent projects. On the other hand, the AUC score of 56% obtained with the same classifier is relatively poor. Taking into consideration the overall AUC and accuracy, the ANN model outperformed the other classifiers in the cross-project models.
The top half of Table 10 reports the accuracy and AUC score of each project for the selected machine learning algorithms. These algorithms were applied to the full-feature datasets, excluding the LocCodeAndComment feature. The bottom half of the table follows the same layout, except that the SDP models were created on the reduced-feature datasets, which were obtained after removing the highly correlated independent features. The row denoted "Average" gives the average of the accuracy and AUC scores across the projects for the SVM, KNN, RF, and ANN models. The average score is compared with the cross-project (CP) SDP model scores.
Table 10. Comparison of the performance of each project along with an average of individual project scores with cross-project scores for the approach selected with the most features and reduced features.
The SVM model performed better on the kc1 project, which consists of a dataset with 718 instances available after filters were applied. Compared to the other datasets, this one is considered mid-size (Table 5).

In terms of the cross-project datasets, the ANN model developed on the reduced features achieved a slightly higher AUC score of 61% compared to 58%. For the pc1 project, which had mid-sized samples, the ANN classifier showed higher accuracy, with values of 95% and 92% for the all-features and reduced-features datasets, respectively. Although the same ANN model with all features showed the highest AUC of 77% for the pc1 dataset, the AUC score dropped to 68% on the reduced-feature dataset.
Figure 4a shows the SHAP explanation of the cross-project defect prediction model built with the ANN classifier using all features on the balanced dataset. The highest mean SHAP value for this project belongs to branchcount, followed by IOBlank. This model includes all the features, but the effort attribute "e" ranks lowest compared to the other source code and size metrics. For the same ANN classifier with reduced features, SHAP was applied to provide the global explanation, as shown in Figure 4b. This figure illustrates that in the SDP model constructed with the ANN classifier, the "loc" attribute makes the highest contribution towards the prediction outcome, followed by cyclomatic complexity and intelligence. "Effort" is considered a predictor that has more importance than program length and lines of code, but less than program complexity-related features. Additionally, the "project" field is shown to have the lowest importance. This indicates that the project source, which includes cross-project information, does not significantly impact the development of this model.
Figure 5 demonstrates the interpretation of a local observation of the same ANN-based model using LIME. The LIME model predicted this record as non-defective with a confidence level of 71%. In this figure, the blue color represents a feature's contribution towards a non-defective prediction, whereas the orange color represents a contribution towards a defective prediction. The effort attribute was one of the important contributors in this prediction; a value of less than −0.61 pushes the prediction towards a value of 0 (non-defective). Similarly, the cyclomatic complexity of the code was less than −64, and intelligence also contributes to the prediction of non-defectiveness. It seems that the attribute "intelligence", which determines the amount of intelligence presented in the program, was lower than the given threshold of −75 to be considered defective. This local prediction explains that the selected module had lower values than the set thresholds for effort, cyclomatic complexity, and intelligence to be considered defective. At the same time, the same prediction shows a 29% probability of this module being defective, as loc, l, and LOComments are higher and ev(g) is less than the given threshold. The test lead can better interpret the outcome by observing the impact of each of the attributes on this prediction. This aligns with the logical concept that software that requires less effort and is not very complex may have relatively fewer bugs compared to a complex system.
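The local explanation described above rests on a simple mechanism: perturb the instance, query the black-box model on the perturbed samples, weight each sample by its proximity to the instance, and fit a weighted linear model whose coefficients serve as the feature contributions. The sketch below illustrates that mechanism only. It is not the LIME library, the `black_box` function is a hypothetical stand-in for the trained ANN, and the two standardized feature values are invented.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def local_surrogate(predict, instance, n_samples=200, width=1.0, seed=0):
    """Fit a proximity-weighted linear model; returns [intercept, w_1, ..., w_d]."""
    rng = random.Random(seed)
    d = len(instance)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [v + rng.gauss(0.0, 1.0) for v in instance]  # perturbed sample
        dist2 = sum((a - b) ** 2 for a, b in zip(z, instance))
        rows.append([1.0] + z)
        targets.append(predict(z))
        weights.append(math.exp(-dist2 / width ** 2))  # closer samples count more
    # Weighted normal equations: (X^T W X) beta = X^T W y
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
             for j in range(d + 1)] for i in range(d + 1)]
    xtwy = [sum(w * r[i] * t for r, t, w in zip(rows, targets, weights))
            for i in range(d + 1)]
    return solve(xtwx, xtwy)


def black_box(z):
    """Hypothetical defect-probability model over (effort, complexity)."""
    return 1.0 / (1.0 + math.exp(-(0.8 * z[0] + 1.5 * z[1])))


# Hypothetical standardized values for one low-effort, low-complexity module.
intercept, w_effort, w_complexity = local_surrogate(black_box, [-0.7, -0.9])
print(w_effort, w_complexity)
```

A positive local weight means that raising the feature raises the predicted defect probability near this module, so a module with low effort and low complexity is pushed towards the non-defective class, consistent with the reading of Figure 5.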
The interpretability of the kc1 project with mid-sample size (based on the number of instances available) is examined in Figures 6 and 7. Both models interpreted with LIME and SHAP show the importance of the "effort" feature for this algorithm. The local prediction using the LIME method was able to explain the observation with 60% confidence that this module is not likely to be defective. In both models, the "loc" feature is less important compared to the effort metrics. We observe that there can be slight differences between these predictions: the global agnostic model provides generic information, while the LIME method offers insight into individual predictions.

Analysis and Discussion
In this study, we aimed to address multiple research questions, and the findings are as follows. Regarding RQ1, the prediction models provided consistent results, with slight variations depending on the sample size and selected features. For example, the smallest sample size after cleaning the data was project kc2, and the KNN model performed the best in both feature selection approaches. Additionally, the cross-project data yielded good results for the ANN classifier, as deep learning models tend to perform better with larger datasets. There were trade-offs between accuracy and AUC; project pc1 demonstrated the highest performance, with an accuracy above 90% and an AUC score close to 70%. The "effort" attribute was not necessarily the most influential feature across all the outcomes. The SHAP and LIME models applied to the developed classifiers showed that for cross-project defect prediction, "effort" ranked lower in terms of feature importance. However, for a few individual projects, "effort" ranked second in terms of importance. This finding aligns with the understanding that project size and complexity are crucial factors in defect prediction.
RQ2 investigated the impact of removing highly correlated independent features. It was found that this removal improved the average AUC score for the SVM, KNN, and RF models, with a slight decrease in the ANN model. This indicates that removing correlated features had varying effects on model performance but did not significantly alter the overall outcomes. The SDP models became easier to interpret with the SHAP and LIME techniques, as having more features would show blank values for features that did not contribute to the predicted outcomes. Although the SVM and KNN models' accuracy decreased slightly with reduced features, this was compensated by a relatively higher AUC score.
To address RQ3, we compared the cross-project (CP) model score with an average score from the models built from individual projects. It appears that the CP models performed slightly lower than the individual projects' average performance. However, the performance did not go down drastically, and the results were very close to the average score. For instance, the ANN model with reduced features shows an average accuracy and AUC score of 76% and 61%, whereas the cross-project model had accuracy and AUC values of 70% and 61%, respectively. CP can be useful when historical information is not available for similar projects.
Finally, RQ4 focused on interpreting the predictions even after applying oversampling using SMOTE. It was found that both SHAP and LIME successfully explained the predictions, providing insights into the contributing features.
For the feature selection algorithm, we used a threshold of 70%. Further experiments can be conducted to find the best threshold for selecting features when multicollinearity exists.
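A correlation-based filter with a 70% threshold can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this study: it assumes the absolute Pearson correlation as the measure, keeps the first feature of each correlated group in column order, and uses hypothetical metric values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def drop_correlated(columns, threshold=0.7):
    """Keep a feature only if |r| <= threshold with every feature kept so far."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept


# Hypothetical metric values for five modules.
columns = {
    "loc": [10, 20, 30, 40, 50],
    "n": [12, 19, 33, 41, 48],   # grows almost linearly with loc
    "ev_g": [3, 1, 4, 1, 5],     # only weakly related to loc
}

kept = drop_correlated(columns)
print(kept)
```

Varying `threshold` in such a filter is one way to run the further experiments suggested above, since a lower threshold removes more features and a higher one removes fewer.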
We experimented with both the imbalanced data and the data obtained after applying the SMOTE technique to create synthetic samples for the minority class. Training the model with more balanced data would enhance confidence in generalizing the model. Also, for comparison, five projects from the same source were merged to observe the impact of cross-project defect prediction. This can be extended by using data from various sources to bring more variation to the dataset.

Conclusions and Future Work
This study aimed to develop software defect prediction models with a focus on interpretability, using individual projects and a cross-project dataset that combines them. For comparison, models were developed both with the highly correlated features retained and with them removed. Test managers may need to work with a subset of features rather than all the features when historical information on various metrics is not available. Our research indicates that removing the multicollinear features yielded consistent results.
The findings indicate that the cross-project defect prediction model does not significantly compromise performance. While the average of the individual projects generally achieved better AUC scores, the results were comparable to the scores of the cross-projects. This implies that when historical information for an exact similar project is unavailable, users can still utilize the cross-project dataset for predicting defective outcomes in new software applications.
Model-agnostic techniques, such as SHAP and LIME, were employed to explain the models. Despite the use of SMOTE for data oversampling, the SHAP model demonstrated unbiased predictions and the data re-balancing approach did not affect the interpretability. Regarding feature importance, both the feature selection processes provided interpretations that aligned with expectations. Occasionally, there were slight differences between local predictions and global predictions. This can be attributed to the fact that local predictions consider individual records and certain features may contribute differently to specific outcomes, resulting in slight variations from the global prediction. Notably, it was possible to predict defects using only five attributes in one of the datasets.
In the future, there is room for research to be conducted on applying PCA and other methods to tackle multicollinearity issues and to consider their impact on interpretability.
In conclusion, it was possible to develop a defect prediction model without bias. It appears that both source code and effort metrics can be important for defect prediction.
Figure 3. SDP model interpretation on the original dataset of project pc1 with reduced features. (a) LIME model: dataset contains reduced features. (b) SHAP model: dataset contains reduced features.

Figure 4. SHAP applied on cross-project dataset developed using ANN model. (a) SHAP model: dataset contains all features. (b) SHAP model: dataset contains reduced features.

Figure 5. Demonstration of local interpretation of selected records from cross-project data when ANN was applied on reduced-feature-based model.

Figure 6. Model interpretation of SDP model generated from kc1 file with LIME on SVM classifier. The green color bar shows the features that are contributing to increasing the probability of this instance being defective, and red represents the feature's probability of not being defective.

Figure 7. Model interpretation of SDP model generated from kc1 file with SHAP on SVM classifier.

Table 2. Individual field or feature description of the selected projects from the PROMISE dataset.

Table 3. Data cleaning criteria applied on PROMISE dataset.

Table 4. Features selection after removing highly correlated attributes.

Table 5. Distribution of defective vs. non-defective samples before and after applying SMOTE in the training dataset.

Table 6. Performance of the evaluation metrics on the original dataset with retention of highly correlated features. The numbers in bold represent the highest score in each of the evaluation metrics by projects.

Table 7. Performance of the evaluation metrics on the original dataset with reduced features through the feature selection process of removing highly correlated features.

Table 8. Performance of the models after applying SMOTE on datasets that retained highly correlated features.

Table 9. Evaluation metrics after SMOTE was applied in the training dataset after removing the highly correlated features.

With 20 Common Features
The numbers in bold represent either the highest accuracy or AUC for the referred project among the applied algorithms.The numbers highlighted in yellow represent the models with the highest AUC score.