Empirical Study of Software Defect Prediction: A Systematic Mapping

Abstract: Software defect prediction has been one of the key areas of exploration in the domain of software quality. In this paper, we perform a systematic mapping to analyze all the software defect prediction literature available from 1995 to 2018 using a multi-stage process. A total of 156 studies are selected in the first step, and the final mapping is conducted based on these studies. The ability of a model to learn from data that does not come from the same project or organization will help organizations that do not have sufficient training data or are going to start work on new projects. The findings of this research are useful not only to the software engineering domain, but also to empirical studies that mainly focus on symmetry, as they provide step-by-step solutions for the questions raised in the article.


Introduction
Defect Prediction in Software (DeP) is the process of determining parts of a software system that may contain defects [1]. Application of DeP models early in the software lifecycle allows practitioners to focus their testing manpower so that the parts identified as "prone to defects" are tested with more rigor than other parts of the software system [2]. This reduces manpower costs during development and also relaxes the maintenance effort [3,4]. DeP models are built using two approaches: first, by using measurable properties of the software system called Software Metrics and second, by using fault data from a similar software project. Once built, the DeP model can be applied to future software projects, and hence practitioners can identify defect-prone parts of a software system.
Initially, the models used for DeP were built using statistical techniques. To make the model intelligent, i.e., capable of adapting to changing data so that the DeP model matures as the development process matures, it is important that learning techniques are used while building DeP models. In the past, a large number of DeP studies have made use of machine learning techniques. For instance, some authors used Decision Tree (DT) [5] and Artificial Neural Network (ANN) [6] to build DeP models using Object-Oriented (OO) metrics [7].
Although a large number of studies have been conducted in this domain, more effort is still needed. New algorithms are developed periodically, keeping in mind the shortcomings of the previous ones. The development processes also keep changing; this affects the type of data being maintained for a software project (defect data). Such factors require that new modeling techniques be built and benchmarked against previous work to evaluate their efficacy. In short, the following challenges are encountered:
1. Difficulty in separating correct theories from incorrect ones when the purpose of evaluation is practice.
2. Difficulty in distinguishing quality literature from literature that lacks quality.
To identify lacking areas and facilitate the building and benchmarking of new DeP models, it is important to study the previous works in a systematic manner and extract meaningful information. Literature surveys and systematic mappings have attempted to answer various questions about DeP models in the past. The authors in Reference [6] studied 106 papers published between 1991 and 2011 to identify software metrics that are important for software defect prediction [8][9][10][11]. They answered five research questions corresponding to various aspects of the software defect prediction problem, with their focus on the data used in model building. The authors in Reference [12] examined 36 defect prediction studies between 2000 and 2010. Their focus is on independent variables and the techniques used to build models. They also paid attention to whether feature selection on independent variables has been done or not. The authors in Reference [13] examined 148 studies between 2000 and 2009 with the focus on model building and performance evaluation techniques. The authors in References [14,15] reviewed 74 studies in 11 journals and seven conference proceedings. The focus of their study is on model building methods, metrics used, and datasets used. Other works can be seen in [16].
To the best of our knowledge, no systematic review to date brings all aspects of DeP together. To help researchers and practitioners, it is imperative that good quality literature surveys and reviews that are concerned with the literature available for DeP be conducted and their results made available to the software engineering community. In this systematic mapping, we address various questions, including the above-stated questions (Section 2.1). For this mapping, we collect all DeP studies between 1995 and 2018.
The remainder of the paper is organized as follows. Section 2 describes the research questions addressed in this systematic mapping and the process for selecting primary studies. Section 3 gives the answers to the research questions identified in this work. Section 4 concludes this systematic mapping and provides future directions extracted from it.

Review Method
The mapping method in this study is taken from Reference [1]. Figure 1 outlines the process diagram. The first step is to identify the need for conducting the systematic mapping and establishing a protocol for the mapping. After this, the research questions that the systematic mapping attempts to address are formulated. Once the questions have been identified and evaluated, a search query is designed, which is used to extract studies from digital libraries. The studies collected are then passed through a four-stage process:

Literature Review
Primary studies, by definition, are the literature being mapped. To provide a strong mapping, the primary studies must be selected carefully. Although an exhaustive search is desirable, in some cases it is not possible because of the number of primary studies available. In such cases the search criteria become important. For DeP studies, we can conduct an exhaustive search because the number of primary studies is not very large, and since we are only concerned with studies that are empirical in nature, the number shrinks further. The digital libraries selected for our search include Springer Link, Wiley Online Library, ACM Digital Library, and Google Scholar.
The Search String: A search string is the combination of characters and words entered by a user into a search engine to find desired results. The information provided to the search engine of the digital library directly impacts the results provided by it. To ensure that all the primary studies that our mapping plans to address are covered, we need to be careful in the selection and placement of keywords used in the search string.
Search string.

software .( defect + fault ) . ( software metrics + object oriented metrics + design metrics )
Here, '.' corresponds to the Boolean AND operation and '+' corresponds to the Boolean OR operation. The search string was executed on all six electronic databases mentioned above, and the publication year was restricted to the range 1995-2018. The literature thus obtained was processed further using carefully designed inclusion-exclusion criteria and quality analysis criteria, as described in the following sections.
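As a rough illustration of the Boolean semantics described above, the following sketch expands the search string into an explicit AND/OR query and checks a title against it. The normalization step, the substring matching, and the function names are assumptions made for this example; each digital library applies its own query syntax and matching rules.

```python
import re

# Term groups taken from the search string in the text:
# groups are combined with AND ('.'), terms inside a group with OR ('+').
SEARCH_GROUPS = [
    ["software"],
    ["defect", "fault"],
    ["software metrics", "object oriented metrics", "design metrics"],
]


def normalize(text):
    """Lowercase and replace punctuation/hyphens with spaces (assumed normalization)."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def matches_search_string(text, groups=SEARCH_GROUPS):
    """True if every AND-group has at least one OR-term occurring in the text."""
    haystack = normalize(text)
    return all(any(term in haystack for term in group) for group in groups)


def to_boolean_query(groups=SEARCH_GROUPS):
    """Render the groups as an explicit AND/OR query string."""
    return " AND ".join(
        "(" + " OR ".join('"%s"' % term for term in group) + ")"
        for group in groups
    )
```

For a hypothetical title such as "Predicting fault-prone software modules with object-oriented metrics", `matches_search_string` returns `True`, because each of the three groups contributes at least one matching term.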
The Inclusion-Exclusion Criteria: Executing the search string may still return some primary studies that either do not add value to the mapping or do not fall within the purview of what the mapping aims to accomplish.
Once all the primary studies have been obtained, carefully designed inclusion-exclusion criteria are applied to the resultant set in order to eliminate entries that do not match the objectives of the mapping.
Inclusion Criteria:
• Empirical studies for DeP using software metrics.
• Studies that provide empirical analysis using statistical, search-based and machine learning techniques.
Exclusion Criteria:
• Literature Reviews and Systematic Reviews.
• Studies that do not use DeP as the dependent variable.
• Studies of non-empirical nature.
• If two studies by the same author(s) exist, where one is an extension of the previous work, the former is discarded. But if the results differ, both are retained.
Review Committee: We formed a review committee comprising two Assistant Professors and two senior researchers to rate all primary studies captured by the search. All studies were examined independently on the basis of the inclusion-exclusion criteria defined above. Application of the inclusion-exclusion criteria resulted in 98 studies out of the total 156 studies being selected for quality analysis.
Quality Analysis: Assessing the quality of a set of primary studies is a challenging task. A quality analysis questionnaire is prepared as part of this systematic mapping to assess the relevance of the studies taking part in this mapping. The questionnaire takes into consideration the suggestions given in Reference [5]. A total of 18 questions, given in Table 2, together form the questionnaire, and each question can be answered as "Agree" (1 point), "Neutral" (0.5 points) or "Disagree" (0 points). Hence, a study can have a maximum of 18 points and a minimum of 0 points.
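The scoring rule above can be sketched in a few lines. The point values and the number of questions come from the text; the function names are illustrative, and the cutoff of 13 points for the "very high" group is our reading of the later statement that 22 studies in that group obtained at least 13 points. The boundaries of the other groups are not stated, so they are not modeled here.

```python
# Points per answer, as defined for the quality analysis questionnaire.
POINTS = {"Agree": 1.0, "Neutral": 0.5, "Disagree": 0.0}
NUM_QUESTIONS = 18

# Assumption: "very high" means at least 13 points (inferred from the
# results section, not stated as a formal threshold).
VERY_HIGH_CUTOFF = 13.0


def quality_score(answers):
    """Sum the points for one study's answers to the 18 quality questions."""
    if len(answers) != NUM_QUESTIONS:
        raise ValueError("expected %d answers, got %d" % (NUM_QUESTIONS, len(answers)))
    return sum(POINTS[answer] for answer in answers)


def is_very_high(score):
    """True if the score falls in the assumed 'very high' group."""
    return score >= VERY_HIGH_CUTOFF
```

For example, a study with 10 "Agree", 6 "Neutral" and 2 "Disagree" answers scores 10 + 3 + 0 = 13 points.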

The same review committee enforces the quality analysis questionnaire.
Data Extraction: To extract meaningful information from each study such that the research questions can be answered, the data extraction method should be objectively defined. We created the data form shown in Figure 2 and filled it in for each primary study (SE) that passes the quality analysis criteria. The review committee filled in the data extraction card, and any conflicts raised during the data extraction process were resolved by taking suggestions from other researchers. The resultant data was converted to an Excel (.xlsx) Workbook for use during the data synthesis process.
Data Synthesis: Data synthesis involves the accumulation of facts from the data collected during the data extraction process to build responses to the research questions [5]. In the data synthesis process, we identify SEs that have similar viewpoints regarding a particular research question. This allows for reaching conclusive answers to the research questions identified as part of the systematic mapping. Details of the techniques used to answer the selected research questions are given below:
1. RQ1: To answer this question we use a bar chart that shows the number of studies using machine ...
7. RQ7: This question uses bar charts and tables to show the distribution of studies that address security related defects and vulnerabilities.
8. RQ8: This question does not use any diagramming method.
9. RQ9: This question does not use any diagramming method.

Description of SEs
This study is very useful for developers. Upon the application of the quality analysis criteria, we have selected 98 primary studies, referred to as SEs, that use a mix of learning techniques, including machine learning, search-based techniques, statistical techniques and some hybrid frameworks. Most of the studies made use of public domain data and standard performance measures like accuracy, error rate, precision, recall and ROC analysis.

Quality Assessment Questions
Table 3 shows the results of the quality analysis. It shows the percentage and number of SEs that agree with, disagree with, and stand neutral towards each quality question. From the results of the quality assessment procedure, it is clear that QQ4, QQ8, and QQ18 have been answered negatively by most of the studies. This is an important finding and strongly suggests that the available literature lacks the use of multi-co-linearity analysis procedures, statistical tests of significance, and comparisons of machine learning techniques against statistical techniques. Also, the results of QQ5, QQ10, QQ11, and QQ17 are well distributed between the positive and negative ends. We also found that the abstracts of most studies were only partially informative about what the reader will gain by reading the entire study. The majority of the studies have not clearly stated any limitations of their work, but some limitations can be extracted by reading the entire manuscript.
The scores obtained by SEs in the quality analysis step are separated into four groups: Low, Average, High and Very High. The SEs that obtained High and Very High scores were considered for the mapping. Table 4 summarizes the percentage and the number of SEs in each of the four categories. We also assign a unique identifier to each SE. Table 5 presents the SE identifier along with its reference number. The identifiers allow for easy interpretation of results, as they are used in all subsequent sections. Table 6 provides the quality analysis score for all studies that obtained quality scores in the "very high" category. A total of 22 studies obtained at least 13 points in the quality analysis process. The primary study identifier along with the quality score is given. Details of the studies published in journals and conferences of high repute are also reported. It can be noted that IEEE Transactions on Software Engineering has the highest percentage of selected studies (11.22%), followed by the Journal of Systems and Software (10.20%).

RQ1: What Techniques Have Been Used for Building DeP Models?
In this section, we elaborate on the distribution of model building techniques used in DeP studies. As shown in Figure 4, the most used learning method is the Decision Tree: 44 out of the 98 studies selected for this systematic mapping use some variant of the decision tree method. Techniques like C4.5, J48, CART, and Random Forest come under the decision tree class. Bayesian learners [65], i.e., Naïve Bayes, Bayes Net, etc., have been used by 39 studies. Regression, Discriminant analysis and Threshold based classification have been performed in 35, 8 and 4 studies, respectively. Support vector machines and neural networks have each been used in 21 studies.

3.3. RQ2: What Are the Different Data Pre-Processing Techniques Used in DeP Models?
Research question #2 summarizes all the techniques that deal with data pre-processing.This study answers this question in two parts; the first part summarizes all methods that are used to analyze the degree of multi-co-linearity among features and the second part deals with the techniques used to select relevant features from the dataset.

What Are the Different Techniques Used for Multi-Co-Linearity Analysis in DeP Studies?
Analysis of multi-co-linearity is an important step of data pre-processing because it ascertains that the data being used for the study is not collinear. Collinear data does not add value to the study, as one variable can be used to predict another variable, which reduces the independence among variables. This affects the final results in two ways:
1. The predictive capability of independent variables cannot be assessed effectively, as at least one independent variable 'x' is highly dependent on another independent variable 'y'. If the variable 'y' is a significant predictor of defects, then the independent variable 'x' can also present itself as a significant predictor even if 'x' by itself does not carry any meaningful information that is different from the information available with 'y'.
2. The final results reported by the prediction model are affected.
Table 8 shows all studies that apply multi-co-linearity procedures and the corresponding procedure applied. Only seven out of 98 studies have stated in their work that they make use of multi-co-linearity procedures.

Figure 5 shows the distribution of multi-co-linearity analysis techniques used across the seven studies. The Condition Number method is the most popular in DeP studies; four studies make use of this method. VIF (variance inflation factor) and PCA (principal component analysis) were applied in two studies each. The reports corresponding to RQ2.1 show that only 7.14% of all studies make use of multi-co-linearity analysis techniques. This is a very small number and reflects the fact that very little emphasis has been laid on the analysis of multi-co-linearity among data elements present in the datasets used for DeP studies in the past.
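To make the two most common diagnostics concrete, the sketch below computes the VIF and a condition number for the simplest case of two predictors. It is a minimal illustration under stated assumptions, not the procedure of any surveyed study: with exactly two predictors the VIF reduces to 1 / (1 - r^2), where r is their Pearson correlation, and the condition number is taken here as the square root of the ratio of the eigenvalues (1 + |r| and 1 - |r|) of the 2 x 2 correlation matrix. The metric values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom > 0 else 0.0


def vif_two_predictors(xs, ys):
    """VIF for the two-predictor case: 1 / (1 - r^2)."""
    r_squared = pearson(xs, ys) ** 2
    return math.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)


def condition_number_two_predictors(xs, ys):
    """sqrt(lambda_max / lambda_min) of the 2x2 correlation matrix."""
    r = abs(pearson(xs, ys))
    return math.inf if r >= 1.0 else math.sqrt((1.0 + r) / (1.0 - r))


# Hypothetical metric values for five classes (not taken from any dataset).
loc = [10, 20, 30, 40, 50]
wmc = [1, 3, 2, 5, 4]
```

With these values the correlation is 0.8, so the VIF is 1 / 0.36 (about 2.78) and the condition number is 3. With more than two predictors, each VIF requires regressing one predictor on all the others, which is usually delegated to a statistics library.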

What are Different Techniques Used for Feature Sub Selection in DeP Studies?
In this section, we determine the number of studies that state the use of feature sub-selection techniques and which feature sub-selection techniques are widely used. It is important to conduct feature sub-selection on the input data before it is supplied to the learning algorithm, because the data could contain redundant and irrelevant features. Out of the 98 studies selected for this systematic mapping, 44 studies (about 45 percent) use feature sub-selection methods. Figure 6 shows the distribution of feature sub-selection techniques across all 44 studies. The most frequently used technique for feature selection is correlation-based feature selection (CFS); 19 studies make use of CFS. After CFS, the most used technique is Principal Component Analysis (PCA). It should be noted that while only seven studies (7.14%) apply multi-co-linearity analysis, 44 studies (about 45%) apply feature sub-selection. Since DeP studies are heavily dependent on data and their primary aim is to identify metrics that are relevant to DeP, the result of our analysis towards RQ2 is not very encouraging. Every DeP study must ensure that the data used for model building is free from multi-co-linearity and only consists of relevant metrics so that highly accurate estimates of metric relevance and model performance can be made.
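The idea behind correlation-based feature selection can be sketched as follows. CFS scores a subset of k features by the merit k * r_cf / sqrt(k + k(k - 1) * r_ff), where r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation, so that subsets of relevant but mutually uncorrelated features score highest. The sketch assumes absolute Pearson correlation as the correlation measure and a simple greedy forward search; published CFS implementations often use other measures and search strategies. The metric names are borrowed from the text, but the values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient (0.0 if either input is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom > 0 else 0.0


def cfs_merit(subset, features, target):
    """CFS merit of a feature subset: k*r_cf / sqrt(k + k(k-1)*r_ff)."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(features[name], target)) for name in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def greedy_cfs(features, target):
    """Add the feature that most improves the merit until nothing improves it."""
    selected, best = [], 0.0
    remaining = sorted(features)
    while remaining:
        merit, name = max(
            (cfs_merit(selected + [candidate], features, target), candidate)
            for candidate in remaining
        )
        if merit <= best:
            break
        selected.append(name)
        remaining.remove(name)
        best = merit
    return selected


# Hypothetical metric values for six classes and their defect labels.
EXAMPLE_FEATURES = {
    "loc": [10, 20, 30, 40, 50, 60],
    "rfc": [5, 9, 6, 12, 8, 14],
    "noc": [3, 1, 2, 3, 1, 2],
}
EXAMPLE_TARGET = [0, 0, 0, 1, 1, 1]
```

On this toy data `greedy_cfs(EXAMPLE_FEATURES, EXAMPLE_TARGET)` keeps only `loc`: adding `rfc` lowers the merit because it is strongly correlated with `loc`, and `noc` is uncorrelated with the defect label.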

RQ3: What Is the Kind of Data Used in DeP Studies?
In this question, we deal with various questions relating to the data used in a DeP model. We answer this question in three parts. In the first part, we analyze existing studies to find out what type of data has been used in model training and testing. In the second part, we examine all studies to find out which metrics were found relevant and what percentage of studies report a particular metric as relevant for DeP. Finally, we examine all the studies to find out which metrics were found unsuitable for DeP.

Which Data Sets Are Used for DeP?
In this question, we attempt to summarize the various datasets that have been used by DeP studies in the past. A wide variety of public domain datasets like NASA, Eclipse and Apache have been used by multiple researchers to build DeP models; at the same time, there are instances where studies have used data from systems built by students, while some have used industrial systems to generate data. Figure 7 shows the distribution of datasets used for building DeP models. The most popular datasets for DeP model building come from the NASA MDP program; a total of 42 studies out of 98 use data provided under this program. The Eclipse dataset has been used by 10 studies, while Apache datasets are used by four studies. The data from the PROMISE repository has been used by 17 studies. The Mozilla foundation has also provided data to four studies; three of these studies address security-related defects. The "Others" category includes all industrial software systems, legacy systems and data sources which were not stated in the studies.

Which Metrics Are Found to Be Significant Predictors for DeP?
In this part, we identify the metrics that are found to be significant predictors of software defects. Figure 8 shows that the metric most often reported as a significant predictor of defects is LOC; a total of 26 studies have reported LOC as a significant predictor of defects. The coupling metric CBO and RFC have been reported as significant by 25 and 22 studies, respectively. The complexity metric WMC is also reported significant by 15 studies. Only nine studies found DIT to be a significant predictor of software defects, and NOC is found to be significant in six studies.
In this part, we identify metrics that are not found suitable for software defect prediction. It has been widely recognized that the inheritance metrics are the weakest predictors of defects. In this mapping, we observed that both of the Chidamber and Kemerer inheritance metrics, DIT and NOC, have been widely reported as insignificant predictors for software defect prediction. DIT has been reported insignificant by 23 studies, while NOC is reported weak by 28 studies. Some studies have reported other metrics like LCOM3 and LCOM as weak predictors of defects. LCOM3 and LCOM have been reported insignificant by 7 and 11 studies, respectively.

RQ4: What Are the Methods Used for Performance Evaluation in DeP Models?
In this question, we summarize the various performance evaluation methods used in DeP studies. This question is answered in two parts. In the first part, we analyze the various performance measures used to capture the performance of a DeP model, and in the second part, we summarize the statistical tests used to conduct a comparative analysis of performance measures.

What are Different Performance Measures Used for DeP Studies?
There are a large number of performance measures available for use. Some standard performance measures are Accuracy, Sensitivity, Specificity, Area under the ROC Curve, Precision and Recall. From Figure 9 it is clear that many of these popular performance measures have been used in DeP studies. The most popular measure is Recall, which is used by 39 out of 98 studies. ROC Area, Accuracy and Precision have been used in 28, 25 and 27 studies, respectively.
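For reference, the threshold-based measures named above all derive from the four cells of a confusion matrix. The sketch below is illustrative only: it assumes binary labels with 1 meaning "defective", the function names are our own, and the area under the ROC curve is omitted because it requires predicted scores rather than hard labels.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, tn, fn) for binary labels where 1 means 'defective'."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, fp, tn, fn


def safe_ratio(numerator, denominator):
    """Return numerator/denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def performance_measures(actual, predicted):
    """Accuracy, precision, recall (sensitivity) and specificity."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    return {
        "accuracy": safe_ratio(tp + tn, tp + fp + tn + fn),
        "precision": safe_ratio(tp, tp + fp),
        "recall": safe_ratio(tp, tp + fn),       # also called sensitivity
        "specificity": safe_ratio(tn, tn + fp),
    }
```

For example, with actual labels `[1, 1, 1, 0, 0, 0, 0, 1]` and predictions `[1, 0, 1, 0, 1, 1, 0, 1]` there are 3 true positives, 2 false positives, 2 true negatives and 1 false negative, giving accuracy 0.625, precision 0.6, recall 0.75 and specificity 0.5.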

There are a large number of performance measures available for use. Some standard performance measures are Accuracy, Sensitivity, Specificity, Area under the ROC Curve, Precision and Recall. From Figure 9 it is clear that many of these popular performance measures have been used in DeP studies. The most popular measure is Recall, which is used by 39 out of 98 studies. ROC Area, Accuracy and Precision have been used in 28, 25 and 27 studies, respectively.

Here, we analyze how many studies have made use of statistical tests in their evaluation of the results. By using such methods, the researcher can give a mathematical foundation to the evaluation process. Some advantages of using statistical tests are:
1. Statistical tests are standardized and allow for suppression of author bias.
2. Every statistical test follows a mathematical procedure and hence the reader(s) can better understand how the results were processed. The authors are not required to illustrate the entire result processing algorithm if they choose to use a standard statistical test.
Based on the above facts, we strongly suggest that statistical tests should be part of all research studies that involve large datasets. In the 98 primary studies selected for this mapping, 12 statistical tests have been used by 18 studies (Figure 10). The Wilcoxon test has been used by five studies. The Wilcoxon test is a two sample non-parametric test. Such tests are used to evaluate the difference between two samples (for defect prediction studies, a sample is the set of results generated by an algorithm). The drawback of two sample tests is that they do not allow comparison of multiple techniques at once. For this reason, we need k sample tests, such as the Friedman test. The analysis of the 98 primary studies suggests that k sample tests have rarely been used in the literature. The Friedman test has been used in three studies. The t-test, which in its standard form also compares only two samples, has been used in only one study. Two sample tests are better suited for post hoc analysis of results.
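To make the idea of a k sample test concrete, the following Python sketch computes the Friedman chi-square statistic for several techniques evaluated on the same datasets. It is a minimal illustration under stated assumptions: the AUC values are hypothetical and are not taken from the mapped studies, the function names are ours, and the tie correction and p-value computation that a statistics library would provide are omitted.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(scores):
    """scores[d][t] is the score of technique t on dataset d.

    Returns the Friedman chi-square statistic (without tie correction).
    """
    n = len(scores)        # number of datasets (blocks)
    k = len(scores[0])     # number of techniques (treatments)
    rank_sums = [0.0] * k
    for row in scores:
        for t, rank in enumerate(average_ranks(row)):
            rank_sums[t] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Hypothetical AUC values: rows are datasets, columns are three techniques.
hypothetical_auc = [
    [0.71, 0.78, 0.74],
    [0.65, 0.80, 0.70],
    [0.69, 0.77, 0.79],
    [0.60, 0.72, 0.68],
]
print(friedman_statistic(hypothetical_auc))
```

A large statistic suggests that at least one technique ranks consistently differently from the others; a two sample test such as the Wilcoxon test can then be applied to individual pairs as a post hoc step.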

RQ5: What Is the Performance of Various Learning Techniques across Datasets?
In this question, we attempt to approximate the learning ability of various learning techniques.
To answer this question, we follow a systematic approach consisting of the following steps:
a. Identification of performance measure(s) to examine the predictive capability of the proposed models.
b. Identification of datasets to consider for performance validation.
c. Collection of the performance related data corresponding to the identified performance measure(s) for the identified datasets.
d. Comparison of the performance of each technique pair for all datasets identified.
e. Reporting the results.
While a good number of performance measures have been studied in this mapping, we selected the Area under the ROC Curve for performance evaluation, because it is found to suppress the effects of data imbalance on the resultant model. Any dataset that has an unequal distribution of defective and non-defective classes can be considered to be imbalanced [4]. Since the data distribution is skewed in imbalanced datasets, learning from these datasets requires a different approach [4]. ROC curves [8] plot the sensitivity (True Positive Rate) against 1-Specificity (False Positive Rate) of the model across classification thresholds and are, therefore, less affected by data imbalance. For the final analysis, we imposed the following restrictions:

Only those techniques were selected that were used in five or more studies. This ensures that only those techniques are evaluated which have been widely used in the literature. Techniques that are used in fewer studies cannot be compared fairly to techniques that have been evaluated in varied experimental setups. Factors such as researcher bias, selection of test parameters, etc., affect the performance of a technique greatly. Summarizing the results of techniques reported in at least five studies will diminish the impact of such factors.
Only those datasets were considered which have been evaluated using ROC analysis in at least two studies. The objective of this constraint is to eliminate those datasets that have been used only once and whose inclusion could severely distort the overall performance of a learning algorithm.
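The two restrictions above can be sketched as a simple filter. The Python snippet below is only an illustration: the record layout `(study_id, technique, dataset)`, the function name and the sample study identifiers are assumptions made for the example, and each record is assumed to represent one AUC value reported by a study.

```python
from collections import defaultdict


def select_techniques_and_datasets(records, min_technique_studies=5, min_dataset_studies=2):
    """Apply the two inclusion thresholds to (study_id, technique, dataset) records.

    Returns the set of techniques used in enough distinct studies and the set
    of datasets evaluated in enough distinct studies.
    """
    studies_per_technique = defaultdict(set)
    studies_per_dataset = defaultdict(set)
    for study_id, technique, dataset in records:
        studies_per_technique[technique].add(study_id)
        studies_per_dataset[dataset].add(study_id)
    techniques = {t for t, s in studies_per_technique.items() if len(s) >= min_technique_studies}
    datasets = {d for d, s in studies_per_dataset.items() if len(s) >= min_dataset_studies}
    return techniques, datasets


# Hypothetical records: five studies report Random Forest on KC1, one reports KStar on PC1.
records = [(f"S{i}", "Random Forest", "KC1") for i in range(1, 6)] + [("S1", "KStar", "PC1")]
techniques, datasets = select_techniques_and_datasets(records)
print(techniques, datasets)
```

Counting distinct study identifiers, rather than raw records, keeps a single study that reports many results from satisfying a threshold on its own.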
Finally, we assess the performance of 10 learning techniques over 11 datasets (KC1 is used at the method level as well as at the class level). Details of the techniques and datasets are given in Table 9. In summary, for each pair of algorithms, we check the performance of the AUC measure at the dataset level. A comparison is made only when the AUC value is available for both techniques (for a given dataset). Figure 11 provides a detailed description of the performance analysis.
It can be seen that Random Forest is the best performing algorithm. It is outperformed by Discriminant Analysis on two datasets, while against KStar and CART it performs better in all cases. The worst learner in this analysis is the CART algorithm: it is better than SVM in three cases, while BayesNet, Discriminant Analysis and Naïve Bayes reported a better AUC than CART in all comparisons. Not far behind Random Forest is Discriminant Analysis, which has better results than CART and KStar in all comparisons but lags behind Random Forest in comparisons against the other techniques. KStar also performed poorly: it did better than CART, but lost in all other comparisons. The other learning techniques, as can be seen in the graph, had an average performance. The average, maximum and minimum ROC area for the above learning techniques are given in Table 10.

Figure 11. Performance comparison of learning techniques.

Evolutionary computation techniques are learning techniques based on the theory of biological evolution. Techniques of this kind have been successfully used in various software engineering domains, such as Test Case Prioritization and Automated Test Data Generation [90,91]. In the last five years, considerable work has been done on evaluating the applicability of these techniques to the defect prediction problem. The authors in [115] argued that the performance measures of a model can be used as the fitness function for search-based techniques and that these techniques are better suited for software engineering problems, given their superior ability to handle noisy data, better tolerance to missing data and ability to search globally with a lower probability of returning a locally optimal solution.
The most studied techniques on the basis of Table 11 are the Artificial Immune Recognition System (AIRS), Ant Colony Optimization (ACO) and Genetic Programming (GP); all three techniques have been evaluated by two studies. Evolutionary Programming (EP), Evolutionary Subgroup Discovery (ESD), Genetic Algorithm (GA), Gene Expression Programming (GeP) and Particle Swarm Optimization (PSO) have each been studied only once.

To evaluate the performance of a technique, it is important that a good number of independent studies be conducted and the results of all such studies are considered.Out of eight techniques, only three have been used more than once.Hence, for the performance evaluation we consider only these three techniques.

Ant Colony Optimization
In this mapping, two studies have evaluated the performance of ACO. One study uses a derivative of ACO, i.e., AntMiner+, while the other uses the traditional ACO algorithm. Both studies have reported the prediction accuracy of the resultant models. The study using AntMiner+ has a prediction accuracy of 90.87 percent, while the study using conventional ACO has a prediction accuracy of 64 percent.

Artificial Immune Recognition System
The AIRS technique reports an average recall of 74 percent in SE22 on the KC1 dataset, which is very good, while SE57 reports a recall of 82 percent on telecommunication software.

Genetic Programming
In SE57, the GP algorithm has reported a recall of 98 percent, which is very high.
It is evident that search-based techniques have a good predictive capability and that this domain should be explored further. Algorithms like Artificial Bee Colony (ABC) have been used widely, and in most cases the ABC algorithm has produced better results than other search-based techniques. It is necessary that the performance of the ABC algorithm be evaluated for defect prediction.

RQ7: How Many DeP Studies Attempt to Predict Security Related Defects and Vulnerabilities?
To the best of the authors' knowledge, the identification of security related defects has not been discussed in any systematic mapping. In this study, we estimate the number of studies which take steps in this direction. We try to visit every aspect of security related defects in our response to this question. It was found that only 3% of the studies (three studies) have tried to address security related defects and vulnerabilities.
In Table 12 we present the different datasets that have been used in these three studies. Out of three, two studies use datasets provided by the Mozilla Foundation, one study uses data given by Red Hat and one study uses data from an unspecified source. Table 13 gives the details about the learning methods used in the three studies, and it can be seen that all three studies have used statistical techniques. There is no use of machine learning or search-based approaches for security related defect prediction. Table 14 shows the various statistical tests that are used to evaluate the results of studies working on security related defects and system vulnerabilities. Only one out of the three studies used statistical tests.

Severity is a very important attribute of a defect. When a defect is detected in a software system, the severity assigned to it helps project managers identify the level of impact it can have on the system. A DeP model that can estimate the severity of a possible defect is very helpful to practitioners, because it makes them aware of the modules that can have defects of high, moderate and low severity. Ideally, modules with high severity defects should be tested first, then modules with moderate severity defects and finally modules with low severity defects. Out of the total 98 studies, only four studies (SE20, SE41, SE50, and SE62) have attempted to build DeP models that identify multiple severity levels.

RQ9: How Many DeP Studies Perform Cross-Project DeP?
Cross-project defect prediction is a very important aspect of DeP in software. There are software organizations that do not have enough historical data that can be used for DeP in current projects. Cross-project DeP uses historical data from different projects of a similar nature, and this data can be used by software organizations lacking historical data to build and train DeP models. These trained models can then be used to identify defect prone parts of running projects. Cross-project defect prediction means that the DeP model is trained on one set of data and validated on data obtained from different software. Essentially, this is what makes a DeP model applicable in a real-life situation. A DeP model is useful only if it can predict defects in future projects, but to apply a DeP model to future projects it must be trained on past project data. To test the effectiveness of such a model and to simulate this real-life application scenario, cross-project defect prediction models are constructed. These models are trained on data from project X and tested on data from project Y. In the literature, most of the studies use the same data for training and validation, using various types of validation techniques like hold-out, k-fold cross-validation, etc. Only six studies (SE46, SE54, SE56, SE84, SE85, and SE96) of the total 98 studies have attempted to build cross-project DeP models.
It is important that a good amount of future work is done on cross-project DeP. The work should cover a diverse range of software projects so that the applicability of a DeP model to projects on which it is not trained is thoroughly evaluated.
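The train-on-X, test-on-Y setup described above can be sketched in a few lines of Python. This is a deliberately simple illustration, not a method used by any of the mapped studies: the model is a one-metric threshold rule, and the metric values and labels for both projects are hypothetical.

```python
def train_stump(rows):
    """Learn a threshold on a single metric from (metric_value, is_defective) rows.

    The rule is 'predict defective when metric_value >= threshold'; the
    threshold with the highest training accuracy is returned.
    """
    best_threshold, best_correct = None, -1
    for threshold in sorted({value for value, _ in rows}):
        correct = sum((value >= threshold) == label for value, label in rows)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def recall(rows, threshold):
    """Fraction of truly defective modules that the threshold rule flags."""
    defective = [value for value, label in rows if label]
    if not defective:
        return 0.0
    return sum(value >= threshold for value in defective) / len(defective)


# Hypothetical data: project X supplies the training data, project Y is unseen.
project_x = [(2, False), (3, False), (8, True), (9, True)]
project_y = [(1, False), (7, True), (10, True), (12, True)]

threshold = train_stump(project_x)                    # learned on project X only
print(threshold, recall(project_y, threshold))        # evaluated on project Y only
```

The point of the sketch is the separation of the two projects: no row of project Y influences the learned threshold, so the reported recall reflects how well the rule transfers.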

Conclusions
In this systematic mapping, we address nine research questions corresponding to different stages of development of a DeP model. We explored each aspect of the process, ranging from data collection, data pre-processing and techniques used to build DeP models to measures used to evaluate model performance and statistical evaluation schemes used to mathematically validate the results of a DeP model. We also looked into the existing literature to find out how many studies address security related defects and software vulnerabilities, how many cross-project DeP studies have been conducted and how many studies address defects of varying severity levels. To achieve these goals, we carried out an elaborate search on six digital libraries and collected 156 studies, which were then processed using carefully designed inclusion-exclusion criteria and a quality analysis questionnaire. Out of the total 156 studies, we selected 98 studies for addressing the nine research questions formed for this systematic mapping.
The research questions in this systematic mapping were constructed by taking into account the following definition of an ideal DeP model: "An ideal DeP model should be able to classify defects on the basis of severity, should detect security-related defects and system vulnerabilities and should have the ability to detect defects on systems on which it was not trained".
Results of this study showed that the existing literature has covered some parts of the DeP process fairly well; for example, the sizes of the datasets used are large, and nearly all machine learning methods have been examined. But when taking into account the overall approach and effectiveness of DeP studies, there are a lot of shortcomings. Studies have not made much use of multi-co-linearity analysis techniques and only half of the studies selected for mapping have used feature sub-selection techniques. There are only three studies that perform DeP for security related defects and system vulnerabilities, only 6% of the total studies perform cross-project DeP and only four studies have worked on defects of varying severity levels. The number of studies that use statistical tests for model comparison is also limited at 23 studies. Overall, we conclude that although the literature available for software defect prediction has high-quality work, it lacks in the overall methodology that is used to construct defect prediction models.
The following future guidelines are provided on the basis of the results of this study:
1. Datasets used for DeP studies should undergo thorough pre-processing that includes multi-co-linearity analysis and feature sub-selection.
2. Most of the research in defect prediction involves data obtained from open source software systems. Few studies use industrial datasets. It is important that industrial data be used for building defect prediction models so that the models can be generalized.
3. Future studies should make extensive comparisons between search-based techniques, machine learning techniques and statistical techniques. This mapping found that a few search-based techniques like AIRS and GP have good performance in predicting defects, but the number of studies that support this finding is very small. Machine learning and statistical techniques, on the other hand, have been evaluated extensively. To reach conclusions it is important that search-based techniques are also extensively evaluated [173-187].
4. Since the amount of data is large, it is necessary that statistical evaluation of the performance of resultant DeP models be performed to reach conclusions that are backed by mathematical principles.
5. Security is an important aspect of any software system. It is necessary that researchers explore the applicability of DeP models specifically to security related defects and vulnerabilities.
6. Severity is a basic component of any defect detected in a software system. If the testing resources are limited in number, then it makes sense to test those modules first that have a possibility of carrying high severity defects. It is essential that more studies that address the severity of defects be conducted.
7. For organizations looking to start new projects, it is important that they can make use of training data from existing DeP models. In reality, a DeP model is useful only if it can apply the rules learned from project A to project B. To explore the possibility of this application, cross-project DeP models should be built, and the focus should be on identification of metrics that provide maximum information about defects.

Figure 3 shows the distribution of DeP studies from the year 1995 to 2018. It shows that the research in DeP picked up in the year 2005, when five studies were conducted. Out of the total 98 studies selected for this mapping, only 12 were conducted till 2004. In 2007, 2008, 2009, 2010, 2011, 2012 and 2013 the number of DeP studies conducted is 11, 15, 10, 15, 9, 12, and 5, respectively. These account for 77 out of the selected 98 studies. For 2018, we could find only one quality publication till date, but the data for 2018 should not be considered complete. One of the reasons for the surge in DeP studies after 2005 could be the availability of public domain data on the PROMISE repository.

Figure 3. Distribution of DeP studies on a year-by-year basis.

Figure 4. Model building techniques used in DeP studies.

3.3. RQ2: What Are the Different Data Pre-Processing Techniques Used in DeP Models?
Research question #2 summarizes all the techniques that deal with data pre-processing. This study answers this question in two parts: the first part summarizes all methods that are used to analyze the degree of multi-co-linearity among features, and the second part deals with the techniques used to select relevant features from the dataset.

3.3.1. What Are the Different Techniques Used for Multi-Co-Linearity Analysis in DeP Studies?

Figure 5. Multi-co-linearity analysis procedures applied to DeP studies.

3.3.2. What Are the Different Techniques Used for Feature Sub-Selection in DeP Studies?
In this section, we determine the number of studies that state the use of feature sub-selection techniques and which feature sub-selection techniques are widely used. It is important that we conduct feature sub-selection on the input data before it is supplied to the learning algorithm, because the data could contain redundant and irrelevant features. Out of the 98 studies selected for this systematic mapping, there are 44 studies that use feature sub-selection methods, i.e., close to half of the studies (44 of 98, about 45 percent). Figure 6 shows the distribution of feature sub-selection techniques across all 44 studies. The most frequently used technique for feature selection is correlation-based feature selection (CFS); 19 studies make use of CFS. After CFS, the most used technique is Principal Component Analysis (PCA). It should be noted that while the number of studies that apply multi-co-linearity analysis is only seven (7.14%), the number of studies that apply feature sub-selection is 44 (about 45%).
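As a rough illustration of why feature sub-selection removes redundant and irrelevant features, the Python sketch below applies a simple correlation filter. It is not the CFS algorithm itself, which scores whole feature subsets with a merit heuristic; it is a simplified greedy filter, and the thresholds, function names and metric values are assumptions made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(features, labels, min_relevance=0.3, max_redundancy=0.9):
    """Greedy filter over features given as {name: list of values}.

    Features are visited in order of decreasing |correlation with the label|.
    A feature is kept if it is relevant enough and not too strongly
    correlated with any feature that was already kept.
    """
    ranked = sorted(features, key=lambda name: abs(pearson(features[name], labels)), reverse=True)
    selected = []
    for name in ranked:
        if abs(pearson(features[name], labels)) < min_relevance:
            continue  # irrelevant: carries little information about defects
        if all(abs(pearson(features[name], features[kept])) <= max_redundancy for kept in selected):
            selected.append(name)  # relevant and not redundant
    return selected


# Hypothetical metric values for four modules; labels mark defective modules.
labels = [0, 0, 1, 1]
features = {
    "wmc": [1, 2, 8, 9],
    "loc": [1, 2, 8, 10],    # nearly duplicates wmc, so it is redundant
    "rfc": [1, 3, 2, 4],
    "noise": [5, 1, 1, 5],   # uncorrelated with the labels
}
print(select_features(features, labels))
```

In this hypothetical data, one metric is dropped for redundancy and one for irrelevance, which mirrors the two reasons given above for performing feature sub-selection before training.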

Figure 6. Feature sub-selection procedures applied to DeP studies.

3.4. RQ3: What Is the Kind of Data Used in DeP Studies?

Figure 7. Sources of data used across DeP studies.

Figure 8. Significant predictors of software defects.

3.4.3. Which Metrics Are Found to Be Insignificant Predictors for DeP?

Figure 9. Performance measures used for DeP studies.

3.5.2. What Are the Different Statistical Tests of Significance Used in DeP Studies?

Figure 10. Statistical tests used by DeP studies.
The relation given below summarizes the comparison procedure:
for each pair (a, b), where a and b are learning techniques and a ≠ b:
    compare AUC(a) and AUC(b) on each dataset for which both values are available
    if AUC(a) > AUC(b), record 1; otherwise record 0
calculate sum(1) and sum(0) as 'better_than' and 'not_better_than'
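The comparison procedure can be written as a short Python sketch. The function name and the AUC values below are hypothetical and serve only to illustrate the counting; they are not results from the mapped studies. A pair is compared only on datasets for which both techniques have an AUC value.

```python
from itertools import permutations


def pairwise_comparison(auc):
    """Count wins for every ordered pair of techniques.

    auc[technique][dataset] holds an AUC value. For each ordered pair (a, b)
    with a != b, returns (better_than, not_better_than) counted over the
    datasets on which both techniques report an AUC.
    """
    result = {}
    for a, b in permutations(auc, 2):
        shared = auc[a].keys() & auc[b].keys()
        better_than = sum(1 for dataset in shared if auc[a][dataset] > auc[b][dataset])
        result[(a, b)] = (better_than, len(shared) - better_than)
    return result


# Hypothetical AUC values per technique and dataset.
hypothetical_auc = {
    "Random Forest": {"KC1": 0.80, "PC1": 0.75},
    "CART": {"KC1": 0.70, "PC1": 0.78, "JM1": 0.60},
    "Naive Bayes": {"KC1": 0.82},
}
for pair, counts in pairwise_comparison(hypothetical_auc).items():
    print(pair, counts)
```

Because ties are recorded as 0, the 'not_better_than' count for (a, b) includes datasets on which the two techniques have equal AUC.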

Table 2. Quality analysis questions.

Table 3. Quality assessment questions.

Table 4. Scores assigned to quality questions.

Table 6. Quality scores of primary studies obtaining at least 13 points.
3.1.2. Publication Source
After quality analysis, 98 studies are left, and 10 studies are discarded. Table 7 provides the details of the studies published in journals and conferences of high repute. It can be noted that IEEE Transactions on Software Engineering has the highest percentage of selected studies (11.22%), followed by the Journal of Systems and Software (10.20%).

Table 7. Summary of top publications.

Table 8. Studies applying multi-co-linearity procedures.

Table 9. Datasets and learning techniques evaluated in RQ8.

Table 10. Search-based techniques used in the literature.

Table 11. Search-based techniques used in literature.

Table 12. Primary study used to answer research questions.

Table 13. Primary studies used to answer research questions.
ML: Machine Learning; ST: Statistical Technique; SB: Search-Based.

Table 14. Primary studies used to answer research questions.