Empirical Analysis of Rank Aggregation-Based Multi-Filter Feature Selection Methods in Software Defect Prediction

Abstract: Selecting the most suitable filter method that will produce a subset of features with the best performance remains an open problem, known as the filter rank selection problem. A viable solution to this problem is to independently apply a mixture of filter methods and evaluate the results. This study proposes novel rank aggregation-based multi-filter feature selection (FS) methods to address the high dimensionality and filter rank selection problems in software defect prediction (SDP). The proposed methods combine the rank lists generated by individual filter methods into a single aggregated rank list using rank aggregation mechanisms. The proposed methods aim to resolve the filter selection problem by using multiple filter methods of diverse computational characteristics to produce a disjoint and complete feature rank list superior to the individual filter rank methods. The effectiveness of the proposed methods was evaluated with Decision Tree (DT) and Naïve Bayes (NB) models on defect datasets from the NASA repository. From the experimental results, the proposed methods had a superior positive impact on the prediction performances of the NB and DT models compared with the other experimented FS methods. This makes the combination of filter rank methods a viable solution to the filter rank selection problem and the enhancement of prediction models in SDP.


Introduction
The software development lifecycle (SDLC) is a defined process specifically created for the development of reliable and high-quality software systems. The stages embedded in the SDLC, such as requirement gathering, requirement analysis, system design, system development and maintenance, are stepwise and must be strictly observed to ensure a timely and efficient software system [1][2][3]. Nonetheless, human errors or mistakes are inevitable even though most of the stages in the SDLC are conducted by professionals. In recent times, these errors tend to be more profound as modern software systems are intrinsically large, with co-dependent components and modules. Consequently, these errors, if not fixed immediately, will bring about defective software systems and ultimately software failure. In other words, having defects in software systems will lead to degraded and unreliable software systems. In addition, software failures can generate dissatisfaction from end-users and stakeholders alike, as failed software does not meet user requirements after resources (time and effort) have been expended [4,5]. Hence, it is imperative to detect and fix these defects as early as possible. The main contributions of this study are as follows:

1. Development of novel rank aggregation-based multi-filter feature selection methods.

2. Empirical evaluation and analysis of the performance of rank aggregation-based multi-filter feature selection methods in SDP.

Related Works
High dimensionality has been regarded as a data quality problem that dampens the prediction efficacy of models in SDP. That is, the presence of irrelevant and redundant software features, resulting from the proliferation of software features (metrics) used to characterize the reliability and quality of a software system, harms the effectiveness of SDP models. Existing studies use FS methods to tackle the high dimensionality problem by culling only the important features. Hence, many studies have proposed and developed diverse FS methods for SDP.
Cynthia et al. [32] experimented on the effect of FS methods on SDP models using various evaluation metrics. They concluded that the selection of important features from a dataset can positively improve the prediction performance of models while substantially reducing the training time, and that filter-based feature selection (FFS) had the best impact in their study. Nonetheless, their study was limited in the types of filter-based feature selection methods considered.
Balogun et al. [22], in their study, investigated the impact of FS methods on models in SDP based on the applied search methods. The performances of eighteen FS methods using four classifiers were analyzed. Their findings also support the notion of using FS methods in SDP; however, the respective effect of FS methods on SDP varies across datasets and applied classifiers. Additionally, they posited that filter-based feature selection methods had more stable accuracy values than the other studied FS methods. Nonetheless, the filter method selection problem still lingers, as the performance of filter-based FS methods depends on the dataset and classifier used for the SDP process.
In another study, Balogun et al. [23] conducted an extensive empirical study on the impact of FS methods on SDP models, motivated by contradictions and inconsistencies in existing studies as highlighted by Ghotra et al. [20] and Xu et al. [24]. From their experimental results, they further established that the efficacy of FS methods depends on the dataset and classifier deployed; hence, there is no best FS method. This further underscores the filter selection problem, as each filter-based FS method works differently.
Jia [33] proposed a hybrid FS method for SDP based on combining the strengths of three filter methods (chi-squared, information gain and correlation). The average ranking of each feature was deduced from the respective rank lists, and the TopK features were selected. Their experimental results showed that models based on the hybrid FS method were better than models built from the individual filter methods. Nonetheless, the effectiveness of averaging the rank lists of features can be affected by skewed ranks of individual features [34]. Besides, selecting an arbitrary TopK of features may not be the best method, as useful features may be omitted during the selection process [32].
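The average-rank scheme just described, and its sensitivity to a single skewed rank, can be illustrated with a small sketch. The feature names and rank values below are made up for illustration (rank 1 = most relevant); this is not the paper's or Jia's actual implementation:

```python
# Hypothetical rank lists from three filters for four features.
ranks = {
    "f1": [1, 1, 9],   # ranked best by two filters, but one skewed rank
    "f2": [3, 3, 3],
    "f3": [2, 4, 2],
    "f4": [4, 2, 4],
}

def top_k_by_average_rank(ranks, k):
    """Average each feature's ranks across filters, keep the k smallest."""
    avg = {f: sum(r) / len(r) for f, r in ranks.items()}
    return sorted(avg, key=avg.get)[:k]

print(top_k_by_average_rank(ranks, 2))
```

Here `f1` is ranked best by two of the three filters, yet the single skewed rank pushes its average to 3.67 and drops it from the Top-2 selection, which is exactly the weakness noted in [34].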
Wang et al. [35] investigated ensembles of FS methods in SDP to solve the filter selection problem. Seventeen ensemble methods were implemented using eighteen different filter-based FS methods. The ensemble methods were based on averaging the ranks of features from the individual rank lists. From their experimental results, they reported the superiority of the ensemble approaches. However, as with Jia [33], averaging the rank lists of features can be affected by skewed ranks of individual features.
Xia et al. [36] hybridized ReliefF and correlation analysis for the selection of features in metric-based SDP. The proposed method (ReliefF-Lc) checks correlation and redundancy between modules concurrently. From their experimental results, ReliefF-Lc outperformed the other experimented methods (IG and REF). Additionally, Malik et al. [37] conducted an empirical comparative study on the use of an attribute rank method. In particular, the applicability of principal component analysis (PCA) with the ranker search method as a filter FS method was investigated. They concluded that applying PCA with the ranker search method in the SDP process can improve the effectiveness of classifiers in SDP. Although their findings cannot be generalized due to the limited scope of their study, they coincide with existing SDP studies on the application of FS methods in SDP.
Iqbal and Aftab [30] developed an SDP framework using multi-filter FS and a multilayer perceptron (MLP). In addition, a random over-sampling (ROS) technique was integrated to address the inherent class imbalance problem. The proposed multi-filter was developed using correlation-based feature selection (CFS) with four different search methods. From their experimental results, they concluded that the multi-filter method with ROS outperforms the other experimented methods.
Consequently, FS methods are efficient in reducing the features of a dataset and amplifying the efficiency of models in SDP. Notwithstanding, selecting an appropriate filter-based FS method remains an open problem. Hence, this study presents an empirical analysis of the impact of rank aggregation-based multi-filter FS methods on the prediction performances of SDP models.

Methodology
In this section, the classification algorithms, filter-based FS methods, proposed rank aggregation-based multi-filter, experimental framework, software defect datasets and performance evaluation metrics are presented and discussed.

Classification Algorithms
Decision Tree (DT) and Naïve Bayes (NB) algorithms are used to fit the baseline prediction models in this study. DT and NB have been widely implemented in numerous existing studies with satisfactory prediction capabilities. Besides, findings have shown that DT and NB work well with class imbalance [22,38]. Table 1 presents the parameter settings of the DT and NB algorithms as used in this study.

Chi-Square (CS)

The Chi-square (CS) filter method is a statistics-based FS method that tests the independence of a feature from the class label by generating a score that quantifies the level of dependence. The higher the generated score, the stronger the dependence between a feature and the class label. CS can be mathematically represented as:

$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$

where $O_{ij}$ is the observed frequency of instances having the $i$-th feature value and the $j$-th class label, and $E_{ij}$ is the frequency expected under independence of the feature and the class.
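As a concrete sketch of this scoring (our own minimal implementation for discrete features, not the paper's code), the chi-square score can be computed from observed and expected co-occurrence frequencies:

```python
from collections import Counter

def chi_square_score(feature, labels):
    """Chi-square relevance score of a discrete feature w.r.t. the class
    label: higher means stronger feature/class dependence.
    Illustrative sketch; the helper name is ours, not from the paper."""
    n = len(labels)
    f_counts = Counter(feature)            # marginal counts per feature value
    c_counts = Counter(labels)             # marginal counts per class label
    cell = Counter(zip(feature, labels))   # observed joint counts
    score = 0.0
    for f_val in f_counts:
        for c_val in c_counts:
            observed = cell[(f_val, c_val)]
            expected = f_counts[f_val] * c_counts[c_val] / n
            score += (observed - expected) ** 2 / expected
    return score
```

A feature perfectly aligned with the class label scores highly (for a binary feature and binary class, the score equals the number of instances), while a feature independent of the class scores near zero.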

ReliefF (REF)
The ReliefF (REF) filter method deploys a sampling method on a given dataset and then locates the nearest neighbors from the same and alternate classes. The features of each sampled instance are compared with those of its neighborhood, and a relevance score is subsequently assigned to each feature. REF is an instance-based FS method that can be applied to noisy and incomplete datasets, and it can ascertain dependencies amongst features with low bias.
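A minimal sketch of the neighborhood-based scoring idea follows. This is a simplified Relief for binary classes with a single nearest hit and miss; the full ReliefF used in the paper additionally samples instances, averages over k neighbors and handles multi-class data, but the core reward/penalty idea is the same:

```python
def relief_scores(X, y):
    """Simplified Relief: features that differ from the nearest miss
    (other class) and agree with the nearest hit (same class) receive
    higher weights. Sketch only; not the paper's ReliefF implementation."""
    n, d = len(X), len(X[0])

    def manhattan(a, b):
        return sum(abs(u - v) for u, v in zip(a, b))

    weights = [0.0] * d
    for i in range(n):
        same = [j for j in range(n) if j != i and y[j] == y[i]]
        diff = [j for j in range(n) if y[j] != y[i]]
        hit = min(same, key=lambda j: manhattan(X[i], X[j]))
        miss = min(diff, key=lambda j: manhattan(X[i], X[j]))
        for f in range(d):
            weights[f] += abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])
    return [w / n for w in weights]
```

For a feature that separates the classes, hit differences are small and miss differences large, so its weight grows; an uninformative constant feature stays at zero.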

Information Gain (IG)
The Information Gain (IG) filter method selects relevant features by reducing the uncertainty in identifying the class label, based on the information theory mechanism. IG assesses and culls the top features before the training process commences. The entropy of a variable (say $X$) is defined as:

$H(X) = -\sum_{i} P(x_i) \log_2 P(x_i)$

where $P(x_i)$ represents the prior probabilities of $X$. The entropy of $X$ given another variable $Y$ is:

$H(X|Y) = -\sum_{j} P(y_j) \sum_{i} P(x_i|y_j) \log_2 P(x_i|y_j)$

Hence, the information gain is the amount by which the entropy of $X$ decreases, reflecting the additional information about $X$ provided by $Y$:

$IG(X|Y) = H(X) - H(X|Y)$
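The three quantities above can be computed directly. A small sketch for discrete features (the helper names are ours, not from the paper): here $X$ is the class label and $Y$ the feature, matching $IG(X|Y) = H(X) - H(X|Y)$:

```python
import math
from collections import Counter

def entropy(values):
    """H(X) = -sum_i P(x_i) * log2 P(x_i), estimated from frequencies."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(feature, labels):
    """IG(X|Y) = H(X) - H(X|Y): reduction in class-label entropy
    once the feature value is known."""
    n = len(labels)
    h_cond = 0.0
    for f_val in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == f_val]
        h_cond += (len(subset) / n) * entropy(subset)   # P(y_j) * H(X|y_j)
    return entropy(labels) - h_cond
```

A feature that perfectly determines the class yields IG equal to the full class entropy; a feature independent of the class yields IG of zero.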

Rank Aggregation-Based Multi-Filter Feature Selection (RMFFS) Method
The proposed RMFFS takes into consideration and combines the strengths of individual filter rank methods. The essence of this is to resolve the filter method selection problem by considering multiple rank lists in the generation and subsequent selection of top-ranked features for the prediction process. As depicted in Algorithm 1, the individual rank lists from the CS, REF, and IG filter methods are generated from the given dataset. These individual rank lists are mutually exclusive, as each of the filter methods considered is based on different computational characteristics. This ensures diverse representations of the features to be selected for the prediction process. Thereafter, the generated rank lists are aggregated into a single aggregated rank list using the rank aggregation functions presented in Table 2. Each rank aggregation function combines the individual rank lists by leveraging the relevance score attributed to each feature on the individual rank lists. The minimum and maximum rank functions select, respectively, the minimum and maximum relevance score produced for each feature. The range rank function scores features based on the range value computed from the relevance scores. The arithmetic mean, geometric mean and harmonic mean rank functions combine the individual rank lists by computing the arithmetic mean, geometric mean and harmonic mean, respectively, of the relevance scores attributed to each feature on the individual rank lists. This gives equal representation and consideration to each feature from each rank list. A high relevance score on the aggregated list indicates that a feature is ranked low on the individual rank lists and can therefore be dropped. A novel dynamic and automatic threshold value based on the geometric mean function is then applied to the aggregated list to select relevant features.

The geometric mean of the aggregated relevance scores is computed, and features with an aggregated relevance score less than or equal to the computed threshold value are selected. The geometric mean function considers the dependency amongst the features and the compounding effect in its computation. Finally, the optimal features are selected as the resulting features of the RMFFS method.
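Under the assumption that each filter's rank positions serve as the relevance scores (rank 1 = most relevant, so smaller aggregated values are better, consistent with dropping features whose aggregated score is high), the aggregation and geometric-mean threshold steps can be sketched as follows. The rank lists and feature names below are hypothetical:

```python
import math

# Hypothetical per-filter rank lists (rank 1 = most relevant feature).
rank_lists = {
    "cs":  {"loc": 1, "cc": 2, "halstead": 3, "fan_in": 4},
    "ref": {"loc": 2, "cc": 1, "halstead": 4, "fan_in": 3},
    "ig":  {"loc": 1, "cc": 3, "halstead": 2, "fan_in": 4},
}

def gmean(xs):
    """Geometric mean of positive values."""
    return math.prod(xs) ** (1.0 / len(xs))

def rmffs(rank_lists, aggregator=gmean):
    """Aggregate per-filter ranks per feature, then keep features whose
    aggregated score is <= the geometric mean of all aggregated scores
    (the dynamic threshold described above). Sketch, not the paper's code."""
    features = next(iter(rank_lists.values())).keys()
    aggregated = {f: aggregator([r[f] for r in rank_lists.values()])
                  for f in features}
    threshold = gmean(list(aggregated.values()))
    return sorted(f for f, score in aggregated.items() if score <= threshold)

print(rmffs(rank_lists))
```

With these ranks, `loc` and `cc` aggregate to roughly 1.26 and 1.82, the threshold is about 2.21, and only those two features survive; swapping in `min`, `max` or a mean-based aggregator changes the aggregated list, mirroring Table 2.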
Algorithm 1. Rank Aggregation-Based Multi-Filter Feature Selection (RMFFS).

1.  Input: defect dataset D; filter rank methods F = {CS, REF, IG}
2.  for each filter rank method i in F {
3.      Generate rank list R_i
4.  }
5.  Generate aggregated rank list A using the aggregator functions (Table 2)
6.  Compute threshold t as the geometric mean of the aggregated relevance scores in A
7.  Select the feature subset P_t with aggregated relevance scores less than or equal to t
8.  return P_t

Table 2. Rank Aggregation Methods.

Aggregators	Formula	Description
Min	min(r_1, ..., r_k)	Selects the minimum of the relevance scores produced by the aggregated rank list
Max	max(r_1, ..., r_k)	Selects the maximum of the relevance scores produced by the aggregated rank list
Range	max(r_i) - min(r_i)	Selects the range of the relevance scores produced by the aggregated rank list
Mean	(1/k) Σ r_i	Selects the arithmetic mean of the relevance scores produced by the aggregated rank list
GMean	(Π r_i)^(1/k)	Selects the geometric mean of the relevance scores produced by the aggregated rank list
HMean	k / Σ (1/r_i)	Selects the harmonic mean of the relevance scores produced by the aggregated rank list

where r_1, ..., r_k are the relevance scores assigned to a feature by the k individual filter methods.

Table 3 presents the software defect datasets used for training and testing the SDP models in this study. These datasets are culled from the NASA repository and have been widely used in SDP. Specifically, the cleaned versions of the NASA datasets were used in the experimentation [39,40]. Table 3 shows a description of the selected datasets with their respective numbers of features and instances.

Performance Evaluation Metrics
For the evaluation of the ensuing SDP models, Accuracy, F-Measure, and Area under the Curve (AUC) were selected. These evaluation metrics have been widely used and proven to be reliable in SDP studies [7,41,42].

i. Accuracy is the percentage of correctly predicted instances out of the total number of instances:

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

ii. F-Measure is the weighted harmonic mean of the test's precision and recall:

$F\text{-}Measure = \frac{2 \times Precision \times Recall}{Precision + Recall}$

iii. AUC is the area under the receiver operating characteristic (ROC) curve, which measures how well a model discriminates defective from non-defective modules across classification thresholds.
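The first two metrics follow directly from the confusion-matrix counts (TP, TN, FP, FN); a minimal sketch:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all instances predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, with TP = 40, TN = 50, FP = 5, FN = 5, accuracy is 0.9 and, since precision equals recall here (40/45), the F-Measure is also 40/45.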

Experimental Framework
The experimental framework of this study as depicted in Figure 1 is presented and discussed in this section.
To assess the effects of the proposed RMFFS methods on the prediction performances of SDP models, the software defect datasets (see Table 3) were used to build SDP models based on the NB and DT classifiers (see Table 1). Different scenarios were experimented with to obtain an unbiased and standard performance comparison of the ensuing SDP models.
• Scenario 1 considered the application of the baseline classification algorithms (NB and DT) on the original defect datasets. In this case, NB and DT are trained and tested with the original defect datasets to determine the prediction performances of the baseline classifiers. Experimental results and findings based on the aforementioned scenarios are used to answer the following research questions.
• RQ1. How effective are the proposed RMFFS methods compared to individual filter FS methods?
• RQ2. Which of the RMFFS methods had the highest positive impact on the prediction performance of SDP models?
The ensuing SDP models from each scenario are developed and evaluated based on the 10-fold cross-validation (CV) technique. The 10-fold CV technique is applied to avoid data variability problems and to produce SDP models with low bias and variance [43][44][45][46]. Besides, CV techniques have been widely used in many existing studies, with SDP being no exception. The prediction performances of the ensuing models from each scenario are measured using the selected performance metrics (see Section 3.5) and then analyzed and compared. All experiments are carried out using the WEKA machine learning tool [47].
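The fold construction behind k-fold CV can be sketched as follows (index bookkeeping only; classifier training and evaluation, handled here by WEKA, are omitted):

```python
def kfold_indices(n, k=10):
    """Yield (train, test) index lists for k-fold cross-validation.
    Every instance appears in exactly one test fold, so each model is
    always evaluated on data it was not trained on."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

With n = 20 instances and k = 10, this produces ten disjoint test folds of two instances each, and each corresponding training set holds the remaining eighteen.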

Results and Discussion
In this section, the experimental results based on the experimental framework illustrated in Figure 1 are presented and discussed.

Concerning prediction performance based on AUC values, Figure 3 presents the boxplot representations of the NB and DT AUC values. Improved AUC values were observed on models with the applied FS methods, in line with existing studies [20,[22][23][24]. However, it can be deduced that the efficacy of the FFS methods (IG, REF, and CS) varies across datasets and depends on the choice of prediction model. Based on the no free lunch theorem, since there is no overall best FFS method, selecting an appropriate FFS method for SDP becomes crucial; hence, the filter selection problem. Consequently, this observation further supports the aim of this study: the development of multi-filter FS methods for SDP.
As presented in Figures 2-4, the proposed RMFFS methods (Mean, Min, Max, Range, GMean, HMean) not only had a superior positive impact on the NB and DT models but also had a better positive impact than the individual CS, IG and REF FS methods. In particular, Tables 4 and 5 present the prediction performances (average accuracy, average AUC, and average f-measure) of the NB and DT models with the proposed RMFFS methods and the individual FFS methods, respectively. Furthermore, the Scott-KnottESD statistical rank test, a mean comparison approach that uses hierarchical clustering to separate mean values into statistically distinct clusters with non-negligible mean differences, was conducted [48,49] to test for statistically significant differences in the mean values of the experimented methods. Models with the same color are in the same category, and there are no statistically significant differences amongst them; models with different color indications are statistically significantly different from the other methods. Figures 5-7 show the Scott-KnottESD rank test of the experimented FS methods on the NB and DT models based on average accuracy, average AUC and average f-measure values, respectively. Table 6 summarizes the statistical rank tests of the FS methods on the NB and DT models.

From Figure 5A, concerning average accuracy values, NB models with the HMean, Min, GMean, Max, Mean, and Range-based RMFFS methods and NB + REF fall into the same category. This set of models is superior in performance, and there are statistically significant differences between their means and those of the other NB models. The NB + CS and NB + IG models rank second, while the NB model with no FS method ranks third. Additionally, from Figure 5B, DT models with GMean, Mean, HMean, Max, Min, DT + CS, Range, and DT + IG are statistically superior and rank higher than DT + REF and the DT model with no FS method. The ordering of models from the statistical rank test is also important: models which appear first (from left to right) are superior to the other models regardless of their groupings.
Similarly, based on average AUC values as presented in Figure 6, NB models with Min, HMean, GMean, Max, and Mean-based RMFFS rank first, while NB + REF, NB + IG, NB + CS and Range-based RMFFS rank second, and NB with no FS method ranks last. DT models with GMean, Min, Max, HMean, and Mean-based RMFFS rank first, Range-based RMFFS and DT + CS rank second, DT + IG ranks third, DT with no FS method ranks fourth, and DT + REF ranks fifth.
Lastly, Figure 7 presents the Scott-Knott rank test results using average f-measure values. NB models with HMean, GMean, Min, NB + CS, Max, and Mean-based RMFFS are superior and rank first, while NB models based on Range RMFFS, NB + IG and NB + REF rank second, and NB with no FS method comes last. DT models with GMean, Mean, HMean, Min and Max-based RMFFS rank first; DT with no FS method, DT + CS, Range-based RMFFS and DT + IG rank second; while DT + REF ranks third. Table 6 summarizes the analysis of the Scott-Knott rank test of the experimented FS methods on the NB and DT models.
In summary, from the experimental results and statistical tests, the proposed RMFFS methods recorded a superior positive impact on the prediction performances of the SDP models (NB and DT) compared with the individual FFS (IG, REF and CS) methods on the studied defect datasets, with the GMean-based RMFFS method outperforming all experimented FS methods. These findings, therefore, answer RQ1 and RQ2 (see Section 3.6) as presented in Table 7. Additionally, the effectiveness of RMFFS addresses the filter selection problem by combining the strengths of individual filter FS methods in SDP. Hence, combining filter methods (multi-filter) is recommended as a viable option to harness the strengths of the respective FFS methods and the capabilities of filter-filter relationships in selecting germane features during the FS process, as conducted in this study.

Research Questions Answers
RQ1. How effective are the proposed RMFFS methods compared to individual filter FS methods?
The proposed RMFFS outperforms individual FFS methods with significant differences.
RQ2. Which of RMFFS methods had the highest positive impact on the prediction performance of SDP models?
GMean-based RMFFS method was superior to other RMFFS methods.

Conclusions
This study addresses the high dimensionality and filter selection problems in software defect prediction by proposing novel rank aggregation-based multi-filter feature selection (RMFFS) methods. The selection of an appropriate filter rank method is often a hard choice, as the performance of filter methods depends on the datasets and classifiers used. Consequently, RMFFS combines the individual rank lists generated by independent FFS methods from the software defect dataset into a single rank list based on rank aggregation methods. Additionally, a geometric mean function was used to automatically select the top-ranked features from the aggregated list. For assessment, the features generated by RMFFS and the other experimented filter methods (IG, CS, REF and no FS method) were applied with the NB and DT classifiers on software defect datasets from the NASA repository. Analysis of the experimental results showed the effectiveness and superiority of the RMFFS methods, as they had a superior positive impact on the prediction performances of the NB and DT classifiers compared with the other experimented FS methods in most cases. That is, the proposed RMFFS was able to generate a more stable and complete subset of features that best represents the studied datasets. Hence, this makes the combination of individual filter rank methods a viable solution to the filter rank selection problem and the enhancement of prediction models in SDP. From a broader perspective, the findings from this study can be used by experts and researchers in SDP and other applicable research domains that require FS methods as a means of addressing the high dimensionality and filter selection problems.
As future work, we intend to explore and extend the scope of this study by investigating other ensemble configurations of FS methods with more prediction models. Additionally, the effect of threshold values on the efficacies of FFS methods is worth investigating, as the adequate threshold value relies, to an extent, on the dataset used.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.