Software Defect Prediction Using Heterogeneous Ensemble Classification Based on Segmented Patterns

Abstract: Software defect prediction is a promising approach aiming to improve software quality and testing efficiency by providing timely identification of defect-prone software modules before the actual testing process begins. These prediction results help software developers to effectively allocate their limited resources to the modules that are more prone to defects. In this paper, a hybrid heterogeneous ensemble approach is proposed for software defect prediction. Heterogeneous ensembles consist of a set of classifiers built from different base learning methods, each with its own strengths and weaknesses. The main idea of the proposed approach is to develop expert and robust heterogeneous classification models. Two versions of the proposed approach are developed and evaluated. The first is based on simple classifiers, and the second is based on ensemble ones. For evaluation, 21 publicly available benchmark datasets are selected to conduct the experiments and benchmark the proposed approach. The evaluation results show the superiority of the ensemble version over other well-regarded basic and ensemble classifiers.


Introduction
Individuals and society increasingly rely on advanced software systems. Because software is intertwined with all aspects of our lives, it is essential to produce reliable and trustworthy systems economically and quickly. In order to ensure the desired software quality at a lower cost, much effort has been invested in software reliability and software quality assurance (SQA) [1,2]. With limited resources, however, this effort is increasingly being challenged by the rapid growth in size and complexity of today's software. Defective software modules increase development and maintenance costs and cause customer dissatisfaction [3,4].
Software defect prediction is one of the SQA activities that aims to automatically predict fault-prone software modules using historical software information from an earlier deployment or identical objects, for example source code edit logs [5] and bug reports [6], before the actual testing process begins. Effective defect prediction could help test managers locate bugs and facilitate the allocation of limited SQA resources optimally and economically; thus, it has become an extremely important research topic [7-12]. Commonly, a prediction model is used to predict the defective software modules in one of three categories: binary class classification of defects [13-16], number of defects/defect density prediction [17-20], and severity of defect prediction [21,22]. Among them, binary class classification is the most frequently used type of prediction scheme, where software modules having one or more defects are marked as defective and modules having zero defects are marked as non-defective. In this type of defect prediction scheme, researchers have explored the use of various classification techniques, including statistical techniques such as Naïve Bayes (NB) [23] and Logistic Regression [24]; supervised techniques such as Decision Tree (DT) [25], Support Vector Machine (SVM) [26], ensemble methods [16,27-29], and Case-Based Reasoning [30]; semi-supervised techniques such as Expectation Maximization (EM) [31]; and unsupervised techniques such as K-means clustering [32] and Fuzzy clustering [33]. Most of the studies in the literature have used statistical and supervised learning techniques [34].
Although a large number of studies have been conducted to build and evaluate defect prediction models using different classification techniques in the context of binary class classification, the prediction accuracy of defect prediction techniques is still found to be considerably low, with a high misclassification rate [26,34-36]. Looking at these results, one questions the dependability of these techniques for software defect prediction [34,37]. Therefore, it is important to design more advanced techniques that improve the performance of defect prediction models [34,38].
In this work, a hybrid heterogeneous ensemble approach is proposed for improving the accuracy of software defect prediction. The core idea of this approach is to develop expert and robust classification models of different natures based on groups of similar points. In other words, the classification models are of different machine learning types, such as lazy classifiers, decision trees, Naïve Bayes, and ensembles, while similar points refer to groups of points that are as close as possible according to a similarity measure such as the Euclidean distance. These groups of data are generated using a clustering stage. Unlike most previous works that build general models for all data, this work aims to develop several expert models based on the characteristics of the data. Two versions of the proposed approach are developed and evaluated. The first is based on simple classifiers (i.e., k-Nearest Neighbour (k-NN), NB, and DT), and the second is based on ensemble ones (i.e., Bagging, Adaptive Boosting (AdaBoost), Random Forest (RF), and XGBoost (XGB)). Extensive experiments based on 21 well-known benchmark datasets are conducted to evaluate the proposed approach.
The remainder of this article is organized as follows: The next section presents related work on defect prediction. The preliminaries of the algorithms utilized in the proposed approach are given in Section 3. Section 4 presents the proposed hybrid heterogeneous ensemble approach for software defect prediction. Section 5 discusses the model evaluation metrics, and Section 6 presents the benchmark datasets specifications. Section 7 is devoted to the benchmarking experiments and discusses their respective results. Finally, Section 8 draws conclusions and describes promising directions for future work.

Related Work
During the last two decades, the software defect prediction problem has become a noteworthy research topic, increasingly catching the interest of researchers. A software defect prediction model can be used to classify software modules into defective or non-defective (binary class classification), to predict the number of defects in a software module, or to predict the severity of the defects. In the context of binary class classification, hundreds of different defect prediction models have been published. To build these models, researchers have used various classification techniques such as Logistic Regression [24], NB [23], SVM [26], ANN [39], Genetic Programming [40], Ant Colony Optimization [14], Particle Swarm Optimization [41], RF [42], Case-Based Reasoning [30], DT [25], ensemble methods [16,28,29,43,44], EM [31], Fuzzy clustering [33], K-means clustering [32], Association Rule Mining [45], and Artificial Immune Systems [46,47].
Using these techniques, researchers have applied several statistical and machine learning methods to build fault-proneness prediction models and reduce software development and maintenance costs. Among them, machine learning techniques are the most popular [1]. The majority of software defect prediction techniques build models using metrics and fault data from an earlier deployment or identical objects and then use the models to predict whether the modules presently under development contain defects; this is called a supervised learning approach [7]. Among the supervised learning techniques, ANN is one of the most popular, having received a great deal of attention [39,48,49]. It should be pointed out that the ANN technique has some drawbacks when applied to software defect prediction, the most important being the difficulty of determining the best neural network architecture for each application domain [49]. In contrast, there are other approaches, for example clustering [33], which do not use previous data; these are called unsupervised learning approaches. It is worth pointing out that some researchers, for example [50], classify software defect prediction techniques into descriptive and predictive techniques.
The usage of machine learning algorithms has increased in the last decade, and machine learning remains one of the most popular methods for defect prediction [51,52]. Challagulla et al. [53] conducted an empirical assessment to evaluate the performance of various machine learning techniques and statistical models for predicting software quality. Their experiments on four real-time software defect datasets using different predictor models revealed that the 1R rule-based classification learning algorithm and Instance-based learning, along with the Consistency-based subset evaluation technique, are more consistent in achieving accurate predictions than other models. Based on their results, the authors presented a high-level design of an intelligent software defect analysis tool for defect assessment and dynamic monitoring of software modules. Catal and Diri [54] investigated the effects of data size, metrics, and feature selection techniques on software defect prediction. Nine classifiers were examined to explore which classifier performs best before and after applying feature reduction. They showed that NB is the best prediction algorithm for small datasets, while Random Forests gives the best prediction performance for large datasets. Kaur and Pallavi [55] discussed the utilization of numerous machine learning approaches in software defect prediction, for example association mining, classification, and clustering, but did not provide a comparative performance analysis of the techniques. Kumar and Gopal [56] proposed a binary classifier referred to as LSTSVM, the Least Squares variant of the Twin Support Vector Machine. Their experiments showed that LSTSVM has classification accuracy comparable to the Twin Support Vector Machine (TSVM) but with considerably less computational time. Agrawal and Tumar [57] proposed a feature selection approach based on the LSTSVM model for software defect prediction.
A comparative analysis of various classification approaches against four PROMISE datasets showed the superiority of the proposed predictive model over other models, i.e., SVM and DT, in three datasets. Tumar and Agrawal [58] later developed a software defect prediction system using a weighted LSTSVM to consider the misclassification cost of defective software modules. A comparison was performed between the proposed approach and nine existing approaches using different performance measures. The results on eight datasets demonstrated the effectiveness of the proposed approach. Shukla and Verma [59] reviewed and analysed various literature studies in the defect prediction area, investigated recent advancements, and drew various conclusions. Dwivedi and Singh [60] analysed and compared various data mining classification and prediction techniques, such as NN, NB, and k-NN, for software defect prediction models. The results showed that NN can outperform the other two classifiers with an average accuracy of 91.54%. Wang et al. [8] proposed to leverage directly learned semantic features to build machine learning models for predicting defects. The results on ten open source projects showed that the semantic features automatically learned using a Deep Belief Network (DBN) improved within-project defect prediction on average by 14.7% in precision, 11.5% in recall, and 14.2% in F1. To reduce the complexity of metric selection and defect prediction, Huda et al. [61] proposed a framework for finding significant metrics to build and evaluate an automated software defect prediction model, using a hybrid combination of wrapper and filter techniques. Experimental results on eight NASA software datasets showed that the proposed hybrid approaches can select the most significant metrics with high prediction accuracy compared with conventional wrapper or filter approaches in some of the datasets.
The highest accuracy achieved by the hybrid approach was almost 91% at different subsets of metrics. Recently, Bowes et al. [62] performed a sensitivity analysis of the prediction uncertainty produced by four different classifiers. Their results showed that classifier ensembles with decision-making strategies not based on majority voting are likely to perform best. Zhou et al. [38] proposed a new deep forest model to build the defect prediction model (DPDF). Their results on 25 open source projects from four public datasets showed that the DPDF increased the AUC value by 5% compared with the best traditional machine learning algorithms.

Preliminaries
In this section, we briefly describe each of the algorithms utilized in the proposed approach.

NB
NB is a statistical probability-based classifier based on Bayes' theorem. NB is a family of algorithms based on a common principle, which assumes that all of the predictors are equally important and independent of each other [63]. In other words, when the class variable is given, it assumes the presence or absence of a particular feature is not related to the presence or absence of any other feature [64]. Instead of a simple classification, NB reports the probability of an instance belonging to each individual class. In our case, the class with the highest posterior probability is the outcome of prediction, which indicates whether a software module is defective or non-defective.
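As a minimal sketch (not the paper's code), the following uses scikit-learn's GaussianNB on invented module metrics to show how the class with the highest posterior probability becomes the prediction:

```python
# Hedged illustration: Gaussian Naive Bayes on invented "module metric" data.
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy feature vectors (e.g., lines of code, complexity); label 1 = defective
X = np.array([[10, 1], [12, 2], [15, 1], [180, 8], [190, 10], [200, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB().fit(X, y)
posteriors = model.predict_proba([[170, 9]])[0]  # P(class | features) per class
prediction = int(np.argmax(posteriors))          # class with the highest posterior
```

Note that `predict_proba` exposes the per-class posteriors directly, which is the quantity the paragraph above describes.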

k-NN
k-NN is an instance-based learning method that classifies instances within a dataset by assigning the label of the closest neighbours to each new pattern during the testing phase. If the instances are tagged with a classification label, then the majority class of the closest k neighbours is assigned to the unclassified instance. Although the power of k-NN has been proven in a number of real domains, it has large storage requirements and its performance is sensitive to the choice of k.
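A hedged sketch of the majority-vote rule with k = 3, again on invented data:

```python
# Hedged illustration: k-NN (k = 3) assigns the majority class of the
# three nearest neighbours; data values are invented.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[10, 1], [12, 2], [15, 1], [180, 8], [190, 10], [200, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
# The three nearest neighbours of [185, 9] are all defective modules
prediction = int(knn.predict([[185, 9]])[0])
```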

DT
DT is a logic-based learning method that classifies instances by sorting them based on feature values. The main idea underlying DT for classification tasks is the recursive partition of the data space; thus, a DT can be equivalently expressed as a set of rules. DT utilizes a tree-like data structure where each node in the tree represents a feature in an instance to be classified, whereas each branch represents a value that the node can assume [65]. The classification of instances starts at the root node, and instances are sorted based on their feature values. The most well-known algorithm in the literature for building trees is C4.5, an extension of the ID3 algorithm. Although DT can effectively deal with nonlinear relationships, it is sensitive to noisy data and may also lead to overfitting.
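The equivalence between a tree and a rule set can be made concrete with a small sketch (invented data; scikit-learn's CART rather than C4.5):

```python
# Hedged illustration: a decision tree printed as its equivalent rule set.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[10, 1], [12, 2], [15, 1], [180, 8], [190, 10], [200, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
rules = export_text(tree, feature_names=["loc", "complexity"])  # rule view
prediction = int(tree.predict([[185, 9]])[0])
```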

AdaBoost
AdaBoost is a widely used boosting algorithm that constructs an ensemble by performing multiple iterations, each time with different instance weights, and adjusts adaptively to the errors returned by classifiers from previous iterations [66,67]. Changing the weights of training instances in each iteration forces the learning algorithms to put more emphasis on instances that were incorrectly classified previously and less emphasis on instances that were correctly classified previously. In other words, weights of misclassified instances are increased, whereas weights of correctly classified instances are decreased. This ensures that misclassification errors for these misclassified instances count more heavily in the next iterations. AdaBoost combines the predictions of multiple weak classifiers and gives a final prediction through a combined weighted voting scheme. Weak classifiers, as originally defined by Freund and Schapire, are classifiers that perform slightly better than random guessing [68].
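A hedged sketch of boosting with weak learners (scikit-learn's default base learner is a depth-1 decision stump; data are invented):

```python
# Hedged illustration: AdaBoost over decision stumps. Between iterations,
# the weights of misclassified training instances are increased internally.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

X = np.array([[10, 1], [12, 2], [15, 1], [180, 8], [190, 10], [200, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

boost = AdaBoostClassifier(n_estimators=25, random_state=0).fit(X, y)
prediction = int(boost.predict([[185, 9]])[0])  # weighted vote of the stumps
```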

Bagging
Bagging is an ensemble technique that is used to improve the stability and accuracy of machine learning algorithms by combining the predictions of multiple weak classifiers [69]. Bagging works better for unstable learning algorithms, where a little change in the training set results in large changes in predictions (e.g., ANN, DT). Bagging predicts an outcome several times from different training sets, and the predictions are combined either by voting or by uniform averaging [70]. To describe the bagging algorithm, consider a dataset with N instances and a binary class label. The following steps summarize the Bagging algorithm:

1. Generate a random training set of size N by sampling with replacement from the data.
2. Train a classifier on the random training set using any classification technique.
3. Assign a class to each node.
4. Repeat steps 1 to 3 many times.
5. Use voting to predict the class label.
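The five steps above can be sketched from scratch (invented data; decision trees as the base learner):

```python
# Hedged from-scratch sketch of the bagging steps listed above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = np.array([[10, 1], [12, 2], [15, 1], [180, 8], [190, 10], [200, 9]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
N = len(X)

models = []
for _ in range(15):                                   # step 4: repeat
    idx = rng.integers(0, N, size=N)                  # step 1: sample N with replacement
    m = DecisionTreeClassifier().fit(X[idx], y[idx])  # steps 2-3: train, label nodes
    models.append(m)

votes = [int(m.predict([[185, 9]])[0]) for m in models]
prediction = max(set(votes), key=votes.count)         # step 5: majority vote
```

In practice, scikit-learn's `BaggingClassifier` packages these steps behind a single `fit`/`predict` interface.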

RF
The RF classifier is a special case of Bagging consisting of a collection of tree-structured classifiers. RF selects random features in order to create bootstrap models using decision trees [71]. To do so, it builds a forest of multiple decision trees by selecting data and variables randomly. A subset of instances is chosen randomly from the selected attributes and given to the learning algorithm. The forest selects the classification that has the most votes over all the trees in the forest. RF relies on aggregating the output of many trees that are grown with little individual tuning or pruning, so that the errors of the individual trees cancel out when aggregated and lead to a more accurate prediction. Randomization in RF appears in two places:

1. Each tree is trained using a random sample drawn with replacement from the training set.
2. When training individual trees, a random subset of features is used when searching for splits.

The randomization reduces the correlations among trees, which improves the predictive performance.
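The two randomization sources map directly onto estimator parameters; a hedged sketch on invented data:

```python
# Hedged illustration of RF's two randomization sources.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[10, 1], [12, 2], [15, 1], [180, 8], [190, 10], [200, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

rf = RandomForestClassifier(
    n_estimators=50,
    bootstrap=True,       # 1. each tree sees a random sample with replacement
    max_features="sqrt",  # 2. a random feature subset is searched at each split
    random_state=0,
).fit(X, y)
prediction = int(rf.predict([[185, 9]])[0])  # majority vote over the forest
```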

XGB
XGB is a decision-tree-based supervised learning algorithm that implements a process called Gradient Boosting to construct an ensemble learner [72]. XGB optimises a collection of weak decision tree learners to build an accurate and reliable predictor, a decision tree ensemble, which uses the output of the weak learners in the final prediction. XGB improves upon the base Gradient Boosting Machines (GBMs) framework through algorithmic enhancements (i.e., regularization, sparsity awareness, and a weighted quantile sketch) and software and hardware optimization techniques (i.e., parallelization and tree pruning). These improvements yield superior results using fewer computing resources and less time.
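XGBoost itself is used through the `xgboost` package (its `XGBClassifier` follows the same fit/predict convention). As a dependency-light sketch of the underlying gradient boosting idea that XGB extends, the following uses scikit-learn's GradientBoostingClassifier on invented data:

```python
# Hedged illustration of the base GBM idea: each new tree is fitted to the
# gradient of the loss with respect to the current ensemble's output.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

X = np.array([[10, 1], [12, 2], [15, 1], [180, 8], [190, 10], [200, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

gbm = GradientBoostingClassifier(
    n_estimators=50, learning_rate=0.1, max_depth=2, random_state=0
).fit(X, y)
prediction = int(gbm.predict([[185, 9]])[0])
```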

K-Means Clustering
K-means clustering is one of the most popular unsupervised learning methods. The main goal of K-means is to group similar data instances together and find patterns in the given datasets. To achieve this goal, K-means defines a number of clusters (K) and then groups similar elements into these clusters. It starts by selecting the centroids, which are the starting points of the clusters. In the next step, it assigns the instances to the closest centroid and then updates the positions of the centroids iteratively until the centroids are stabilized or the predefined maximum number of iterations is reached.
Given a dataset of n instances S = {x_1, . . . , x_n} ⊂ R^d and an integer K, the K-means algorithm aims to find the set of centroids C = {c_1, . . . , c_K} that minimizes the following error function:

E(C) = ∑_{i=1}^{n} min_{1≤k≤K} ‖x_i − c_k‖²

As mentioned before, K-means assigns instances to one of the specified clusters according to the similarity between them. To measure this similarity, it usually uses the Euclidean distance between the instance and the centroids.
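A hedged sketch with K = 2 on invented 2-D points, showing centroid-based assignment by Euclidean distance:

```python
# Hedged illustration: K-means with two invented, well-separated groups.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.5, 7.9], [8.2, 8.1]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

labels = km.labels_                           # cluster index per instance
new_label = int(km.predict([[1.1, 1.0]])[0])  # assigned to the nearest centroid
```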

Proposed Approach
In this paper, we propose an approach for software defect classification where models are developed based on clustered patterns. The approach is composed of three main phases: In the first phase, a clustering process is applied to the training data to segment it into groups of similar instances. In the second phase, different classifiers are trained on the groups generated in the first phase. The third phase evaluates the developed models and uses them for predicting unseen instances. These three phases are illustrated in Figure 1 and described in detail in the following three subsections.

Clustering Phase
The clustering phase is the first phase of the hybrid algorithm. The idea is to perform a preprocessing step that prepares the data for developing the classification algorithms. The data is split into two parts: a training part, which is the only part used in this stage for clustering, and a testing part, which is used to evaluate the performance of the trained models. During this preprocessing step, we start by clustering the training data into a predefined number of clusters. Any clustering technique can be used, but in our work one of the most popular clustering techniques, the k-means algorithm, is used.

Models Development Phase
After segmenting the training data into a set of clusters, the next step is to develop a classification model for each cluster. To do that, several classification algorithms are trained and evaluated on each cluster using the cross-validation methodology. The goal is to find the most suitable and expert model for each cluster. For example, suppose we have three classification algorithms called X, Y, and Z. All algorithms will be trained and evaluated on each cluster. If algorithm Y produced the highest average accuracy over the cross-validation process on a given cluster, then Y will be assigned to this cluster for future predictions because it showed higher prediction power than algorithms X and Z. It is important to note that when a cluster contains only one class, the classifier works as a one-class classification algorithm: it trains on one class in the training phase and detects the other class in the testing phase as an outlier. After finishing this phase, each cluster will have its own expert model. Note that the best classifier can differ from one cluster to another. Figure 2 gives an example of this phase with three different classifiers trained on three clusters, and it shows how the classifiers in the final model are selected.
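The per-cluster selection described above can be sketched as follows. This is a hedged illustration, not the paper's implementation: the data, seeds, candidate classifiers, and two-cluster setup are all invented, and the one-class-cluster case is omitted for brevity.

```python
# Hedged sketch: cluster the training data, then keep, for each cluster,
# the candidate classifier with the best cross-validation score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Two natural groups of modules; within each, the label depends on feature 1
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(10, 1, (20, 2))])
y = (X[:, 1] > np.where(X[:, 0] < 5, 0.0, 10.0)).astype(int)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
experts = {}
for c in range(2):
    Xc, yc = X[km.labels_ == c], y[km.labels_ == c]
    candidates = [GaussianNB(), KNeighborsClassifier(n_neighbors=3),
                  DecisionTreeClassifier(random_state=0)]
    scores = [cross_val_score(m, Xc, yc, cv=2).mean() for m in candidates]
    experts[c] = candidates[int(np.argmax(scores))].fit(Xc, yc)  # expert per cluster
```

The `experts` mapping is what the testing phase consults: each incoming instance is routed to the expert of its nearest cluster.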
In this work, two types of classifiers are implemented to produce two versions of the proposed approach. In the first version, basic classifiers are used. This version will be referred to as K-Means/Basic classifiers (KMB). In the second version, ensemble classifiers will be utilized. The latter version will be referred to as K-Means/Ensemble classifiers (KME).

Testing Phase
In the testing phase, we are concerned with the testing data generated in the first phase. For each instance in the testing data, we must specify to which cluster it belongs by calculating the distance between the instance and each centroid of the clusters. As a result, the instance will belong to the closest (most similar) cluster, and it will be given to the model that was assigned to that cluster in the training phase for the final prediction. To determine the similarity, we use the Euclidean distance between the testing instance I and a centroid C, which can be defined as follows:

dist(I, C) = √( ∑_{j=1}^{d} (I_j − C_j)² )

where d is the number of input features in the dataset. After classifying all instances in the testing data, we can compare the predictions against the actual class values to evaluate the performance of the hybrid algorithm. The procedure is summarized in Algorithm 1: for each testing instance I, calculate the distance between I and each centroid C_j, find the closest C_j and its corresponding expert model f_j, and set Prediction[I] = f_j(I).
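The dispatch step can be sketched directly; the centroids and expert models below are invented stand-ins:

```python
# Hedged sketch of the testing-phase dispatch: a new instance is routed to
# the expert model of its nearest centroid.
import numpy as np

def predict_instance(instance, centroids, experts):
    # Euclidean distance to each centroid: sqrt(sum_j (I_j - C_j)^2)
    dists = np.linalg.norm(centroids - instance, axis=1)
    closest = int(np.argmin(dists))    # most similar cluster
    return experts[closest](instance)  # expert assigned in the training phase

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
experts = {0: lambda i: "non-defective", 1: lambda i: "defective"}
result = predict_instance(np.array([9.0, 8.5]), centroids, experts)
```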

Model Evaluation Metrics
To evaluate the proposed software defect prediction model, we refer to the confusion matrix shown in Table 1, which is the primary source for accuracy estimation in classification problems. Based on this confusion matrix, the following criteria are used for evaluation:

1. Recall: the fraction of relevant instances that are retrieved over the total number of relevant instances (i.e., the coverage rate). It can be expressed by the following equation:

   Recall = TP / (TP + FN)

2. Precision: the ratio of relevant instances among the retrieved instances. It can be given by the following equation:

   Precision = TP / (TP + FP)

3. G-mean: the geometric mean of the recalls of the two classes. It can be measured by the following equation:

   G-mean = √( (TP / (TP + FN)) × (TN / (TN + FP)) )
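The three metrics follow directly from the confusion-matrix counts; the values below are invented for illustration:

```python
# Hedged illustration: recall, precision, and G-mean from TP/FP/FN/TN counts.
import math

def defect_metrics(tp, fp, fn, tn):
    recall = tp / (tp + fn)                  # coverage of the defective class
    precision = tp / (tp + fp)               # purity of predicted defectives
    recall_neg = tn / (tn + fp)              # recall of the non-defective class
    g_mean = math.sqrt(recall * recall_neg)  # geometric mean of the recalls
    return recall, precision, g_mean

recall, precision, g_mean = defect_metrics(tp=9, fp=1, fn=1, tn=9)
```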

Datasets Description
To facilitate the replication and verification of our experiments, the proposed approach is applied to a series of 21 well-studied, publicly available software defect benchmark datasets with various attributes and instances. Eleven of the studied datasets are obtained from the NASA corpus, while ten come from the PROMISE software engineering corpus [73]. The NASA corpus, however, is known to be noisy [74,75]. To avoid the effect of such noisy data on the results of our experiments, we use the cleaned version of the NASA corpus as provided by [74], which is available online (https://figshare.com/articles/MDP_data_sets_D_and_D_-_zipped_up/6071675). The NASA datasets were collected by NASA from real software projects in different domains and comprise software modules developed in several different programming languages, including C, C++, and Java, with various scales of lines of code and various types of software metrics. For instance, in the cleaned version of the NASA corpus, the JM1 dataset consists of 7782 instances (1672 defective/6110 defect-free), where each instance includes a total of 22 attributes: five different lines-of-code measures, three McCabe metrics, four base Halstead measures, eight derived Halstead measures, one branch count, and one decision attribute that indicates whether a particular instance is defective or non-defective. The PROMISE datasets were collected from open source software projects developed in a variety of settings (e.g., Apache, GNU), which provide different metrics than the NASA corpus does. Table 2 shows information and some general statistics for each dataset.

Experiments and Results
The experiments were executed 30 independent times, and the averages of the results were calculated. The experiments are conducted in three steps:

• The best number of clusters is determined for each dataset, and the best model for each cluster is found.
• The proposed approach is evaluated using simple and common classifiers (NB, k-NN, and DT).
• The proposed approach is evaluated using powerful ensemble classifiers (Bagging, AdaBoost, RF, and XGB).

For k-NN, the number of neighbours is set to 3, as this value showed the best performance on the training data compared to other values (i.e., 1, 5, 7, and 9). For the ensemble classifiers, the selected base classifier is a decision tree, and the ensemble size is set to 100. These settings yield the best performance on the training data with the least computational effort.

Finding the Best Number of Clusters and Their Corresponding Models
The number of clusters in our final model is determined based on the G-mean results of the training phase. Cross-validation approaches can help avoid overfitting in model selection [76]. Therefore, the training process is conducted with two-fold cross-validation to avoid overfitting. Four settings are examined to determine the required number of clusters: 3, 5, 7, or 9. The number of clusters that yields the highest G-mean value is selected for the final model in the testing phase. Tables 3 and 4 show the G-mean results of the cross-validation training phase of our approach based on the basic classifiers and the ensemble ones, respectively. As can be seen in the tables, the best number of clusters varies from one dataset to another. This confirms that the software defect benchmark datasets vary in their nature, where different groups can be identified, and these groups contain a number of similar patterns. Tables 3 and 4 also show the distribution of classes in each cluster. It is important to note here that sometimes the clustering process produces clusters that contain only one class. In such cases, any new instance that is closest to the center of a one-class cluster is simply given the same class as the cluster.
For each cluster that results from the previous step, the best performing classifier is assigned. To demonstrate the best performing models for KMB and KME, Figures 3 and 4 show the frequency of the best models over all datasets. In the case of KMB, we can see that DT is the most frequent classifier in most cases, followed by NB and k-NN. In the case of KME, Bagging is the most frequent model, followed by XGB, RF, and AdaBoost, respectively. This supports the idea that there is no dominating classifier for all data patterns, and each group of similar patterns needs a model that fits its particular characteristics.

KMB vs. Basic Classifiers
In this part of the experiments, we verify the performance of the KMB version of our proposed approach by evaluating it on the 21 benchmark datasets and comparing it with the basic classifiers NB, k-NN, and DT.
Because all of the datasets are highly imbalanced, considering the accuracy ratio for evaluation is misleading. Therefore, other metrics (i.e., recall, precision, and G-mean) should be examined.
The precision, recall, and G-mean values are shown in Table 5. According to the results, we can see that KMB achieves the best or very competitive precision and recall values for most of the datasets. The results of the G-mean evaluation measure reveal that KMB has the best performance in 18 of the 21 datasets, followed by NB and k-NN, respectively, where NB achieved the best results in only two datasets and k-NN in one dataset.
For better visualization of the results, radar plots (Figures 5-7) are given for KMB and the basic classifiers.

Table 5. Precision, Recall, and G-mean for the basic classifiers and KMB.

KME vs. Ensemble Classifiers
Here we experiment the KME version of the proposed algorithm, which combines ensemble classifiers instead of simple classifiers in an attempt to boost the predictive power of the approach.
For precision and recall, Table 6 shows that KME dominates the top rates, especially with regard to precision. Considering the G-mean results in Table 6, we can see that KME achieves the best results in 19 datasets out of 21. For better visualization of the obtained results, radar plots (Figures 8-10) are given for KME and the ensemble classifiers. Comparing the performance of KMB and KME in terms of G-mean, Table 7 shows that KMB achieves better results in 14 datasets out of 21, indicating that it is not necessary to apply KME in all cases. This could be explained by the fact that KMB is itself a type of ensemble classification that combines weak classifiers, which could prevent overfitting, unlike KME, which combines powerful classifiers.

Table 6. Precision, Recall, and G-mean for the ensemble classifiers and KME.

The boxplots for KMB and KME are presented in Figures 11 and 12. These boxplots report the G-mean of 30 independent runs for all datasets. The figures show that KMB and KME are very competitive and stable on most of the datasets. It can be noted that for a few datasets, KMB and KME exhibit more sensitive performance than on the other datasets. Examples of these datasets are cm1, mc1, mw1, and pc2. This sensitivity can be attributed to the high imbalance ratio in these datasets; that is, misclassifying one instance from the rare class will highly impact the G-mean measure. For the jedit-4.3 dataset, the single classifier approach failed to classify the rare instances, therefore its recall is 0 and consequently its G-mean is 0. On the other hand, for the log4j-1.2 dataset, the performance of KMB and KME was worse than that of the other classifiers, which could be attributed to the clustering step producing clusters that are harder for KMB and KME to classify.
Although the proposed approaches have been compared with the best traditional machine learning algorithms, we also compare the proposed KME with the results of one of the most recent sophisticated approaches, defect prediction based on deep forest (DPDF) [38]. Specifically, we used the 13 datasets common to their published results and our experiments to compare the two approaches in terms of precision and recall, as shown in Table 8. The results show that KME outperforms DPDF on all of these datasets except pc2 and tomcat-6.0.

Statistical Test
The Friedman test, a nonparametric statistical test for multiple group measures, is used to test the null hypothesis that the multiple groups come from the same distribution at a given significance level; rejecting the null hypothesis indicates that they differ. We analyse the performance of the algorithms using the Friedman test in SPSS, running the test 21 times on the different datasets. For each experiment, H0 states that there is no difference in the performance of the algorithms, and H1 states that there is a difference. We reject H0 for p < α, where α = 0.05 is the significance level used in this hypothesis testing. The results of the Friedman test are shown in Table 9. From the results, we can see that the mean ranks differ considerably in favour of the KME and KMB algorithms in almost all the experiments. The Chi-Square statistic summarizes in a single number how differently the algorithms were rated. The degrees of freedom in our experiments are 9 (algorithms) − 1 = 8. Since the p-value (Asymp. Sig.) is below 0.05 for all the experiments, there is a significant difference in the performance of the algorithms, and we reject the null hypothesis of equal population distributions. Moreover, the table illustrates which algorithm was ranked best versus worst. In other words, the Friedman test indicates that the algorithms were rated differently, with Chi-Square values whose p-values are below 0.0001 for all experiments.

Conclusions
In this paper, a hybrid classification approach for software defect prediction was proposed. The main idea of this approach was to develop expert and robust classification models based on groups of similar patterns. Two versions were developed and experimented on. The first was based on simple classifiers, whereas the second was based on ensemble ones. After extensive experiments based on 21 well-known benchmark datasets, the evaluation results showed that the ensemble version of the proposed approach can significantly boost the prediction power compared to the other ensemble and basic classifiers in most of the datasets. The reason for this superior performance is that the proposed approach develops models that fit specific patterns that have similar behaviours.
For future work, two areas could be researched for improvement. The first is to explore more advanced clustering algorithms, and the second is to investigate techniques that can automatically determine the best number of clusters for each dataset.

Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.