Dissimilarity Space Based Multi-Source Cross-Project Defect Prediction

Software defect prediction is an important means of guaranteeing software quality. Because a project often lacks sufficient historical data to train a classifier, cross-project defect prediction (CPDP) has been recognized as a fundamental approach. However, traditional defect prediction methods represent samples by their feature attributes, which cannot avoid negative transfer and may lead to poorly performing CPDP models. This paper proposes a multi-source cross-project defect prediction method based on dissimilarity space (DM-CPDP). The method not only retains the original information but also captures each sample's relationship with other objects, so it can enhance the discriminant ability of the sample attributes with respect to the class label. The method first uses a density-based clustering method to construct the prototype set from the cluster centers of the samples in the target set. Then, the arc-cosine kernel is used to calculate the sample dissimilarities between the prototype set and the source domain or the target set to form the dissimilarity space. In this space, the training set is obtained with the earth mover's distance (EMD) method. The unlabeled samples converted from the target set are labeled with the k-Nearest Neighbor (KNN) algorithm. Finally, the model is learned from the training data with the TrAdaBoost method and used to predict new potential defects. The experimental results show that this approach performs better than other traditional CPDP methods.


Introduction
A defect prediction model helps software testers allocate limited resources to the most error-prone software modules [1], which is important for software quality assurance. Within-project defect prediction (WPDP) builds prediction models from sufficient historical data within the project, and these data follow the same statistical distribution. However, it is difficult to obtain sufficient training data in a new project, and as the company evolves, the previous data may no longer be applicable [2].
Cross-project defect prediction (CPDP) addresses this problem by training prediction models on data from other projects. However, because projects differ in development methods, programming practices, and other aspects, the source and target datasets follow different distributions [3]. Zimmermann et al. pointed out that the handling of data characteristics and processes is the key factor in CPDP [1]. CPDP uses transfer learning to obtain useful information from the source domain whose distribution is similar to that of the target data, which satisfies the assumption that training and test data follow the same distribution [4]. However, selecting source domain datasets randomly may result in negative transfer and poor performance because of low correlation with the target set [5]. Multi-source cross-project defect prediction has been proposed by researchers to reduce the effects of source component shift.
Experts have noticed that a (dis)similarity representation can enhance the expressive ability of samples and improve classifier performance in pattern recognition [6,7], because dissimilarity is a crucial factor in recognition and categorization. In a dissimilarity space, the original attributes of a sample are replaced by the dissimilarities between pairs of samples. This retains the original information of the dataset and gives each sample its relationship with other objects, so it can enhance the discriminant ability of the sample attributes with respect to the class label. The dissimilarity representation method maps samples into the dissimilarity space, for which several representative samples are selected from the target dataset as the prototype set. The samples of the dissimilarity space are calculated from the source domain, the target set, and the prototype set. This method avoids the effects of different measurement standards for the same metric and reduces the dimension for classifiers [7]. How to build the training dataset in the dissimilarity space is critical to improving the performance of the prediction model.
In this paper, a multi-source cross-project defect prediction method based on dissimilarity space (DM-CPDP) is proposed. The DM-CPDP method has three phases. In the dissimilarity space construction phase, we select some representative samples from the target set as the prototype set. Then, the arc-cosine kernel function is used to measure the sample dissimilarity between the prototype set and the source domain or target datasets. In the training dataset selection phase, the earth mover's distance (EMD) method is used to calculate the cost required to convert the data of the source domain, and in the dissimilarity space we select the corresponding samples with a small cost as the training set. In the last phase, we use the KNN method to assign labels to the unlabeled target samples, and the defect prediction model is constructed with the TrAdaBoost method in the dissimilarity space. The contributions of this paper can be summarized as follows:

• To construct the dissimilarity space, we propose a framework that uses the density-based clustering method to select all density center samples of the target set as the prototype set. The arc-cosine kernel function is then used to measure the sample dissimilarity between the prototype set and the source domain or target datasets. The constructed dissimilarity space can represent the relationship between the multi-source domain data and the target data.

• To construct the defect prediction model, we propose the DM-CPDP method. After constructing the dissimilarity space, this method uses the EMD method to select samples in the dissimilarity space as the training dataset. The unlabeled samples converted from the target set are labeled with the KNN algorithm. The TrAdaBoost method is then used to establish the prediction model from the selected training dataset. The DM-CPDP method avoids constructing the defect prediction model in the original feature space. Compared with traditional defect prediction methods, it allows the multi-source domain data to be fully utilized.

• To evaluate the performance of the DM-CPDP method, we compare it with other classic single-source and multi-source CPDP methods, such as TNB [8], TrAdaBoost [9], MsTrA [10], and HYDRA [11]. In addition, we compare the effects of several different dissimilarity measures and prototype selection methods. The necessity of multi-source datasets is also verified. As performance metrics, we use F-measure, AUC (Area Under ROC Curve), and cost-effectiveness. The experimental results show that the DM-CPDP method outperforms the other methods on these metrics.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 describes the details of the DM-CPDP algorithm. Section 4 verifies the performance of the DM-CPDP method and analyzes the experimental results. The final part summarizes the paper.

Related Work
Many researchers have stated that identifying software defects as early as possible has great economic value [12]. Defect prediction establishes a prediction model from software defect metrics at an early stage of defect detection. Traditional defect prediction relies on historical data within the project, but such data are difficult to collect and some new projects have none. Many researchers therefore focus on cross-project defect prediction [9,12,13], and a series of classical CPDP methods have been proposed. The key factor in CPDP is how to obtain, through transfer learning, data with the same distribution as the target set. To achieve positive transfer, CPDP methods based on multi-source datasets have attracted the attention of experts. In addition, Pekalska et al. [14] proposed that an appropriate sample representation can improve the performance of the classifier, and classification in dissimilarity spaces has been shown to achieve better results [15]. Therefore, this paper establishes the prediction model in the dissimilarity space.

Multi-Source CPDP
In recent years, multi-source CPDP methods have begun to be considered as a way to improve the performance of defect prediction [16,17]. Compared with randomly selecting a single dataset from the source domain as the training set, a multi-source CPDP method can avoid negative transfer and over-fitting and makes transfer learning perform better [11]. Many researchers have shown that multi-source cross-project software defect prediction can improve the performance of the prediction model and enhance generalization.
Yao and Doretto [13] proposed two multi-source methods. The first establishes multiple candidate weak classifiers from the source domain datasets in each iteration and selects the one with the lowest error rate as the weak classifier of the current iteration. The second is divided into two stages: first, all candidate weak classifiers are obtained through multiple iterations; then, in each iteration, the candidate with the smallest error rate is used as the current weak classifier.
Yu et al. [10] proposed a multi-source TrAdaBoost (MsTrA) approach to construct the prediction model. This method first calculates the similarity between each candidate source dataset sample and the target samples and assigns a weight to each sample. Then, on the basis of the TrAdaBoost algorithm, the classifier with the smallest error rate is selected as the weak classifier in each iteration, and the prediction model is obtained through multiple iterations.
Xia et al. [11] proposed the HYDRA method, which constructs a multi-source CPDP model in two stages. In the first stage, each candidate source dataset is merged with the target dataset as a training set to construct multiple base classifiers, and a genetic algorithm (GA) is used to find the optimal combination of base classifiers. The second stage is similar to the AdaBoost algorithm, with a GA model generated in each iteration. Finally, a linear combination of the GA models is obtained as the prediction model.
Building on the HYDRA method, He et al. [18] proposed the S³EL method, which constructs base classifiers from feature mean values and then also uses GA to assign weights to the base classifiers. Experiments show that these methods are superior to conventional single-source CPDP methods.
The above methods provide important approaches and ideas for multi-source CPDP research, but they still have some shortcomings. The MsTrA algorithm selects the optimal weak classifier in each iteration, which is prone to over-fitting. The HYDRA and S³EL methods use GA to search for the optimal weight combination of the classifiers in each iteration, so their time complexity is higher. Therefore, this paper focuses on the problems of data noise, generalization, and time complexity.

Dissimilarity Representation
Experts proposed representing an object by its pairwise dissimilarities to other objects and found that this can improve the performance of classifiers [19]. The traditional sample representation uses feature attributes to represent a sample in the feature space. Its disadvantage is that the sample dimension is high, which easily causes the curse of dimensionality. The dissimilarity representation addresses this problem by representing a sample through its dissimilarity to the prototype set selected from the target set, so the dimension of the sample is determined by the number of samples in the prototype set. Studies have shown that the dissimilarity-based sample representation can improve the performance of the defect prediction model: compared with traditional modeling methods, it improves performance by 1.86%-9.39% [20]. In addition, once the dissimilarity space has been constructed, many methods can be used to assign labels to unlabeled samples, so that the model construction problem can be solved with a standard supervised learning method [20].
The dissimilarity representation method mainly involves two factors: prototype set selection and dissimilarity transformation. The representation uses a transformation function to map samples into the dissimilarity space according to the prototype set. The dimension of the space is determined by the number of samples in the prototype set, and the dissimilarity to the j-th prototype sample can be regarded as the j-th attribute in the space. Each attribute of each sample in the space is the dissimilarity between that sample and a prototype sample [7].
There are many methods for selecting the prototype set, such as the nearest neighbor method, the random selection method, and the cluster-based linear programming method. The most conventional method is random selection, but it is uncertain and leads to information loss. Clustering algorithms can solve this problem, but the KNN usually requires a user-defined k value and cannot find all the representative samples. Rodriguez and Laio [21] proposed the density peaks clustering (DPC) method, which uses the density peak points of the dataset to find the cluster centers quickly. Therefore, this paper uses the DPC algorithm to select the representative samples as the prototype set.
For the dissimilarity transformation, it is common to use a distance-based method, such as Euclidean distance, Mahalanobis distance, or Manhattan distance, to measure the relationship between samples and map the sample set into an n × m matrix [15,19]. However, the accuracy of these methods is affected if the measurement units differ. For cosine similarity, when the angle between two vectors is obtuse, the two vectors are irrelevant and the cosine value is negative, so cosine similarity is not suitable for training the prediction model. A kernel method can describe the essential relationship between two vectors and is more accurate than a distance-based method. The arc-cosine kernel [22] represents the relationship between vectors by measuring the angle between them. Compared with the cosine similarity measure, it represents the relationship between vectors more intuitively. Compared with the radial basis function (RBF), it requires no parameter tuning when it is only used to express the difference between vectors.
To improve the performance of the predictive model, the source domain datasets and the target set are used to construct the dissimilarity space, in which each sample is represented by the relationship between the source sample and the prototype set. The prototype set is obtained from the target set.

DM-CPDP Approach
Compared with the traditional approach of building prediction models in the feature space, an alternative is to construct prediction models in the dissimilarity space, in which each sample is described by its pairwise dissimilarity relations to the prototype set. In this paper, a kernel function is used to measure the pairwise dissimilarity between the candidate training data and the prototype set, and the data are then mapped into the dissimilarity space. In this space, we transfer the multi-source datasets and assign labels to unlabeled target samples with a classification method, so that model construction can use a standard supervised learning method. To offset the influence of useless features on the dissimilarity representation of samples, features need to be selected before the dissimilarity transformation. This paper uses the FECAR [23] method for feature selection.
The overall process is shown in Figure 1 and is divided into three phases. The first phase focuses on constructing the dissimilarity space. The density-based clustering method is used to find the cluster centers of the samples in the target set, and these cluster center samples form the prototype set. Then, the arc-cosine kernel is used to calculate the sample dissimilarities between the prototype set and the source domain or the target set to form the dissimilarity space. In the second phase, the earth mover's distance (EMD) method is used to calculate the cost required to convert the data of the source domain, and in the dissimilarity space we select the corresponding samples with a small cost as the training set. In the third phase, the KNN method is used to assign labels to unlabeled target samples before the prediction model is built, and then the TrAdaBoost method is used to construct the prediction model. In addition, the test set is also mapped into the dissimilarity space to obtain the predicted results.
Algorithms 2019, 12, x FOR PEER REVIEW

Dissimilarity Space Construction
To construct a CPDP model in dissimilarity space, we represent each sample by the pairwise dissimilarity between the sample and the prototype set $R$ obtained from the target set $X_T$, where $R = \{p_1, p_2, \ldots, p_r\}$ and $R \subseteq X_T$. In dissimilarity space, the set is represented as:

$$D(X, R) = \{D(x_i, R)\}_{i=1}^{n}, \qquad D(x_i, R) = [k(x_i, p_1), k(x_i, p_2), \ldots, k(x_i, p_r)]$$

Each instance is represented as an r-dimensional dissimilarity vector. $k(x_i, p_j)$ represents the dissimilarity between the i-th sample in dataset $X$ and the j-th prototype sample in prototype set $R$. This paper uses the density-based clustering method to select the cluster centers as the prototype set and uses the arc-cosine function to measure the pairwise dissimilarity between samples.

Prototype Selection
To select representative samples from the target set, we use a clustering method and take the cluster centers as the prototype samples that form the prototype set. Traditional clustering methods require the number of clusters to be defined manually, and they cannot select all representative samples in the dataset, so we use the DPC method to cluster samples in this stage. A cluster center has two characteristics: high density and large distance. Cluster centers are surrounded by neighbors with lower local density and lie relatively far from other high-density points.
In the first step, we calculate the local density $\rho_i$ and the local distance $\delta_i$ for each point.
$\rho_i$ is equal to the number of points that are closer than $d_c$ to point $i$. The local density is calculated as follows:

$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c) \tag{2}$$

where $\chi(x) = 1$ if $x < 0$ and $\chi(x) = 0$ otherwise. The distance $d_{ij}$ between sample points is measured by Euclidean distance. $d_c$ is the cutoff distance, which ensures that each point has at least 2% of the total points as its neighbors, and $d_c > 0$.
The local distance is calculated as follows:

$$\delta_i = \begin{cases} \max_{j} d_{ij}, & \text{if } \rho_i \text{ is the highest local density} \\ \min_{j:\, \rho_j > \rho_i} d_{ij}, & \text{otherwise} \end{cases} \tag{3}$$

The local distance has two cases: when the point $x_i$ has the highest local density, $\delta_i$ is its greatest distance to any other point; otherwise, $\delta_i$ is the distance from $x_i$ to the nearest data point with greater local density.
Then, the cluster centers are selected by considering $\delta_i$ and $\rho_i$ in combination, as shown in Formula (4). Because $\delta_i$ and $\rho_i$ may be of different orders of magnitude, we normalize the two quantities, where $z(\cdot)$ is the normalization:

$$\gamma_i = z(\rho_i) \cdot z(\delta_i) \tag{4}$$

As shown in Figure 2, the points are sorted in descending order of $\gamma_i$ and plotted. The values of the non-cluster centers change smoothly, and there is a clear jump from the non-cluster centers to the cluster centers. The points above the inflection point are chosen as the cluster centers.
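The density-peak scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the min-max form of z(·), the product form of γ, and the parameter `r` (how many top-ranked points to keep, standing in for the visual inflection point of Figure 2) are all assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def min_max(values):
    """Assumed form of z(.): rescale values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def density_peak_scores(points, d_c):
    """Return gamma_i = z(rho_i) * z(delta_i) for every point."""
    n = len(points)
    dist = [[euclidean(p, q) for q in points] for p in points]
    # Local density: number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        # Distances to points with strictly greater local density.
        denser = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # Densest points fall back to their greatest distance to any point.
        delta.append(min(denser) if denser else max(dist[i]))
    return [r * d for r, d in zip(min_max(rho), min_max(delta))]

def select_prototypes(points, d_c, r):
    """Indices of the r points with the largest gamma (hypothetical cut)."""
    gamma = density_peak_scores(points, d_c)
    order = sorted(range(len(points)), key=lambda i: gamma[i], reverse=True)
    return order[:r]
```

In practice the cutoff `d_c` would come from the 2% rule stated in the text, and the number of prototypes from the jump in the sorted γ values rather than from a fixed `r`.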

The method is shown in Algorithm 1. In steps 3-7, we calculate the distances between samples and sort them in ascending order. Then, the value at the 2% position of the sorted distances is selected as the cut-off distance $d_c$ in steps 6-9. In steps 10-14, the local density $\rho_i$ and local distance $\delta_i$ are calculated for each sample, and the cluster center decision factor $\gamma_i$ is calculated from $\delta_i$ and $\rho_i$. Last, we sort the values of $\gamma_i$ in descending order to select the cluster centers as the prototype set.

Algorithm 1 (surviving steps):
13 compute the cluster center decision factor $\gamma_i$ according to Equation (4)
14 end
15 sort $\{\gamma_i\}_{i=1}^{m}$ in descending order
16 select all the cluster centers as the prototype set $R$ according to step 15
17 return $R_{r \times k}$

Dissimilarity Transformation
The novelty of this paper is the use of a kernel function to express the dissimilarity between samples in the dissimilarity space. Kernels can represent a certain relationship between two objects. When the requirement for Mercer kernels is relaxed, more powerful dissimilarity measures can be defined in the domain [24]. To better represent the dissimilarity between vectors, the arc-cosine kernel is used to measure the dissimilarity between samples. Following the analysis of the arc-cosine kernel, we use $k_0(x, y)$, abbreviated as $k(x, y)$, to measure the pairwise dissimilarity:

$$k(x, y) = k_0(x, y) = 1 - \frac{1}{\pi} \arccos\left(\frac{x \cdot y}{\|x\|\,\|y\|}\right) \tag{5}$$

In the transformation process, we first use the arc-cosine kernel function to measure the dissimilarity between each original sample and every sample of the prototype set. These values form the row vector of that sample, and the rows together construct the low-dimensional dissimilarity space. The process is shown in Figure 3 and Equation (6):

$$D(X, R) = \begin{bmatrix} k(x_1, p_1) & k(x_1, p_2) & \cdots & k(x_1, p_r) \\ k(x_2, p_1) & k(x_2, p_2) & \cdots & k(x_2, p_r) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, p_1) & k(x_n, p_2) & \cdots & k(x_n, p_r) \end{bmatrix} \tag{6}$$

The dimension of this space is determined by the number of samples in the prototype set $R$. $k(x_i, p_j)$ represents the dissimilarity between the sample $x_i$ and the prototype sample $p_j$.

Each sample in the dissimilarity space is represented as:

$$x_i' = [k(x_i, p_1), k(x_i, p_2), \ldots, k(x_i, p_r)] \tag{7}$$

The above methods are used to map each source domain dataset and the target dataset into the dissimilarity space. The dissimilarity space construction method is shown in Algorithm 2. Steps 3-9 map a source domain dataset into a subspace. In this stage, $x_i'$ is the dissimilarity representation of the sample $x_i$, obtained by calculating the dissimilarity between each source data sample and the prototype set. Then we add $x_i'$ to the subspace $D(X_{Su}, R)$.
Steps 13-17 map the target set into the dissimilarity space in the same way as the source domain datasets. Finally, the dissimilarity space $D$ is obtained by combining these subspaces.
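The mapping into the dissimilarity space can be illustrated with a short Python sketch. It assumes the degree-0 arc-cosine kernel, k_0(x, y) = 1 - θ/π with θ the angle between x and y; the function names are chosen for this example and do not come from the paper.

```python
import math

def arc_cosine_k0(x, y):
    """Degree-0 arc-cosine kernel: 1 - angle(x, y) / pi, a value in [0, 1]."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("the angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return 1.0 - math.acos(cosine) / math.pi

def to_dissimilarity_space(samples, prototypes):
    """Map each sample to the row [k(x, p_1), ..., k(x, p_r)]."""
    return [[arc_cosine_k0(x, p) for p in prototypes] for x in samples]
```

Each source dataset and the target set would be passed through `to_dissimilarity_space` with the same prototype list, so that all rows share the same r attributes.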

Selection of Multi-Source Datasets
To select the appropriate datasets, the earth mover's distance (EMD) is used to measure the similarity between sets. The EMD is defined as the minimum amount of work required to convert one data distribution into another. For example, assume that $X$ is a warehouse containing $n$ mounds and $R$ is a warehouse with $r$ empty pits. The similarity between the datasets is the minimal cost of moving the $n$ mounds in $X$ into the $r$ pits in $R$.
In this stage, the dataset is represented as $X = \{(x_1, w_{x_1}), \ldots, (x_i, w_{x_i}), \ldots, (x_n, w_{x_n})\}$, where $w_{x_i}$ is the weight of $x_i$. In the EMD method, we regard each sample $x_i$ as a mound with mass $w_{x_i}$. Similarly, the prototype set is represented as $R = \{(p_1, w_{p_1}), \ldots, (p_j, w_{p_j}), \ldots, (p_r, w_{p_r})\}$, where $w_{p_j}$ is the weight of $p_j$, and we regard each sample $p_j$ as an empty pit with volume $w_{p_j}$.
We assume that all samples in the same set have equal mass and that the total mass is equal to the total capacity; these conditions are shown in (8). $d_{ij}$ is the cost of moving $x_i$ to $p_j$, calculated as shown in (9), where $k(x_i, p_j)$ is taken from Equation (7). We take the cost function $D_{XR}$ from $X$ to $R$ as the similarity between the source dataset and the prototype set, that is, the similarity between the source dataset and the target set. The cost function $D_{XR}$ is given in Equation (10).
$f_{ij}$ defines the flow from $x_i$ to $p_j$. The method aims to find an optimal flow $f_{ij}$ that minimizes the overall cost function, subject to the constraints in (11). The dataset selection method is shown in Algorithm 3. In steps 2-6, we calculate the cost of converting each source domain dataset into the same data distribution as the prototype set. In steps 8-10, the values of the cost function $D(X_{Su}, R)$ are sorted in ascending order, so that the first $\alpha$ datasets, those with the smallest cost, are selected as the training set.
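For reference, the textbook formulation of the earth mover's distance as a transportation problem is written out below in the notation of this section. It is given as an assumption about the general shape of Equations (10) and (11); the paper's own equations may differ in detail, for example in how the cost is normalized.

```latex
% Standard EMD between X (masses w_{x_i}) and R (capacities w_{p_j}),
% with ground cost d_{ij} and flow f_{ij}.
\begin{aligned}
\min_{f}\quad & \sum_{i=1}^{n}\sum_{j=1}^{r} f_{ij}\, d_{ij} \\
\text{s.t.}\quad & f_{ij} \ge 0, \qquad 1 \le i \le n,\; 1 \le j \le r, \\
& \sum_{j=1}^{r} f_{ij} \le w_{x_i}, \qquad 1 \le i \le n, \\
& \sum_{i=1}^{n} f_{ij} \le w_{p_j}, \qquad 1 \le j \le r, \\
& \sum_{i=1}^{n}\sum_{j=1}^{r} f_{ij}
  = \min\!\Big(\sum_{i=1}^{n} w_{x_i},\; \sum_{j=1}^{r} w_{p_j}\Big), \\[4pt]
D_{XR} \;=\; & \frac{\sum_{i=1}^{n}\sum_{j=1}^{r} f_{ij}\, d_{ij}}
                    {\sum_{i=1}^{n}\sum_{j=1}^{r} f_{ij}} .
\end{aligned}
```

When the total mass equals the total capacity, as assumed in this section, the two inequality constraints hold with equality and every mound is moved in full.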

Model Construction
Since the values of all sample attributes lie in [0, 1] after the dissimilarity transformation, the KNN method can be used without being affected by noise in the dissimilarity space. Each attribute of a sample represents the degree of dissimilarity between the original sample and a prototype sample, and the more similar the sample is to the prototype, the greater the value. The KNN method is therefore suitable for assigning labels in this space. It finds the k points closest to the target sample and then assigns a label to the target sample by voting, as calculated in Equation (12). Once the target samples are labeled, statistical machine learning algorithms can be used to build prediction models. To achieve better transfer, we use the classic TrAdaBoost algorithm to build the prediction model.
In Equation (12), $I(\cdot)$ is the number of samples marked as positive (negative) among the k samples, and $\mathrm{sign}(x)$ is a sign function, where $\mathrm{sign}(x) = 1$ if $x \ge 0$ and $\mathrm{sign}(x) = 0$ otherwise.
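A minimal Python sketch of this labeling step follows. It assumes that the vote compares the count of positive neighbors with the count of negative neighbors and applies the sign rule stated above (ties go to the positive class), and that neighbors are found by Euclidean distance in the dissimilarity space; both choices and the function names are assumptions for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two dissimilarity vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def knn_label(target, labeled, k=3):
    """Label `target` by majority vote of its k nearest labeled samples.

    `labeled` is a list of (vector, label) pairs with label 1 (defective)
    or 0 (non-defective).
    """
    nearest = sorted(labeled, key=lambda item: euclidean(target, item[0]))[:k]
    positives = sum(1 for _, label in nearest if label == 1)
    negatives = len(nearest) - positives
    # sign(x) = 1 if x >= 0, else 0, applied to the vote difference.
    return 1 if positives - negatives >= 0 else 0
```

With an odd k the tie case never occurs for binary labels, which is one reason odd values are a common choice.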

DataSet
To verify the effectiveness of the DM-CPDP method, we evaluate the performance of the prediction model experimentally and compare it with other methods. We selected 14 datasets from NASA and 3 datasets from SOFTLAB, both obtained from the PROMISE repository. The PROMISE repository updates its datasets continually. Compared with the datasets used in some previous papers [8,12], the current datasets differ somewhat in the number of samples, but their internal structure is largely unchanged. Since this paper constructs the cross-project software defect prediction model from multi-source datasets, the impact of such changes in the number of samples on the experimental results is limited. To facilitate comparison with previous studies, we used previous versions of the datasets maintained by the PROMISE repository. These datasets are shown in Table 1. They are derived from different projects, so both the data distributions and the feature attributes differ. We use the attributes common to the target set and the source domain datasets to build the prediction models. Table 2 shows the software metrics used in the experiment, which mainly include McCabe metrics, Line Count metrics, basic Halstead metrics, and their extension, DHalstead metrics. To facilitate comparison, we performed the experiment in two parts: NASA to NASA and NASA to SOFTLAB.
NASA to NASA: Only NASA datasets are used as the target set and the training set. Each NASA dataset is chosen in turn as the target set, and the remaining datasets are used as the training set.
NASA to SOFTLAB: All NASA datasets are used as source domain datasets, and the SOFTLAB datasets are used as the target sets.
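The two experimental settings amount to enumerating (target, sources) pairs, which the following Python sketch makes explicit. The dataset names in the usage comment are placeholders, not the entries of Table 1, and the function names are hypothetical.

```python
def nasa_to_nasa_pairs(nasa):
    """Each NASA dataset in turn is the target; the rest are the sources."""
    return [(target, [s for s in nasa if s != target]) for target in nasa]

def nasa_to_softlab_pairs(nasa, softlab):
    """Every SOFTLAB dataset is a target; all NASA datasets are sources."""
    return [(target, list(nasa)) for target in softlab]

# Usage with placeholder names:
# nasa_to_nasa_pairs(["N1", "N2", "N3"])
# nasa_to_softlab_pairs(["N1", "N2", "N3"], ["S1", "S2"])
```

With 14 NASA and 3 SOFTLAB datasets this yields 14 pairs for the first setting and 3 pairs for the second.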

Performance Index
In this paper, the prediction results are measured by F-measure, AUC (Area Under ROC Curve), and cost-effectiveness. The performance indexes are calculated from the confusion matrix shown in Table 3. True Negative (TN) represents the number of negative samples predicted as negative. F-measure is determined by both recall and precision, and its value is close to the smaller of the two, so a larger F-measure means that both recall and precision are large. The formula is shown in (13), where $\alpha$ regulates the relative importance of precision and recall and is usually set to 1.
Recall is the ratio of the number of correctly predicted defective modules to the number of truly defective modules, indicating how many positive samples are correctly predicted:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{14}$$

Precision is the ratio of the number of correctly predicted defective modules to the number of all modules predicted as defective, indicating how many of the positive predictions are correct:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{15}$$

AUC is defined as the area under the receiver operating characteristic (ROC) curve. This performance indicator is one of the criteria for judging a binary classification model.
Cost-effectiveness refers to maximizing benefits for the same cost. It commonly measures the percentage of defects that the predictive model can identify by examining the top 20 percent of the samples [11,25]. Zhang et al. consider that the cost of software includes not only the effort of inspecting the defective modules but also the failure cost of classifying a defective sample as non-defective [26], since defective samples that are predicted incorrectly have a greater impact on the software. Therefore, they propose a measure based on the confusion matrix, and the cost-effectiveness is calculated according to Equation (16). Compared with other methods, this measure is more concise, and it is not affected by the order or the size of the defective modules. A smaller value means fewer false negatives in the prediction result and more defective modules predicted correctly, so the cost of failures caused by the software at a later stage is lower.

$$\text{cost-effectiveness} = \frac{FN}{FN + TN} \tag{16}$$
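The confusion-matrix indexes can be computed as in the Python sketch below. It assumes α = 1, for which the F-measure reduces to the harmonic mean of precision and recall, and returns 0.0 whenever a denominator is zero; the function name and the zero-division convention are choices made for this example.

```python
def defect_metrics(tp, fp, fn, tn):
    """Recall, precision, F-measure (alpha = 1) and cost-effectiveness."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    # Equation (16): share of modules predicted clean that are defective.
    cost_effectiveness = fn / (fn + tn) if (fn + tn) else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
        "cost_effectiveness": cost_effectiveness,
    }
```

For example, with TP = 30, FP = 10, FN = 20, and TN = 40, recall is 0.6, precision is 0.75, the F-measure is about 0.667, and the cost-effectiveness is about 0.333.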

Analysis of Results
By comparing with traditional single-source and multi-source defect prediction methods, we show that establishing prediction models in the dissimilarity space can improve the performance of the prediction model, and we demonstrate the superiority of the DM-CPDP method. The construction of a classifier in the dissimilarity space is mainly influenced by two factors: the dissimilarity metric and prototype selection. We compare the effects of different dissimilarity metrics and prototype selection methods on the experimental results. In addition, the necessity of multi-source data in CPDP is also verified.

Experiment on Different Methods
In order to verify the superiority of the DM-CPDP method, we compare it with traditional CPDP models.
In this part, we conduct experiments on NASA to NASA and NASA to SOFTLAB respectively. Among the many traditional CPDP methods, this paper chooses three classical methods for comparison: TNB, MSTrA, and HYDRA. In each experiment, 90% of the source data set and target set are selected randomly as the training set, and the remaining data of the target set are used as the test set. Each experiment is repeated 10 times, and the average values are reported. The experimental details of each method are as follows:
TNB: randomly selects a dataset as the training set and assigns weights to the training samples according to the target set using data gravitation. Finally, the defect prediction model is built using the NB algorithm.
MSTrA: selects multiple data sets as the training set and assigns weights to each training sample according to the target set using data gravitation. In each iteration, each source domain data set is combined with the target set to train a weak classifier, and the weak classifier with the lowest error rate on the target set in the current iteration is selected.
HYDRA: merges each candidate source dataset with the target dataset as a training set to construct multiple base classifiers, and then uses the genetic algorithm (GA) to find the optimal combination of base classifiers. After that, a process similar to the AdaBoost algorithm is used to obtain a linear combination of GA models.
The results are shown in Tables 4-6, where bold numbers indicate the optimal result on a dataset. It can be seen that the DM-CPDP method outperforms the existing algorithms on most data sets. In the NASA to SOFTLAB experiments, the performance of each algorithm on the SOFTLAB data sets is stable, and the DM-CPDP method is better than the other algorithms. Its average F-measure and AUC are higher than those of the other algorithms by 2.8%-27.3% and 1.7%-7.8% respectively. In addition, its average cost-effectiveness is lower than that of the other algorithms by 1%-4.9%.
In the NASA to NASA experiments, the average F-measure and AUC of the DM-CPDP method are higher than those of the other methods by 4.4%-28.8% and 5.5%-13.0% respectively, and its average cost-effectiveness is lower by 0.2%-1.8%. For the three metrics of F-measure, AUC, and cost-effectiveness, the DM-CPDP method shows the best performance on 9, 9, and 10 datasets respectively. The HYDRA method shows the best performance on 4, 4, and 3 datasets respectively. The MSTrA method shows the best performance on only one dataset, for the F-measure indicator. The TNB method performs best on one dataset for AUC and on one dataset for cost-effectiveness.
These differences are due to the influence of data distribution. Different machine learning methods behave differently on the same data set, and the same machine learning method performs differently on different datasets. Nevertheless, DM-CPDP is superior to the other algorithms in general, because it uses multi-source datasets and the dissimilarity representation. The comparison with these three classic CPDP methods indicates that the DM-CPDP method performs well in the field of CPDP.

Experiments on Multi-Source Data Sets and Dissimilarity Space
In order to verify the impact of multi-source data and the dissimilarity representation of samples on the performance of predictive models, we compare TrAdaBoost, Multi-source TrAdaBoost, the dissimilarity space based single-source CPDP (DS-CPDP) method, and the DM-CPDP method. All comparisons are based on F-measure, AUC, and cost-effectiveness. Each experiment is repeated 10 times, and the average values are reported. The experimental details of each method are as follows:
TrAdaBoost: randomly selects 90% of the source domain data set and target set as the training set, and uses the remaining target set as the test set. The prediction model is built using TrAdaBoost.
Multi-source TrAdaBoost: before the TrAdaBoost algorithm is used to build a prediction model, the EMD method is used to select multiple data sets that are highly correlated with the target set, and these data sets are combined with the target set as the training set. Finally, the TrAdaBoost algorithm is used for modeling.
DS-CPDP: randomly selects a data set from the source domain and applies the dissimilarity transformation to obtain the training set, then uses the KNN method to assign labels to the unlabeled target samples, and finally uses the TrAdaBoost method to build the model.
The experimental results are shown in Tables 7-9, where bold numbers indicate the optimal result on a dataset.
In order to demonstrate the importance of multi-source data, we verify it in the dissimilarity space and in the feature space respectively. Comparing the results of DM-CPDP and DS-CPDP shows that, in the dissimilarity space, DM-CPDP is superior to DS-CPDP on all three indexes of F-measure, AUC, and cost-effectiveness. On the two series of datasets, its average F-measure is higher than that of DS-CPDP by 6.4% and 3.2% respectively, its average AUC is higher by 7.0% and 5.4%, and its average cost-effectiveness is lower by 1.1% and 0.4%. Comparing the results of TrAdaBoost and Multi-source TrAdaBoost shows that, in the feature space, Multi-source TrAdaBoost is better than TrAdaBoost: its average F-measure is higher by 2.1% and 8.9%, its average AUC is higher by 2.3% and 7.6%, and its average cost-effectiveness is lower by 3.5% and 0.8%. These results indicate that the CPDP method based on multi-source data can improve the performance of prediction models, whether in the dissimilarity space or in the feature space. Multi-source data is superior to single-source data because a multi-source method in CPDP can not only increase the useful information by providing sufficient data, but also effectively avoid the problem of negative transfer. Besides, in the process of modeling, we also filter the multi-source datasets and select the data highly correlated with the target set as the training set, so that the data distributions of the training set and the target set are as similar as possible. Thus, multi-source data can improve the predictive performance of classifiers.
In order to verify that constructing the dissimilarity space can improve the performance of the classifier, two pairs of experiments are compared: TrAdaBoost with DS-CPDP, and Multi-source TrAdaBoost with DM-CPDP. According to the experimental results in Tables 7-9, the average F-measure of DS-CPDP is higher than that of the TrAdaBoost algorithm by 13.2% and 11.4% respectively, its average AUC is higher by 8.2% and 10.6%, and its average cost-effectiveness is lower by 5.7% and 1.3%. Comparing DM-CPDP with Multi-source TrAdaBoost, we find that the former is better than the latter in the average value of the performance indicators. The average F-measure of the former is higher than that of the latter by 17.5% and 5.7% respectively, its average AUC is higher by 12.9% and 8.4%, and its average cost-effectiveness is lower by 3.3% and 0.9%. Therefore, the results indicate that the model established in the dissimilarity space is better than the one built in the feature space.
There are three reasons why constructing the dissimilarity space can improve the performance of prediction models.
Firstly, a classification algorithm essentially establishes the classifier by analyzing the intrinsic relationship between the feature attributes and the class labels in the datasets. If the feature attributes of the samples have weak discriminant ability for the class labels, the performance of the prediction model suffers, and building a dissimilarity space can alleviate this problem. When constructing the space, we use the dissimilarity between samples instead of the original feature attributes, so that the intrinsic structure information of the dataset can be obtained. Thus, the discriminant ability of the sample attributes for the class labels is enhanced.
Secondly, the DM-CPDP method uses data from the target set as the prototype set when constructing the dissimilarity space. Therefore, mapping the samples of the source domain to the dissimilarity space essentially performs comprehensive transfer learning. A training set with a data distribution close to that of the target set can thus be obtained, which better meets the assumption of identical data distribution.
Finally, when the source domain and target domain data are used to construct the dissimilarity space, each sample is represented by its dissimilarities to the prototype set. If the dissimilarity between the samples is small, a larger attribute value is obtained, so that the samples with high similarity to the target set gain more attention during the modeling process, and the performance of the classifier is improved.
By analyzing the above reasons, we can conclude that using multi-source datasets and the dissimilarity space can improve the performance of the classifier, and the experimental results support this conclusion well.
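The mapping of samples into the dissimilarity space can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the degree-0 arc-cosine kernel, k(x, y) = 1 - θ/π with θ the angle between x and y, because this section does not state which degree of the kernel is used. The function names `arc_cosine_kernel` and `to_dissimilarity_space` and the toy vectors are hypothetical.

```python
import math


def arc_cosine_kernel(x, y):
    """Degree-0 arc-cosine kernel (assumed form): 1 - angle(x, y) / pi."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("the angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return 1.0 - math.acos(cos_theta) / math.pi


def to_dissimilarity_space(samples, prototypes):
    """Represent each sample by its kernel value to every prototype."""
    return [[arc_cosine_kernel(x, r) for r in prototypes] for x in samples]


# Hypothetical prototype set (from the target set) and source samples.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
samples = [[2.0, 0.0], [1.0, 1.0]]
mapped = to_dissimilarity_space(samples, prototypes)
```

Each mapped sample has one attribute per prototype. In this sketch, a sample that points in the same direction as a prototype receives the largest attribute value (1.0), which matches the observation above that a small dissimilarity yields a larger attribute value.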

Different Dissimilarity Metric Methods
In this part, we compare the effects of several different dissimilarity metrics on the experimental results and verify which measurement is effective in CPDP.
Figure 4 shows box plots of the performance indicators, which compare the values of several different dissimilarity metrics on F-measure, AUC, and cost-effectiveness. Euclidean distance, Manhattan distance, and the correlation coefficient are chosen as the alternative measurements of dissimilarity.
The experimental results show that the prediction models using Manhattan distance and Euclidean distance as the measurement of dissimilarity perform poorly. Their median, quartile, maximum, and minimum values on F-measure and AUC are lower than those of the arc-cosine kernel, and their values on cost-effectiveness are higher. During the course of the experiment, we found that the average values of the arc-cosine kernel method on F-measure and AUC are higher than those of the three other methods, and that its cost-effectiveness index is lower. In addition, the box plot shows that when the correlation coefficient is used as the measurement, the median, quartile, maximum, and minimum values on F-measure, AUC, and cost-effectiveness are still inferior to those of the arc-cosine kernel, but superior to those of the Manhattan and Euclidean distances.
The reasons for these results are as follows. When the relationship between samples is measured by Euclidean distance or Manhattan distance, a sample that is highly correlated with the prototype data receives a lower attribute value in the dissimilarity space, so the samples with the same distribution as the target set receive less attention in the process of modeling. The correlation coefficient behaves in the opposite way: samples with the same distribution receive more attention when the classifier is built. The arc-cosine kernel function is superior to the other methods because the kernel function can better represent the intrinsic relationship between samples, and more attention is paid to the data that is highly correlated with the target sample in the process of modeling. Thus, it can be concluded that using the arc-cosine kernel function to measure the dissimilarity between samples works better.
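The contrast between the distance-based measurements and the correlation coefficient can be illustrated with a minimal sketch. The three functions below are the textbook definitions of Euclidean distance, Manhattan distance, and the Pearson correlation coefficient; the prototype and the two samples are hypothetical values chosen only to show the direction of each measurement.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def correlation(x, y):
    """Pearson correlation coefficient between two attribute vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0.0 or std_y == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (std_x * std_y)


# Hypothetical prototype and two samples: one close to it, one unrelated.
prototype = [1.0, 2.0, 3.0]
near = [1.1, 2.1, 2.9]
far = [3.0, 1.0, 0.5]

for name, measure in [("euclidean", euclidean),
                      ("manhattan", manhattan),
                      ("correlation", correlation)]:
    print(name, round(measure(prototype, near), 3), round(measure(prototype, far), 3))
```

For the two distances, the sample close to the prototype receives the smaller attribute value, whereas for the correlation coefficient it receives the larger one, which is the behavior discussed above.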

Different Prototype Set Selection Methods
In this part, we compare the effects of several different prototype set selection methods on the experimental results.
Figure 5 compares three different methods of prototype set selection for DM-CPDP, namely the random algorithm, the K-means algorithm, and the DPC method. For the random method, researchers generally select 3%-10% of the dataset [27]. When using the random method, we select r samples as the prototype set, where r = log I and I is the number of instances, and we repeat the experiment 10 times and take the mean. When using the K-means method, we also select r samples as the initial cluster centers for clustering and take the output cluster centers as the prototype set.
It can be seen from the box plot that the density-based prototype selection method used in this paper is better than the random selection method and the K-means clustering method. In Figure 5, the density-based prototype selection method used by DM-CPDP has higher median, quartile, maximum, and minimum values on F-measure and AUC than the other two methods, and its cost-effectiveness values are lower. The K-means clustering method performs the worst, so it is not suitable for prototype selection.
The reason is that the K-means algorithm is not suitable for non-spherical clusters and is greatly affected by outliers. Although the average effect of random selection is good, the prediction results are unstable due to the randomness of the instance selection. Therefore, it is reasonable to use the DPC method for prototype selection.

Experiment on Different Dataset Selection Methods
In this part of the experiment, we compare the effects of different dataset selection methods on the experimental results.
Since the samples have been mapped into the dissimilarity space, the dataset selection method based on extracting feature vectors is no longer applicable. Thus, we compare the EMD method with another method. In each dissimilarity subspace, we select the value of the smallest attribute of each sample and take the mean of these values to measure the similarity between the source domain data set and the prototype set. Then, the values are sorted in descending order and the first α datasets are selected as the training data. This approach is named Method 1 [7].
The experimental results are shown in Table 10, where bold numbers indicate the optimal result on a dataset. The results show that the dataset selection method based on EMD is superior to Method 1 in F-measure, AUC, and cost-effectiveness. The reason is that, when comparing the similarity between datasets, Method 1 loses more information by selecting the minimum attribute and averaging these attribute values, which leads to inaccurate measurement results. In contrast, the EMD method takes every attribute of the sample into account, so the performance of the prediction model is better.
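The selection rule of Method 1, as described in words above, can be sketched as follows. This is only an illustration of that description: the function names `method1_score` and `select_datasets`, the dataset names, and the attribute values are hypothetical.

```python
def method1_score(dissim_matrix):
    """Mean, over all samples, of each sample's smallest attribute value
    in the dissimilarity subspace (one row per sample, one column per prototype)."""
    return sum(min(row) for row in dissim_matrix) / len(dissim_matrix)


def select_datasets(datasets, alpha):
    """Sort the source datasets by score in descending order and keep the first alpha."""
    ranked = sorted(datasets, key=lambda name: method1_score(datasets[name]),
                    reverse=True)
    return ranked[:alpha]


# Hypothetical source datasets already mapped into the dissimilarity space.
datasets = {
    "A": [[0.9, 0.8], [0.7, 0.95]],
    "B": [[0.4, 0.6], [0.5, 0.3]],
    "C": [[0.8, 0.6], [0.9, 0.7]],
}
chosen = select_datasets(datasets, alpha=2)
```

Each dataset is reduced to a single number that depends only on one attribute per sample, which illustrates why this rule discards more information than a measure that uses every attribute.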

Threats to Validity
The main factor affecting the internal validity of the experiment is deviation in the code implementation. For example, we use the HYDRA algorithm for comparison. This algorithm relies on a genetic algorithm, whose random factors may cause the experimental results to deviate. To address this problem, we reduce the effect of this randomness through repeated runs in the code implementation. In addition, we compare the results of the reproduced algorithms with the data in the previous papers, and they are basically consistent.
The factors that influence the external validity of the experiment are the quality of the datasets and generalization. Regarding dataset quality, we choose NASA and SOFTLAB from PROMISE, which are publicly available datasets often used by researchers. Each dataset contains data from different projects, which reduces the overall impact of data quality on the experimental results. In terms of generalization, the data we used to train the model were derived from two open source data sets. These two data sets contain data from 17 projects with a total of 19458 samples, which supports the credibility of the experimental results.
The factors affecting the validity of the argument are mainly the selection of performance indexes and comparison algorithms. For the performance indexes, we use F-measure, AUC, and cost-effectiveness to measure the performance of the model. F-measure is one of the most commonly used evaluation criteria in defect prediction and measures the balance between recall and precision. AUC is an indicator that evaluates the overall performance of the model. Cost-effectiveness is used to evaluate the cost of defect inspection in defect prediction. For the comparison algorithms, we choose three representative algorithms that are cited by many researchers as baselines. The comparison with these classic algorithms supports the generality of our algorithm.

Conclusions
The contribution of this paper is a method for establishing the cross-project defect prediction model in the dissimilarity space, which provides a new research idea for CPDP. The basic idea is to use the density-based clustering method to automatically select the cluster centers of the target set as the prototype set. Then the arc-cosine kernel is used to calculate the dissimilarity between the prototype set and the source domain samples as well as the target samples to form the dissimilarity space. After that, the training data sets are selected by the EMD method, which calculates the cost of converting the data distribution of the source domain into the data distribution of the target set, and the corresponding samples with a small cost are selected as the training set in the dissimilarity space. Finally, the prediction model is established by the TrAdaBoost algorithm.
In the whole data processing procedure, we complete two transfers. The first transfer is the dissimilarity measure between samples. In the dissimilarity space, each attribute of each sample is the pairwise dissimilarity between the source domain sample and a prototype sample, and the source domain samples that are highly correlated with the prototype set have higher measurement values. This method is more flexible than the traditional method of assigning weights to the samples in the feature space. The second transfer is the dataset selection stage. In this stage, we make full use of the representation of samples in the dissimilarity space and use the EMD method to measure the similarity between the source domain datasets and the prototype set. Experiments show that the DM-CPDP method has better prediction performance than the traditional methods.

Figure 1. The framework of the multi-source cross-project defect prediction method based on dissimilarity space (DM-CPDP).

The density-based clustering method computes a local density $\rho_i$ and a local distance $\delta_i$ for each point. The local density $\rho_i$ is equal to the number of points that are closer than $d_c$ to point $i$. For the point $x_i$ with the highest local density, $\delta_i$ is the greatest distance to any other point; otherwise, $\delta_i$ is the distance from $x_i$ to the nearest data point with greater local density. Then, the cluster centers are selected by considering $\rho_i$ and $\delta_i$ in combination, as shown in formula (4). Because $\rho_i$ and $\delta_i$ may be in different orders of magnitude, we normalize these two quantities, where $z(\cdot)$ denotes the normalization process. As shown in Figure 2, the points are sorted in descending order of $\gamma_i$ and plotted as a graph. The values of the non-cluster centers change smoothly, and there is a clear jump from the non-cluster centers to the cluster centers. The points above the inflection point are chosen as the cluster centers.

$$\gamma_i = z(\rho_i) \times z(\delta_i) \qquad (4)$$
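The density-based selection of cluster centers can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the paper: Euclidean distance between points, min-max scaling as the normalization z(·), and a fixed number k of centers instead of reading the inflection point from the sorted graph. The function names `density_peak_scores` and `select_prototypes` and the toy points are hypothetical.

```python
import math


def density_peak_scores(points, d_c):
    """Return gamma_i = z(rho_i) * z(delta_i) for every point."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho_i: number of other points closer than d_c to point i.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # delta_i: distance to the nearest point with greater density, or the
    # greatest distance to any point when no denser point exists.
    delta = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(denser) if denser else max(dist[i]))

    def z(values):
        # Min-max scaling (an assumed choice of normalization).
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

    return [r * d for r, d in zip(z(rho), z(delta))]


def select_prototypes(points, d_c, k):
    """Pick the k points with the largest gamma as the prototype set."""
    gamma = density_peak_scores(points, d_c)
    ranked = sorted(range(len(points)), key=lambda i: gamma[i], reverse=True)
    return [points[i] for i in ranked[:k]]


# Hypothetical target-set samples forming two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.05, 5.05)]
prototype_set = select_prototypes(points, d_c=0.12, k=2)
```

In this toy example, the point at the middle of each group combines a high local density with a large distance to any denser point, so the two middle points obtain the largest values of $\gamma_i$ and are selected as prototypes.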

Figure 2. Cluster center selection diagram. (a) is the descending arrangement diagram of $\gamma_i$; (b) is the display of the cluster center points selected by graph (a) in the data sample.

Figure 3. The process of data mapping.

Algorithm 3. Selection of multi-source datasets
Input: D(X_Su, R): representation of source domain data set X_Su in space D
       u: the number of source domain data sets in space D
Output: training dataset
1   for each D(X_Su, R)_{n×r} in D
2     for each x_i in D(X_Su, R) do
3       compute the move cost d_ij according to Equation (9)
4       compute the optimum solution f_ij according to Equation (8) and the restrictions of Equation (11)
5       compute the cost function D(X_Su, R) according to Equation (10)
6     end for
7   end for
8   sort D(X_Su, R)_u in descending order
9   select the first α datasets as the training dataset
10  return training dataset

Figure 4. Box plot of different dissimilarity metric methods. (a) is the situation of each method on F-measure; (b) is the situation of each method on AUC; (c) is the situation of each method on cost-effectiveness. Box plots represent maximum, upper quartile, median, lower quartile, and minimum values from top to bottom. In addition, circles represent outliers.

Figure 5. Box plot of different prototype selection methods. (a) is the situation of each method on F-measure; (b) is the situation of each method on AUC; (c) is the situation of each method on cost-effectiveness. Box plots represent maximum, upper quartile, median, lower quartile, and minimum values from top to bottom. In addition, circles represent outliers.

Table 4. Comparison of performance indicators F-measure.

Table 5. Comparison of performance indicators AUC.

Table 6. Comparison of performance indicators cost-effectiveness.

Table 7. F-measure value of different model construction methods. DS-CPDP: dissimilarity space based single-source CPDP.

Table 8. AUC value of different model construction methods.

Table 9. Cost-effectiveness value of different model construction methods.

Table 10. Different dataset selection methods.