Research of Software Defect Prediction Model Based on Complex Network and Graph Neural Network

The goal of software defect prediction is to predict defective modules by mining historical data with models. Current software defect prediction models mainly focus on the code features of software modules and ignore the connections between modules. This paper proposes a software defect prediction framework based on graph neural networks from a complex network perspective. Firstly, we consider the software as a graph, where nodes represent classes and edges represent the dependencies between classes. Then, we divide the graph into multiple subgraphs using a community detection algorithm. Thirdly, the representation vectors of the nodes are learned through an improved graph neural network model. Lastly, we use the representation vector of each node to classify software defects. The proposed model is tested on the PROMISE dataset using two graph convolution methods, one based on the spectral domain and one based on the spatial domain. With the two convolution methods, the model reached accuracy, F-measure, and MCC (Matthews correlation coefficient) values of 86.6%, 85.8%, and 73.5%, and 87.5%, 85.9%, and 75.5%, respectively. Compared with the benchmark models, the average improvement in these metrics was 9.0%, 10.5%, and 17.5%, and 6.3%, 7.0%, and 12.1%, respectively.


Introduction
Software defect prediction is an indispensable part of software development because it can reduce the time and effort required for software testing during development. Software defect prediction is divided into two parts: the construction of software metrics [1], which counts features in the software code, and model design, which involves designing algorithms suited to the learning task and the software metrics in order to predict defects.
Traditional machine learning methods directly use software code features (such as changes in data and previous defects) to classify software defects. For example, Liu et al. [2] addressed the class imbalance problem using the SMOTE (synthetic minority oversampling technique) algorithm, addressed the data noise problem using the ENN (extended nearest neighborhood) algorithm, optimized a four-layer BP (backpropagation) network using the simulated annealing algorithm, and used it to predict the classification. Bashir et al. [3] proposed a feature selection method based on maximum likelihood logistic regression, which helped to select optimal feature subsets and predicted defective modules more accurately. Goyal [4] proposed a new filtering technique to effectively predict defects using support vector machines for the imbalanced data classification problem. The input of the prediction model based on machine learning is dependent on the
(1) A graph neural network is applied to the complex network for software defect prediction. The graph neural network combines the structure of the software class graph with the software's class-level measurement elements (node-level features, e.g., prior faults and new data) to learn new feature vectors. This is an additional consideration in our model compared with previous models, which considered only the software graph structure or only the software defect measurement elements.
(2) The community detection algorithm is used to decompose the software graph structure into multiple subgraphs, and all the subgraphs are used as the input of the graph neural network model. This further simplifies the software graph structure, and each learned graph structure is a closely connected subgraph.
(3) The graph convolutional neural network is improved so that the graph neural network can learn the graph structure features that are conducive to software defect prediction.
The remainder of this paper is organized as follows: Section 2 introduces the background knowledge of the software graph structure, then introduces community detection algorithms, and finally proposes a framework for software defect prediction. Section 3 presents the experimental environment, evaluation metrics, experimental setup, and experimental procedure. Section 4 discusses the results. Section 5 provides the conclusions and future work.

Materials and Methods
The complex network [10] is a method for analyzing complex systems. Complex networks abstract complex systems into graphs and help people understand those systems by analyzing characteristics of the graphs. Complex networks developed out of network science, starting from the original Seven Bridges of Königsberg problem [11]. Telecommunication networks, computer networks, biological networks, cognitive semantic networks, and social networks are all common complex networks in everyday life; each treats the elements of the system as nodes and the connections between elements as edges.

Software Class Dependency Network
Software is a complex system; hence, it can be abstracted as a network for analysis. The classes in the software source code are regarded as nodes in the network, while the dependencies between classes are regarded as the edges of the network [12]. In software defect prediction, each node also carries software defect metric information; thus, the node information is also regarded as a part of the software graph network.
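As a minimal sketch of this abstraction (not the authors' tooling), the snippet below turns a list of class dependencies and per-class metric values into an adjacency matrix and a node feature matrix. The function name `build_class_graph`, the class names, and the metric values are hypothetical, and treating dependencies as undirected edges is an assumption made for illustration.

```python
def build_class_graph(dependencies, metrics):
    """Build (classes, adjacency matrix, feature matrix) for a class graph.

    dependencies: iterable of (class_a, class_b) pairs, one per dependency.
    metrics: dict mapping class name -> list of metric values (node features).
    Assumption: dependencies are treated as undirected edges.
    """
    classes = sorted(metrics)                      # fixed node order
    index = {name: i for i, name in enumerate(classes)}
    n = len(classes)
    adj = [[0] * n for _ in range(n)]
    for src, dst in dependencies:
        # Skip self-dependencies and classes without metric values.
        if src in index and dst in index and src != dst:
            adj[index[src]][index[dst]] = 1
            adj[index[dst]][index[src]] = 1
    features = [list(metrics[name]) for name in classes]
    return classes, adj, features


# Hypothetical three-class example: A depends on B, B depends on C.
classes, adj, features = build_class_graph(
    [("A", "B"), ("B", "C")],
    {"A": [1.0, 2.0], "B": [3.0, 4.0], "C": [5.0, 6.0]},
)
```

The adjacency matrix and the feature matrix returned here play the roles of the graph structure and the node information described above.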

Community Detection
Using the network's structural information, community detection partitions the network into smaller subnetworks. Nodes inside a community are closely connected, while nodes in different communities are less connected. Depending on the type of network, community detection can be divided into two categories: static network community detection and dynamic network community detection [10]. Modularity-based community partitioning is representative of the static techniques. Modularity Q was first presented by Newman and Girvan [13] in 2004 to assess the quality of a community division. Since then, many researchers have devised techniques that optimize the Q-value. Among them, the Louvain algorithm [14] proposed by Blondel et al. is widely used because of its ability to discover communities quickly. The Louvain algorithm can be divided into two stages: (1) Every node starts off as its own community. If moving a node from its current community to the community of a neighboring node yields a modular gain greater than zero, the node joins the community of that neighbor; otherwise, it stays in its current community. This is repeated until no move of any node yields a modular gain greater than zero. (2) A new network is created using the communities acquired in the previous stage as nodes. The connection weight between two new nodes is the sum of the weights of all edges in the original network between the two communities. Each new node has a self-loop whose weight is the total number of connections between the original nodes inside the community. Stage 1 is then repeated on the new network until there is no further gain.
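The second stage can be sketched as follows. This is an illustrative pure-Python version, not the implementation used in the experiments; the function name `aggregate_communities` and the example graph are hypothetical, and each internal edge is counted once toward the self-loop weight (an assumption, since conventions differ between implementations).

```python
from collections import defaultdict


def aggregate_communities(edges, community_of):
    """Collapse every community into a single node (second Louvain stage).

    edges: iterable of (u, v, weight) for an undirected graph, each edge once.
    community_of: dict mapping node -> community id.
    Returns {(c1, c2): weight} with c1 <= c2; (c, c) entries are self-loops.
    """
    new_weights = defaultdict(float)
    for u, v, w in edges:
        cu, cv = community_of[u], community_of[v]
        key = (cu, cv) if cu <= cv else (cv, cu)
        new_weights[key] += w          # edges inside a community become a self-loop
    return dict(new_weights)


# A triangle (community 0) linked by one edge to a pair of nodes (community 1).
edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1), ("c", "d", 1), ("d", "e", 1)]
community_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
collapsed = aggregate_communities(edges, community_of)
```

In this example the triangle becomes a node with a self-loop of weight 3, the pair becomes a node with a self-loop of weight 1, and the two new nodes are joined by an edge of weight 1.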

Graph Neural Network
The graph neural network operates on graphs, which generally represent non-Euclidean relationships. The concept of a graph neural network (GNN) was proposed in 2005 [15]; later, in 2009, Scarselli et al. [16] defined the theoretical basis of GNNs. With the success of convolutional neural networks, scholars integrated the idea of the convolution operator into GNNs, yielding graph convolutional neural networks (GCNs). There are two types of GCNs: those based on the spectral domain and those based on the spatial domain [17].
(1) GCNs based on the spectral domain include SCNN (spectral CNN) [18], ChebNet (Chebyshev spectral CNN) [19], and GCN [20]. The spectral domain convolution maps the graph topology into the spectral domain through the discrete Fourier transform, and then defines its graph convolution operator. The GCN convolution process can be represented by the following formula:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$
where $\tilde{A} = A + I_N$ is the adjacency matrix of the undirected graph with added self-connections, $I_N$ is the identity matrix, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, $W^{(l)}$ is a layer-specific trainable weight matrix, $H^{(l)} \in \mathbb{R}^{N \times D}$ is the activation matrix of layer $l$, and $H^{(0)} = X$. $\sigma(\cdot)$ denotes an activation function.
(2) GCNs based on the spatial domain include GraphSAGE (graph sample and aggregate) [21], GAT (graph attention network) [22], and GIN (graph isomorphism network) [23]. Spatial convolution aggregates the feature vectors of the first-order adjacent nodes of a node and then combines them with the feature vector of the current node. The graph convolution formula of GIN is as follows. First, a graph $G = (V, E)$ is defined, in which $v \in V$ and the feature vector of each node is $X_v$. Then,
$$h_v^{(k)} = \mathrm{MLP}^{(k)}\left(\left(1 + \varepsilon^{(k)}\right) \cdot h_v^{(k-1)} + \sum_{u \in N(v)} h_u^{(k-1)}\right)$$
where $h_v^{(k)}$ denotes the representation vector of node $v$ at the $k$-th layer, $k$ denotes the iteration level, $N(v)$ represents the set of adjacent nodes of $v$, MLP is a multilayer perceptron, and $\varepsilon$ is a learnable parameter or a fixed parameter.
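To make the two propagation rules concrete, the following pure-Python sketch applies one GCN step and one GIN aggregation step to a three-node path graph. It is an illustration only: the function names are hypothetical, ReLU is assumed as the activation, the GIN multilayer perceptron is omitted (only its input is computed), and a real model would use a tensor library with trainable weights.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, h, w):
    """One GCN step: ReLU(D~^-1/2 (A + I) D~^-1/2 H W)."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]            # D~_ii = sum_j A~_ij
    a_hat = [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
             for i in range(n)]
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, value) for value in row] for row in z]   # assumed ReLU


def gin_aggregate(adj, h, eps=0.0):
    """Input to the GIN MLP: (1 + eps) * h_v + sum of neighbour vectors."""
    out = []
    for v in range(len(adj)):
        row = [(1.0 + eps) * x for x in h[v]]
        for u in range(len(adj)):
            if adj[v][u]:
                row = [r + x for r, x in zip(row, h[u])]
        out.append(row)
    return out


# Path graph 0 - 1 - 2 with one feature per node and an identity weight.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
h0 = [[1.0], [2.0], [3.0]]
gcn_out = gcn_layer(adj, h0, [[1.0]])
gin_out = gin_aggregate(adj, h0, eps=0.0)
```

The GCN step averages each node with its neighbors using symmetric degree normalization, while the GIN step sums neighbor vectors without normalization, which is the main practical difference between the two rules.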

Software Defect Prediction Model Based on Complex Network and Graph Neural Network
This model examines software from the perspective of the complex network: it abstracts the software into a graph network, learns the representation vectors of nodes using a graph neural network, and classifies nodes on the basis of these representation vectors. Figure 1 depicts the overall layout of the framework, which consists of two steps. The first step is data processing: using the class as the research granularity, the software source code is abstracted into multiple nodes, and a graph is created using the dependencies among classes; community detection techniques are then used to split the graph into subgraphs. In the second step, the edge relationships are stored in the adjacency matrix; the adjacency matrix and the node-level features together form the structural information of the graph and are the input to the graph neural network, which produces the representation vectors of the nodes. Lastly, the multilayer perceptron (MLP) is used to classify the nodes. Section 2.4.1 analyzes the first step of Figure 1, and Section 2.4.2 analyzes the second step of Figure 1.

Data Processing
The current software defect prediction models ignore the interdependence within the complex system formed by the software code. To map software systems into a graph, this research abstracts software systems from the perspective of complex networks [24]. To obtain the class dependency graph, we use a well-known technique. Additionally, the software defect measurement elements of each class are taken as the node-level feature vector X, and the class dependencies are transformed into an adjacency matrix A. Consequently, the software graph can be represented as G = (A, X). To further simplify the software graph and to make the learned representation of the graph more effective, we use the Louvain algorithm to divide the graph into subgraphs. The specific steps are as follows:
(1) First, a modularity Q is defined, which is used to judge the quality of the division; its value is between −1 and 1. The formula is as follows:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta\left(c_i, c_j\right)$$
where $m$ is the number of connections in the network, and $i$, $j$ represent any two nodes in the network. When they are connected, $A_{ij}$ is 1; otherwise, it is 0. $k_i$ indicates the degree of node $i$, $c_i$ indicates the community of node $i$, and $\delta(c_i, c_j)$ indicates whether nodes $i$ and $j$ are in the same community: if so, it is 1; otherwise, it is 0.
(2) Initially, each node forms its own community, so the number of communities equals the number of nodes in the current network; the modularity is calculated at this point.
(3) For each node $i$, we consider each neighbor $j$ and evaluate the modular gain obtained by removing $i$ from its original community and placing it in the community of $j$. Node $i$ is moved to the community with the largest gain, provided that this gain is greater than 0. If the gain for all communities is less than or equal to 0, the node stays in its community. This process is applied to all nodes repeatedly and sequentially until there is no further improvement, at which point this step ends.
The modular gain is calculated as follows:
$$\Delta Q = \left[ \frac{\sum_{in} + k_{i,in}}{2m} - \left( \frac{\sum_{tot} + k_i}{2m} \right)^2 \right] - \left[ \frac{\sum_{in}}{2m} - \left( \frac{\sum_{tot}}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right]$$
where $\sum_{in}$ is the number of edges in the community $c$, $\sum_{tot}$ is the total degree of the nodes in the community $c$, $k_i$ is the degree of node $i$, $k_{i,in}$ is the number of connections between node $i$ and the nodes in community $c$, and $m$ is the number of connections in the network.
(4) The communities obtained in the previous step are taken as nodes, and a new network is constructed. The connection weight between two new nodes is the sum of the weights of all edges in the original network between the two communities. Each new node has a self-loop whose weight is the sum of the connections between the original nodes in the community. Then, step 3 is repeated for the new network until there are no further gain updates, and the algorithm ends.
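The modularity and the gain of a single move can be checked numerically with a short sketch. The code below is illustrative and assumes an unweighted, undirected graph without self-loops; the function names are hypothetical. The closed form in `move_gain` is an algebraic simplification derived for this sketch (an assumption, not the expression as printed in the paper), chosen so that it equals the change in Q exactly when an isolated node joins a community.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Q for an unweighted, undirected graph given as a list of (u, v) edges."""
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)      # edges with both ends in the same community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    total = defaultdict(int)         # sum of degrees per community
    for node, k in degree.items():
        total[community_of[node]] += k
    # Equivalent to 1/(2m) * sum_ij [A_ij - k_i k_j / (2m)] * delta(c_i, c_j).
    return sum(internal[c] / m - (total[c] / (2 * m)) ** 2 for c in total)


def move_gain(edges, community_of, node, target):
    """Change in Q when `node`, alone in its community, joins `target`."""
    m = len(edges)
    k_i = sum(1 for u, v in edges if node in (u, v))
    k_i_in = sum(1 for u, v in edges
                 if (u == node and community_of[v] == target)
                 or (v == node and community_of[u] == target))
    sigma_tot = sum((community_of[u] == target) + (community_of[v] == target)
                    for u, v in edges)
    return k_i_in / m - sigma_tot * k_i / (2 * m * m)


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
before = {0: "a", 1: "a", 2: "b", 3: "c", 4: "c", 5: "c"}   # node 2 is alone
after = {0: "a", 1: "a", 2: "a", 3: "c", 4: "c", 5: "c"}
gain = move_gain(edges, before, node=2, target="a")
```

Here the gain is positive, so node 2 would join the first triangle, and the resulting partition into the two triangles has a higher Q than the starting partition.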
The software graph structure can be divided into multiple subgraphs through the Louvain algorithm; therefore, the graph can be represented by $G = \{G_1, G_2, \ldots, G_n\}$ with $G_i = (A_i, X_i)$, where $A_i$ and $X_i$ respectively represent the adjacency matrix of the edges in the $i$-th subgraph and the software defect measurement values of its nodes.

Learning and Classification of the Node Representation Vector
The graph neural network can use both the explicit feature information of the nodes and the structural information of the graph network to learn the representation vectors of the nodes. This addresses the issue that current software defect prediction models consider only one of the two. The input of the graph neural network model is the data obtained from the data processing in Section 2.4.1. The architecture of the graph neural network is shown in Figure 2. The framework can be divided into two parts: the graph convolution process, which learns the representation vectors of the nodes on the graph, and the classifier, covered in item (2) below, which uses multilayer perceptrons to classify and combines the outcomes of the layers through weights into the final result.
(1) The node representation vector is learned using the graph neural network. Each subgraph undergoes multilayer graph convolution so that nodes gain deep semantic information, and each layer's representation vector is described by the following formulas:
$$H_i^{l+1} = \mathrm{GNN}\left(A_i, H_i^{l}\right), \quad i = 1, \ldots, num\_subgraph$$
$$L^{l} = \mathrm{cat}\left(H_1^{l}, H_2^{l}, \ldots, H_{num\_subgraph}^{l}\right)$$
where $L^{l}$ represents the representation vectors of all nodes at the $l$-th layer, $H_i^{l}$ represents the representation vectors of all nodes of the $i$-th subgraph after the $l$-th graph convolution, $H_i^{0}$ is the initial node information $X_i$ of each subgraph, num_subgraph represents the number of subgraphs of a software project, and num_gcn represents the number of graph convolution layers. The new representation vector $H_i^{l+1}$ is obtained by feeding the representation vector $H_i^{l}$ of the previous layer of the subgraph and its adjacency matrix $A_i$ into the graph convolution layer; the cat function concatenates the node representation vectors of all subgraphs into a whole, and GNN is the graph convolution.
(2) The classifier is built on the representation vectors learned by graph convolution. Because the output of each layer results from graph convolution at a different depth, an MLP predicts from the output of each layer, and this model assigns a learnable weight to each layer's prediction. The representation vectors can be utilized more efficiently in this way, and the precise formula is as follows:
$$out = \sum_{j} w_j \cdot \mathrm{MLP}_j\left(L^{j}\right)$$
where the sum runs over the graph convolution layers, $w_j$ is a learnable parameter, the initial value of which is set to (1/n, 1/n), MLP is a multilayer perceptron, whereby each layer's representation vector has its own MLP, $L^{j}$ is the representation vector of layer $j$, num_gcn is the number of graph convolution layers, and out represents the label obtained after the node passes through the model.
(3) The pseudocode of the method is given in Algorithm 1, which shows how each node learns a representation vector and how predictions are generated.

Algorithm 1
Input: The graph structure G = {G_1, G_2, ..., G_n}, where G_i = (A_i, X_i); A_i and X_i represent the adjacency matrix of the edges and the software defect measurement elements of the nodes of the i-th subgraph, respectively.
Output: The prediction result pred of the nodes.
for i in num_layer do
    for j in num_subgraph do
        Put the subgraph G_j = (A_j, X_j) into the graph convolution layer to learn the representation vector of the node;
    end for
    Concatenate the node representation vectors of all subgraphs into L^i, as in (1);
    Predict from L^i with the MLP of layer i and add the result to pred with weight w_i, as in (2);
end for
return pred

The algorithm above has two improvements. First, the graph structure of the software source code is divided into a number of smaller graphs, which are then used as inputs for the graph neural network model. Second, each layer's prediction results are weighted.
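The control flow of the method (per-subgraph convolution, concatenation, and weighted per-layer prediction) can be sketched in plain Python. This is a structural illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the convolution and the classifiers are passed in as ordinary callables, and the toy convolution and identity classifier in the example stand in for trained GCN/GIN layers and MLPs.

```python
def predict(subgraphs, conv_layers, classifiers, weights):
    """Weighted sum of per-layer predictions over community subgraphs.

    subgraphs: list of (adjacency matrix, feature matrix), one per community.
    conv_layers: one callable conv(adj, h) -> h_next per layer.
    classifiers: one callable per layer mapping a node vector to class scores.
    weights: one weight per layer (learnable in the real model).
    Returns one score list per node, in subgraph order.
    """
    hidden = [x for _, x in subgraphs]                 # H_i^0 = X_i
    pred = None
    for conv, clf, w in zip(conv_layers, classifiers, weights):
        # Graph convolution is applied to every subgraph separately.
        hidden = [conv(adj, h) for (adj, _), h in zip(subgraphs, hidden)]
        # cat: stack the node vectors of all subgraphs into one list.
        layer_repr = [row for h in hidden for row in h]
        scores = [clf(row) for row in layer_repr]
        if pred is None:
            pred = [[w * s for s in node] for node in scores]
        else:
            pred = [[p + w * s for p, s in zip(pn, node)]
                    for pn, node in zip(pred, scores)]
    return pred


def sum_conv(adj, h):
    """Toy stand-in for a graph convolution: h_v plus the sum of its neighbours."""
    return [[x + sum(adj[v][u] * h[u][d] for u in range(len(adj)))
             for d, x in enumerate(h[v])] for v in range(len(adj))]


subgraphs = [([[0, 1], [1, 0]], [[1.0], [2.0]]), ([[0]], [[5.0]])]
identity = lambda row: list(row)                       # stand-in for an MLP
out = predict(subgraphs, [sum_conv, sum_conv], [identity, identity], [0.5, 0.5])
```

Because convolution never crosses subgraph boundaries, the isolated node in the second subgraph keeps its value across layers, while the two connected nodes mix their features at every layer.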

Experimental Environment and Datasets
Experiments were performed on a Windows-based operating system; the language used was Python [25], and the graph neural network model was built with PyTorch [26] and torch-geometric.
The PROMISE dataset [27], a collection of open-source software projects, serves as the dataset in use. Six projects were picked from this dataset, all of which provide object-oriented measurement elements. The dataset is described in Table 1. The data exhibit a class imbalance problem. To address it, we first applied the NearMiss algorithm [28], an undersampling method that reduces the amount of data used for training and testing. During the experiments, tenfold cross-validation was used. Each time, 90% of the data were randomly selected for training and the remaining 10% were used for testing, and the reported results are the average over the 10 runs.
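The idea behind NearMiss undersampling can be sketched in a few lines. The text does not state which NearMiss variant or neighbor count was used, so the sketch below assumes the NearMiss-1 rule (keep the majority samples whose mean distance to their k nearest minority samples is smallest) with a hypothetical k; the function name and the sample points are illustrative.

```python
import math


def nearmiss1(majority, minority, k=3):
    """Undersample `majority` down to len(minority) samples (NearMiss-1 rule).

    Each majority sample is scored by its mean Euclidean distance to its
    k nearest minority samples; the lowest-scoring samples are kept.
    """
    def mean_knn_distance(x):
        nearest = sorted(math.dist(x, y) for y in minority)[:k]
        return sum(nearest) / len(nearest)

    ranked = sorted(majority, key=mean_knn_distance)
    return ranked[:len(minority)]


# Hypothetical 2-D feature vectors: two defective and four non-defective classes.
minority = [(0.0, 0.0), (1.0, 0.0)]
majority = [(0.5, 0.5), (10.0, 10.0), (0.0, 1.0), (8.0, 8.0)]
kept = nearmiss1(majority, minority, k=2)
```

After undersampling, both classes have the same number of samples, at the cost of discarding majority samples that lie far from the minority class.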

Evaluation Measures
To evaluate the proposed model, accuracy, F-measure, and MCC were used as evaluation measures, all of which are obtained from the confusion matrix. The confusion matrix [29] is shown in Table 2.
The F-measure is the harmonic mean of the precision rate and the recall rate. Precision rate P refers to the proportion of correctly classified positive samples among all samples classified as positive by the classifier, and recall rate R refers to the proportion of correctly classified positive samples among all actual positive samples. The value range is [0, 1], whereby a higher value indicates better classification. The formula is as follows:
$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F\text{-measure} = \frac{2 \times P \times R}{P + R}$$
where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives in the confusion matrix. The value range of MCC is [−1, 1]. A prediction with a value of 1 is considered to be perfect; a prediction with a value of 0 is considered to be no better than a random guess; and a prediction with a value of −1 is considered to be wholly incongruent with the actual result. The equation reads as follows:
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
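The three measures can be computed directly from the four confusion matrix counts, as in the following sketch. The function name `defect_metrics` is hypothetical, and returning 0 when a denominator is zero is a convention assumed here, not something stated in the text.

```python
import math


def defect_metrics(tp, fp, tn, fn):
    """Accuracy, F-measure, and MCC from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0                      # assumed convention for 0/0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f_measure": f_measure, "mcc": mcc}


perfect = defect_metrics(tp=50, fp=0, tn=50, fn=0)
chance = defect_metrics(tp=25, fp=25, tn=25, fn=25)
inverted = defect_metrics(tp=0, fp=50, tn=0, fn=50)
```

The three calls illustrate the ends and the midpoint of the MCC range: a perfect prediction, a prediction that carries no information about the label, and a prediction that is always wrong.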

Experimental Setup
To reduce the influence of experimental parameters, the training hyperparameters were set as shown in Table 3. The parameters of the traditional SVM (support vector machine) method were left at their defaults. The network structures of the other models are described below. The BP neural network comprised four layers. GCN was constructed in accordance with the model described in [20] and, without a separate classifier, derived the result directly via the graph convolution operation. The GIN structure was developed in accordance with [23]; the difference from the proposed framework is that there was no community division and no final weighted summation. CBGCN (community-based GCN) and CBGIN (community-based GIN) were built according to the graph neural network framework proposed in this paper. They differ in the graph convolution operation, with the former based on the spectral domain and the latter based on the spatial domain. The classifier MLP (multilayer perceptron) and the BP neural network were both fully connected. The specific parameters are shown in Table 4.

Experimental Procedure
This section presents some assumptions and limitations of the experiment, and then describes the specific steps of the experiment. The assumptions in this paper are as follows: (1) We focus on software defect prediction within a project, so the training and testing data are derived from one dataset. For example, when experimenting with the Ant dataset, the training set is selected from that dataset and the test set is the remainder of it. (2) A small number of defective classes causes the trained model to favor the non-defective classes. Therefore, class imbalance handling is applied to the entire dataset before training the model. (3) To better estimate the algorithm performance, tenfold cross-validation is used.
Under the above assumptions, the validity of the software defect prediction framework proposed in this paper can be verified. The specific experimental procedure is as follows:
Step 1: Using the code analysis tool, the software's class dependencies are extracted, and a CSV file is then generated.
Step 2: The labeled nodes and feature metrics are obtained for the nodes from the PROMISE dataset.
Step 3: NetworkX, a third-party Python package, is used to store the graph structure. Then, the python-louvain package is used to divide the graph structure into subgraphs.
Step 4: The NearMiss algorithm is applied to deal with data class imbalance. Then, 90% of the processed dataset is chosen at random for the graph neural network model's training, and 10% is chosen for its testing.
Step 5: The graph structure from Step 3 is used as the input to the graph neural network model. The labels of the training set chosen in Step 4 are used to train the network parameters.
Step 6: The 10% of the data held out in Step 4 are used for testing, and the performance on the evaluation metrics is calculated.
Step 7: The process is repeated 10 times from Step 4 onward.
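Steps 4 to 7 amount to repeating a random 90%/10% split and averaging the scores. The sketch below shows that loop in isolation; the function name `repeated_holdout`, the `train_and_score` callback, and the fixed seed are assumptions for illustration, and the callback stands in for training the graph neural network and computing one evaluation metric.

```python
import random


def repeated_holdout(samples, train_and_score, repeats=10, test_fraction=0.1, seed=0):
    """Average score over repeated random train/test splits.

    train_and_score(train, test) is a caller-supplied function that trains a
    model on `train` and returns one metric value measured on `test`.
    """
    rng = random.Random(seed)                # fixed seed for repeatability
    scores = []
    for _ in range(repeats):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        n_test = max(1, round(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)


# Dummy callback: reports the share of samples that ended up in the test set.
split_sizes = []

def dummy_score(train, test):
    split_sizes.append((len(train), len(test)))
    return len(test) / (len(train) + len(test))

average = repeated_holdout(list(range(100)), dummy_score)
```

Note that independent random splits can place the same sample in the test set more than once across repetitions, unlike a strict tenfold partition in which every sample is tested exactly once.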

Results and Discussion
The spectral domain-based graph convolutional neural network GCN and the spatial domain-based graph convolutional neural network GIN were chosen for the experiments to show that this model can improve the performance of software defect prediction. The models were consequently divided into two groups, the first consisting of SVM, BP, GCN, and CBGCN, and the second consisting of SVM, BP, GIN, and CBGIN. SVM and BP, two of the most fundamental machine learning algorithms, directly use the original feature vectors to predict software defects. GCN and GIN, the model frameworks from the original papers, use graph convolution to obtain the characteristics of the graph structure. CBGCN and CBGIN combine the framework proposed in this paper with these two graph convolution techniques, respectively.

Experimental Analysis of Graph Convolutional Neural Network Based on Spectral Domain
In this section, the graph convolution method based on the spectral domain is used as the convolution layer of this model. In order to verify that the graph convolution method can improve the performance of software defect prediction, in this work, in addition to the traditional method, the graph convolutional neural network [20] model was selected as a benchmark model. The results are shown in Table 5.
It can be seen from Table 5 that the proposed model achieved good experimental results in terms of accuracy, F-measure, and MCC on most datasets. On average, the accuracy was 7.6% higher than SVM, 5.8% higher than BP, and 13.6% higher than GCN. The F-measure was 10.7% higher than SVM, 7.8% higher than BP, and 13.1% higher than GCN. The MCC was 12.5% higher than SVM, 12.7% higher than BP, and 27.4% higher than GCN. The data were analyzed from two aspects: (1) Comparing CBGCN with SVM and BP, our model was better than the BP neural network and SVM on all metrics for every dataset except Ant. On the Ant dataset, the result of the BP neural network was also lower than that of SVM. For individual datasets, the parameters of BP need to be specially set to obtain the best performance, and the structure of the CBGCN classifier is the same as that of the BP neural network; therefore, the network parameter settings need to be adjusted for individual datasets. On average, however, CBGCN improved considerably; thus, it can be concluded that useful feature vectors can be learned by incorporating the spectral domain-based graph convolution method into this model. (2) Comparing CBGCN with GCN, the experimental results were very similar on the Lucene dataset, whereas the model improved greatly on the other datasets, and the averages of the evaluation measures were higher; therefore, it can be concluded that the model framework of this paper is more suitable for software defect prediction. To visually compare the performance of the CBGCN algorithm, the evaluation measures are presented as box plots [30] in Figure 3. The y-axis represents the evaluation metric score, and the x-axis represents the prediction method. The mean and median values are marked in the figures to aid analysis.
Figure 3 shows that the model proposed in this paper achieved the highest average value on every evaluation metric, and that the GCN framework on its own was insufficient for predicting software defects. However, according to the experimental results of CBGCN, the graph convolution method based on the spectral domain could enhance each evaluation metric of software defect prediction.

Experimental Analysis of Graph Convolutional Neural Network Based on Spatial Domain
In the experiments in this section, the spatial domain-based graph convolution method is used as the convolution layer of this model. To verify that this graph convolution method could also improve the performance of software defect prediction, the spatial domain-based graph neural network model [23] was selected as a benchmark model in addition to the traditional methods. The results are shown in Table 6. Table 6 shows that, on most datasets, the model proposed in this paper achieved good experimental results in terms of accuracy, F-measure, and MCC. On average, the accuracy was 8.5% higher than SVM, 6.8% higher than BP, and 3.5% higher than GIN. The F-measure was 10.8% higher than SVM, 7.9% higher than BP, and 2.4% higher than GIN. The MCC was 14.6% higher than SVM, 14.7% higher than BP, and 7.0% higher than GIN. The results were analyzed from two aspects: (1) Compared with BP, CBGIN improved on all datasets. It was still lower than SVM on the Ant project, but higher than CBGCN, demonstrating that the classifier network structure and the choice of graph convolution method can affect the outcomes. Individual datasets require adjusting the network hyperparameters. Overall, CBGIN showed a substantial improvement. Thus, it can be inferred that this model can acquire valuable feature vectors by incorporating the spatial domain-based graph convolution method. (2) Comparing CBGIN with GIN, the model improved across all datasets. We can conclude that the model presented in this paper is better suited for predicting software defects.
Similarly, to visually compare the performance of the CBGIN algorithm, the evaluation metrics are presented as box plots in Figure 4. The y-axis shows the evaluation metric score, and the x-axis shows the prediction method. The mean and median are marked in the figure to aid analysis. Figure 4 shows that CBGIN performed strongly across all evaluation metrics, suggesting that the enhanced GIN model uses the learned representation vectors more effectively and is better suited for software defect prediction.
The experimental results demonstrated that both the GCN and the GIN graph convolution methods can produce beneficial representation vectors that enhance the accuracy of software defect prediction. The proposed model outperformed plain GIN and GCN, showing that the framework suggested in this research can increase the accuracy of software defect prediction.

Conclusions and Future Work
In this paper, we mapped the software to a graph structure and simplified the software graph using its community structure, following complex network theory. Furthermore, we used the convolutional layers of the graph neural network to obtain the graph information of the software. In this way, the software was regarded in its entirety, and otherwise independent classes were linked through class dependencies for software defect prediction. GCN and GIN were selected as the graph convolution methods for the experiments, and the PROMISE dataset was used for verification. The experimental results show that the graph neural network could obtain better representation vectors of nodes, thereby improving the performance of software defect prediction.
This research highlights the importance of a software defect prediction framework based on multiple factors, modeling the software as a complex network that considers both the connections between software modules and the attributes of the modules. The following suggestions for future work can be derived from this experiment: (1) A more complex network can be constructed, for example, by incorporating developer information and semantic information of the software code into the network. (2) The graph neural network can be further improved by combining it with a community discovery algorithm.

Acknowledgments:
The authors would also like to thank Robert Andrew James and all the anonymous reviewers for their valuable comments.

Conflicts of Interest:
The authors declare no conflict of interest.