# Research of Software Defect Prediction Model Based on Complex Network and Graph Neural Network


## Abstract


## 1. Introduction

1. Application of a graph neural network to the complex network for software defect prediction: the graph neural network combines the structure of the software class graph with the class-level software metrics used as node-level features to learn new feature vectors. This is an additional consideration in our model compared with previous models, which considered only the software graph structure or only the software defect metrics.
2. Use of a community detection algorithm to decompose the software graph structure into multiple subgraphs, all of which are used as input to the graph neural network model. This further simplifies the software graph structure, and each learned subgraph consists of closely related nodes.
3. Improvement of the graph convolutional neural network so that it can learn graph structure features that are conducive to software defect prediction.

## 2. Materials and Methods

#### 2.1. Software Graph Structure

#### 2.1.1. Complex Network

#### 2.1.2. Software Class Dependency Network

#### 2.2. Community Detection

1. Each node starts as its own community. If moving a node from its current community to the community of a neighboring node yields a modularity gain greater than zero, the node joins that neighbor's community and its community affiliation changes; otherwise, the current assignment is kept. This continues until no single-node move produces a modularity gain greater than zero.
2. A new network is built in which each community obtained in the previous step becomes a node. The weight of the edge between two such nodes is the sum of the weights of all edges between the two communities in the original network; each node also carries a self-loop whose weight is the total number of connections among the original nodes inside its community. Step 1 is then repeated on the new network until no further gain update occurs.

#### 2.3. Graph Neural Network

1. GCNs based on the spectral domain include SCNN (spectral CNN) [18], ChebNet (Chebyshev spectral CNN) [19], and GCN [20]. Spectral-domain convolution maps the graph topology into the spectral domain through the discrete Fourier transform and then defines the graph convolution operator there. The GCN convolution process can be represented by the following formula:
    $$H^{(l+1)}=\sigma\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}\right),$$
2. GCNs based on the spatial domain include GraphSAGE (graph sample and aggregate) [21], GAT (graph attention network) [22], and GIN (graph isomorphism network) [23]. Spatial convolution aggregates the feature vectors of a node's first-order neighbors and then combines them with the feature vector of the node itself. The graph convolution formula of GIN is as follows:
    $$h_{v}^{(k)}=\mathrm{MLP}^{(k)}\left(\left(1+\epsilon^{(k)}\right)\cdot h_{v}^{(k-1)}+\sum_{u\in N(v)}h_{u}^{(k-1)}\right).$$
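
The GCN propagation rule above can be sketched in a few lines of numpy. This is a minimal illustration of the formula $H^{(l+1)}=\sigma(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}H^{(l)}W^{(l)})$, not the paper's implementation; the example graph, feature values, and weight matrix are arbitrary placeholders.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = ReLU(D~^-1/2 (A + I) D~^-1/2 H W).

    A: (n, n) adjacency matrix, H: (n, d_in) node features,
    W: (d_in, d_out) weight matrix (random/fixed here; learned in practice).
    """
    n = A.shape[0]
    A_tilde = A + np.eye(n)                  # A~ = A + I (add self-loops)
    d = A_tilde.sum(axis=1)                  # degrees of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # D~^-1/2
    H_next = D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W
    return np.maximum(H_next, 0.0)           # sigma = ReLU

# Tiny 3-node path graph with 2-dimensional node features
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.eye(3, 2)          # toy features
W = np.ones((2, 2))       # toy weights
out = gcn_layer(A, H, W)
print(out.shape)  # (3, 2)
```

Each output row mixes a node's own features with its neighbors' features, weighted by the symmetrically normalized adjacency matrix.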

#### 2.4. Software Defect Prediction Model Based on Complex Network and Graph Neural Network

#### 2.4.1. Data Processing

1. First, a modularity Q is defined to judge the quality of a division; its value lies between −1 and 1. The formula is as follows:
    $$Q=\frac{1}{2m}\sum_{i,j}\left[A_{ij}-\frac{k_{i}k_{j}}{2m}\right]\delta\left(c_{i},c_{j}\right),$$
2. Initially, each node forms its own community, so the current network contains as many communities as nodes; the modularity is calculated at this point.
3. For each node i, each neighbor j is considered, and the modularity gain obtained by removing i from its current community and adding it to j's community is evaluated. Node i is moved to the community with the largest gain, provided that gain is greater than 0; if the gain for every candidate community is less than or equal to 0, the node is not transferred. This process is applied to all nodes repeatedly and sequentially until there is no further improvement, at which point this step ends. The modularity gain is calculated as follows:
    $$\Delta Q=\left[\frac{\sum_{in}+k_{i,in}}{2m}-\left(\frac{\sum_{tot}+k_{i}}{2m}\right)^{2}\right]-\left[\frac{\sum_{in}}{2m}-\left(\frac{\sum_{tot}}{2m}\right)^{2}-\left(\frac{k_{i}}{2m}\right)^{2}\right],$$
4. The communities obtained in the previous step are taken as nodes, and a new network is constructed. The weight of the edge between two nodes is the sum of the weights of all edges between the corresponding two communities in the original network; each node has a self-loop whose weight is the sum of the connections among the original nodes in its community. Step 3 is then repeated on the new network until there are no further gain updates, and the algorithm ends.
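
The modularity Q that drives the division can be computed directly from its definition. The sketch below evaluates Q for a toy graph of two triangles joined by a bridge edge; the graph and community labels are illustrative, not data from the paper.

```python
import numpy as np

def modularity(A, communities):
    """Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j)."""
    m2 = A.sum()                              # equals 2m for an undirected graph
    k = A.sum(axis=1)                         # node degrees
    c = np.asarray(communities)
    same = (c[:, None] == c[None, :])         # delta(c_i, c_j)
    return float(((A - np.outer(k, k) / m2) * same).sum() / m2)

# Two triangles {0,1,2} and {3,4,5} joined by the edge (2,3)
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

# The natural split into the two triangles scores well above a random split.
print(round(modularity(A, [0, 0, 0, 1, 1, 1]), 4))  # 0.3571
print(round(modularity(A, [0, 1, 0, 1, 0, 1]), 4))  # -0.2143
```

The Louvain algorithm described above greedily searches for the labeling that maximizes this quantity.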

#### 2.4.2. Learning and Classification of the Node Representation Vector

1. The node representation vector is learned using the graph neural network. Each subgraph passes through multiple graph convolution layers so that the nodes acquire deep semantic information, and the representation vector of each layer is described by the following formulas:
    $$L_{l}=\mathrm{cat}\left(H_{0}^{l},\dots,H_{i}^{l}\right),\quad l\in\left[0,num\_gcn\right],\ i=num\_subgraph,$$
    $$H_{i}^{l+1}=\mathrm{GNN}\left(A_{i},H_{i}^{l}\right).$$
2. A classifier is built on the learned representation vectors: an MLP predicts an output for each graph convolution layer, and, because graph convolutions of different depths capture different information, the model assigns a learnable weight to each layer's output. The representation vectors can be utilized more efficiently in this way, and the precise formula is as follows:
    $$out=\sum_{j=0}^{n}w_{j}\,\mathrm{MLP}_{j}\left(L_{j}\right),\quad n=num\_gcn,$$
3. The pseudocode of the method, provided below as Algorithm 1, presents the complete process by which each node learns a representation vector and predictions are generated.

**Algorithm 1:** Graph neural network learning and prediction

**Input:** The graph structure $G=\left\{{G}_{1},{G}_{2},\dots ,{G}_{n}\right\}$, where ${G}_{i}=\left({A}_{i},{X}_{i}\right)$; ${A}_{i}$ and ${X}_{i}$ represent the adjacency matrix of the edges and the software defect metrics of the nodes of subgraph $i$, respectively.

**Output:** The prediction result pred of the nodes.

1. **for** i **in** num_layer **do** // num_layer: the number of graph convolutional layers; num_subgraph: the number of subgraphs
2. &emsp;L = 0;
3. &emsp;**for** j **in** num_subgraph **do**
4. &emsp;&emsp;Put the subgraph ${G}_{j}=\left({A}_{j},{X}_{j}\right)$ into the graph convolution layer to learn the representation vector of the nodes;
5. &emsp;**end for**
6. &emsp;L = predicted result of MLP;
7. &emsp;pred += W × L;
8. **end for**
9. **return** pred
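
Algorithm 1 can be sketched as a plain numpy forward pass: each layer convolves every subgraph, concatenates the node representations, and adds a weighted per-layer MLP prediction to the output. All weights here are random placeholders standing in for trained parameters, and the mean-aggregation convolution is a simplified stand-in for the GCN/GIN layers of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def gnn_layer(A, H, W):
    """Simplified graph convolution: mean over self + neighbors, then ReLU."""
    A_hat = A + np.eye(A.shape[0])
    A_hat = A_hat / A_hat.sum(axis=1, keepdims=True)   # row-normalize
    return np.maximum(A_hat @ H @ W, 0.0)

def forward(subgraphs, num_layers=2, dim=4, num_classes=2):
    """Per-layer subgraph convolution, concatenation L_l = cat(H_0^l, ..., H_i^l),
    per-layer MLP prediction, and a weighted sum over layers (Algorithm 1)."""
    Ws = [rng.standard_normal((dim, dim)) for _ in range(num_layers)]
    mlps = [rng.standard_normal((dim, num_classes)) for _ in range(num_layers)]
    layer_w = np.ones(num_layers) / num_layers         # learnable in the real model
    Hs = [X for _, X in subgraphs]                     # initial node features
    pred = 0.0
    for l in range(num_layers):
        Hs = [gnn_layer(A, H, Ws[l]) for (A, _), H in zip(subgraphs, Hs)]
        L = np.concatenate(Hs, axis=0)                 # L_l
        pred = pred + layer_w[l] * (L @ mlps[l])       # pred += w_l * MLP_l(L_l)
    return pred

# Two toy subgraphs (adjacency, 4-dimensional node features)
g1 = (np.array([[0., 1.], [1., 0.]]), rng.standard_normal((2, 4)))
g2 = (np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]]),
      rng.standard_normal((3, 4)))
out = forward([g1, g2])
print(out.shape)  # (5, 2): one class-score pair per node across both subgraphs
```

The final prediction for each node would be the argmax over its two class scores.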

## 3. Simulation Experiments

#### 3.1. Experimental Environment and Datasets

#### 3.2. Evaluation Measures

#### 3.3. Experimental Setup

#### 3.4. Experimental Procedure

1. We focus on within-project software defect prediction, so the training and testing data are derived from the same dataset. For example, when experimenting with the Ant dataset, the training set is selected from it and the test set is drawn from the remainder of the same dataset.
2. The small number of defective classes causes the trained model to favor the non-defective classes; therefore, class imbalance processing is applied to the entire dataset before training the model.
3. To better estimate algorithm performance, tenfold cross-validation is used.
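
The tenfold cross-validation step can be sketched with the standard library alone: shuffle the sample indices, cut them into ten disjoint folds, and let each fold serve once as the test set. The fold count of 745 below matches the Ant dataset's node count; the seed is an arbitrary choice.

```python
import random

def ten_fold_indices(n, seed=42):
    """Split n sample indices into 10 disjoint folds for cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::10] for k in range(10)]

folds = ten_fold_indices(745)              # e.g., the 745 classes of Ant 1.7.0
for k, test in enumerate(folds):
    train = [i for f in folds[:k] + folds[k + 1:] for i in f]
    # train the model on `train`, evaluate on `test`;
    # the metrics averaged over the 10 rounds estimate performance
print(len(folds), sum(len(f) for f in folds))  # 10 745
```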

## 4. Results and Discussion

#### 4.1. Experimental Analysis of Graph Convolutional Neural Network Based on Spectral Domain

1. Comparing CBGCN with SVM and BP, our model outperformed the BP neural network and SVM on all evaluation measures for every dataset except Ant, where the BP neural network also scored lower than SVM. For individual datasets, the parameters of BP must be tuned specially to obtain the best performance, and since the CBGCN classifier has the same structure as the BP neural network, individual datasets likewise require adjusted network parameter settings. On average, however, CBGCN improved greatly; thus, it can be concluded that useful feature vectors can be learned by incorporating the spectral domain-based graph convolution method into this model.
2. Comparing CBGCN with GCN, the experimental results were very similar on the Lucene dataset, while the model improved greatly on the other datasets and the averages of the evaluation measures were higher; therefore, it can be concluded that the model framework of this paper is more suitable for software defect prediction.

#### 4.2. Experimental Analysis of Graph Convolutional Neural Network Based on Spatial Domain

1. Compared with BP, CBGIN improved on all datasets. It remained lower than SVM on the Ant project, but higher than CBGCN, demonstrating that the classifier network structure and graph convolution method settings can affect the outcomes; individual datasets require adjustment of the network hyperparameters. Overall, CBGIN improved substantially; thus, it can be inferred that this model can acquire valuable feature vectors by incorporating the spatial domain-based graph convolution method.
2. When CBGIN and GIN were compared, the model improved across all datasets. We can conclude that the model presented in this paper is better suited for predicting software defects.

## 5. Conclusions and Future Work

1. A more complex network can be constructed, for example, by considering developer information, and semantic information from the software code can be incorporated into the network.
2. The graph neural network can be further improved by combining it with a community detection algorithm.
3. The experiments in this paper considered within-project software defect prediction; in the future, cross-project software defect prediction can be considered.

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Meaning |
---|---|
SMOTE | Synthetic minority oversampling technique |
ENN | Extended nearest neighborhood algorithm |
BP | Backpropagation |
NLP | Natural language processing |
AST | Abstract syntax tree |
CNN | Convolutional neural network |
GNN | Graph neural network |
GCN | Graph convolutional neural network |
SCNN | Spectral CNN |
ChebNet | Chebyshev spectral CNN |
GraphSAGE | Graph sample and aggregate |
GAT | Graph attention network |
GIN | Graph isomorphism network |
MLP | Multilayer perceptron |
SVM | Support vector machine |
CBGCN | Community-based GCN |
CBGIN | Community-based GIN |
MCC | Matthews correlation coefficient |

## References

1. Gonzalez-Barahona, J.M.; Izquierdo-Cortazar, D.; Robles, G. Software Development Metrics with a Purpose. *Computer* **2022**, *55*, 66–73.
2. Liu, Y.; Sun, F.; Yang, J.; Zhou, D. Software Defect Prediction Model Based on Improved BP Neural Network. In Proceedings of the 2019 6th International Conference on Dependable Systems and Their Applications (DSA), Harbin, China, 3–6 January 2020; pp. 521–522.
3. Bashir, K.; Li, T.; Yahaya, M. A Novel Feature Selection Method Based on Maximum Likelihood Logistic Regression for Imbalanced Learning in Software Defect Prediction. *Int. Arab J. Inf. Technol.* **2020**, *17*, 721–730.
4. Goyal, S. Effective software defect prediction using support vector machines (SVMs). *Int. J. Syst. Assur. Eng. Manag.* **2022**, *13*, 681–696.
5. Farid, A.B.; Fathy, E.M.; Eldin, A.S.; Abd-Elmegid, L.A. Software defect prediction using hybrid model (CBIL) of convolutional neural network (CNN) and bidirectional long short-term memory (Bi-LSTM). *PeerJ Comput. Sci.* **2021**, *7*, e739.
6. Deng, J.; Lu, L.; Qiu, S. Software defect prediction via LSTM. *IET Softw.* **2020**, *14*, 443–450.
7. Šubelj, L.; Bajec, M. Community structure of complex software systems: Analysis and applications. *Phys. A Stat. Mech. Its Appl.* **2011**, *390*, 2968–2975.
8. Zhou, Y.; Zhu, Y.; Chen, L. Software Defect-Proneness Prediction with Package Cohesion and Coupling Metrics Based on Complex Network Theory. In Proceedings of the International Symposium on Dependable Software Engineering: Theories, Tools, and Applications, Guangzhou, China, 24–27 November 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 186–201.
9. Qu, Y.; Yin, H. Evaluating network embedding techniques' performances in software bug prediction. *Empir. Softw. Eng.* **2021**, *26*, 60.
10. Al-Andoli, M.N.; Tan, S.C.; Cheah, W.P.; Tan, S.Y. A Review on Community Detection in Large Complex Networks from Conventional to Deep Learning Methods: A Call for the Use of Parallel Meta-Heuristic Algorithms. *IEEE Access* **2021**, *9*, 96501–96527.
11. Euler, L. Solutio problematis ad geometriam situs pertinentis. *Comment. Acad. Sci. Petropolitanae* **1741**, *8*, 128–140.
12. Wheeldon, R.; Counsell, S. Power law distributions in class relationships. In Proceedings of the Third IEEE International Workshop on Source Code Analysis and Manipulation, Amsterdam, The Netherlands, 26–27 September 2003; pp. 45–54.
13. Newman, M.E.J.; Girvan, M. Finding and evaluating community structure in networks. *Phys. Rev. E* **2004**, *69*, 026113.
14. Blondel, V.D.; Guillaume, J.-L.; Lambiotte, R.; Lefebvre, E. Fast unfolding of communities in large networks. *J. Stat. Mech. Theory Exp.* **2008**, *2008*, P10008.
15. Gori, M.; Monfardini, G.; Scarselli, F. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005; pp. 729–734.
16. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The Graph Neural Network Model. *IEEE Trans. Neural Netw.* **2008**, *20*, 61–80.
17. Asif, N.A.; Sarker, Y.; Chakrabortty, R.K.; Ryan, M.J.; Ahamed, H.; Saha, D.K.; Badal, F.R.; Das, S.K.; Ali, F.; Moyeen, S.I.; et al. Graph Neural Network: A Comprehensive Review on Non-Euclidean Space. *IEEE Access* **2021**, *9*, 60588–60606.
18. Estrach, J.B.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and deep locally connected networks on graphs. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 14–16 April 2014.
19. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. *arXiv* **2016**, arXiv:1606.09375.
20. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017. Available online: https://openreview.net/forum?id=SJU4ayYgl (accessed on 18 August 2022).
21. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. *arXiv* **2017**, arXiv:1706.02216.
22. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018. Available online: https://openreview.net/forum?id=rJXMpikCZ (accessed on 18 August 2022).
23. Xu, K.; Hu, W.; Leskovec, J.; Jegelka, S. How Powerful are Graph Neural Networks? In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019. Available online: https://openreview.net/forum?id=ryGs6iA5Km (accessed on 18 August 2022).
24. Yang, S.; Gou, X.; Yang, M.; Shao, Q.; Bian, C.; Jiang, M.; Qiao, Y. Software Bug Number Prediction Based on Complex Network Theory and Panel Data Model. *IEEE Trans. Reliab.* **2022**, *71*, 162–177.
25. Van Rossum, G.; Drake, F.L. *Python 3 Reference Manual*; CreateSpace: Scotts Valley, CA, USA, 2009.
26. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In *Advances in Neural Information Processing Systems 32*; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035.
27. Jureczko, M.; Madeyski, L. Towards identifying software project clusters with regard to defect prediction. In Proceedings of the 6th International Conference on Predictive Models in Software Engineering, Timisoara, Romania, 12–13 September 2010; pp. 1–10.
28. Mani, I.; Zhang, I. kNN approach to unbalanced data distributions: A case study involving information extraction. In Proceedings of the Workshop on Learning from Imbalanced Datasets, ICML, Washington, DC, USA, 21–24 August 2003; pp. 1–7.
29. Zhang, Q.; Ren, J. Software-defect prediction within and across projects based on improved self-organizing data mining. *J. Supercomput.* **2022**, *78*, 6147–6173.
30. Goyal, S. Handling Class-Imbalance with KNN (Neighbourhood) Under-Sampling for Software Defect Prediction. *Artif. Intell. Rev.* **2022**, *55*, 2023–2064.

Datasets | Version | Number of Features | Number of Nodes | Number of Edges | Number of Defects | Defect Rate |
---|---|---|---|---|---|---|
Ant | 1.7.0 | 20 | 745 | 3961 | 166 | 0.2228 |
Camel | 1.6.0 | 20 | 965 | 4215 | 188 | 0.1948 |
Lucene | 2.4 | 20 | 340 | 1559 | 203 | 0.5970 |
Synapse | 1.2 | 20 | 256 | 1162 | 86 | 0.3359 |
Velocity | 1.6.1 | 20 | 229 | 1292 | 78 | 0.3406 |
Ivy | 2 | 20 | 352 | 2063 | 40 | 0.1136 |

Actual Label | Predicted: Defective | Predicted: Defect-Free |
---|---|---|
Defective | TP (true positive) | FN (false negative) |
Defect-free | FP (false positive) | TN (true negative) |
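
The evaluation measures follow directly from the confusion matrix above; the sketch below computes accuracy, F-measure, and MCC from the four counts. The example counts are hypothetical, chosen only to make the arithmetic easy to check.

```python
import math

def metrics(tp, fn, fp, tn):
    """Accuracy, F-measure, and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return accuracy, f_measure, mcc

# Hypothetical counts: 40 defective hit, 10 missed, 10 false alarms, 40 correct rejections
acc, f1, mcc = metrics(tp=40, fn=10, fp=10, tn=40)
print(acc, round(f1, 2), round(mcc, 2))  # 0.8 0.8 0.6
```

MCC is the most robust of the three under class imbalance, since it uses all four cells of the matrix.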

Parameter | Setting |
---|---|
Number of iterations | 10,000 |
Learning rate | 0.0001 |
Weight decay | 5 × 10^{−4} |
Loss function | CrossEntropyLoss |
Optimizer | Adam |
Activation function | ReLU |

Algorithm | Number of Graph Convolution Layers | Change of Vector Dimension | Number of Layers of Classifier | Change of Vector Dimension |
---|---|---|---|---|
BP | 0 | None | 4 | 20, 10, 10, 2 |
GCN | 2 | 20, 16, 2 | 0 | None |
CBGCN | 4 | 20, 20, 20, 20, 20 | 4 | 20, 10, 10, 2 |
GIN | 4 | 20, 20, 20, 20, 20 | 2 | 20, 2 |
CBGIN | 4 | 20, 20, 20, 20, 20 | 2 | 20, 2 |

Dataset | Evaluation Measures | SVM | BP | GCN | CBGCN |
---|---|---|---|---|---|
Ant | Accuracy | 0.9091 | 0.8794 | 0.7000 | 0.8735 |
 | F-measure | 0.8942 | 0.8656 | 0.7178 | 0.8629 |
 | MCC | 0.8208 | 0.7614 | 0.4218 | 0.7553 |
Camel | Accuracy | 0.8081 | 0.8658 | 0.7263 | 0.8974 |
 | F-measure | 0.7546 | 0.8655 | 0.7440 | 0.8965 |
 | MCC | 0.6682 | 0.7309 | 0.4589 | 0.7985 |
Lucene | Accuracy | 0.5704 | 0.6519 | 0.7370 | 0.7296 |
 | F-measure | 0.6054 | 0.6431 | 0.7301 | 0.7214 |
 | MCC | 0.1496 | 0.2955 | 0.4751 | 0.4598 |
Synapse | Accuracy | 0.7529 | 0.7667 | 0.7833 | 0.8833 |
 | F-measure | 0.6727 | 0.7550 | 0.7971 | 0.8849 |
 | MCC | 0.5526 | 0.5564 | 0.5681 | 0.7748 |
Velocity | Accuracy | 0.8000 | 0.8812 | 0.6562 | 0.9000 |
 | F-measure | 0.7290 | 0.8784 | 0.6651 | 0.9007 |
 | MCC | 0.6610 | 0.7631 | 0.3003 | 0.7973 |
Ivy | Accuracy | 0.9000 | 0.8000 | 0.7750 | 0.9125 |
 | F-measure | 0.8489 | 0.6705 | 0.7094 | 0.8820 |
 | MCC | 0.8030 | 0.5393 | 0.5429 | 0.8224 |

Dataset | Evaluation Measures | SVM | BP | GIN | CBGIN |
---|---|---|---|---|---|
Ant | Accuracy | 0.9091 | 0.8794 | 0.8912 | 0.8853 |
 | F-measure | 0.8942 | 0.8656 | 0.8833 | 0.8712 |
 | MCC | 0.8208 | 0.7614 | 0.7869 | 0.7790 |
Camel | Accuracy | 0.8081 | 0.8658 | 0.8842 | 0.8921 |
 | F-measure | 0.7546 | 0.8655 | 0.8877 | 0.8858 |
 | MCC | 0.6682 | 0.7309 | 0.7754 | 0.7954 |
Lucene | Accuracy | 0.5704 | 0.6519 | 0.7074 | 0.7519 |
 | F-measure | 0.6054 | 0.6431 | 0.7075 | 0.7295 |
 | MCC | 0.1496 | 0.2955 | 0.4124 | 0.5121 |
Synapse | Accuracy | 0.7529 | 0.7667 | 0.7889 | 0.8722 |
 | F-measure | 0.6727 | 0.7550 | 0.7838 | 0.8715 |
 | MCC | 0.5526 | 0.5564 | 0.5890 | 0.7554 |
Velocity | Accuracy | 0.8000 | 0.8812 | 0.8812 | 0.9250 |
 | F-measure | 0.7290 | 0.8784 | 0.8925 | 0.9218 |
 | MCC | 0.6610 | 0.7631 | 0.7640 | 0.8514 |
Ivy | Accuracy | 0.9000 | 0.8000 | 0.8875 | 0.9250 |
 | F-measure | 0.8489 | 0.6705 | 0.8559 | 0.8737 |
 | MCC | 0.8030 | 0.5393 | 0.7821 | 0.8378 |


© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Cui, M.; Long, S.; Jiang, Y.; Na, X.
Research of Software Defect Prediction Model Based on Complex Network and Graph Neural Network. *Entropy* **2022**, *24*, 1373.
https://doi.org/10.3390/e24101373
