Entropy
  • Article
  • Open Access

4 December 2025

Fusing Semantic and Structural Features for Code Error Detection

1. National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100101, China
2. Department of Communication Engineering, Jilin University, Changchun 130012, China
3. School of Computer and Big Data, Heilongjiang University, Harbin 150080, China
4. Changchun Changguang Orion Optoelectronic Technology Co., Ltd., Changchun 130033, China
This article belongs to the Special Issue Rethinking Representation Learning in the Age of Large Models

Abstract

Transformer-based Large Language Models show great promise for automated code error detection owing to their strength in processing sequential data. Nevertheless, their efficacy is limited by an inherent weakness in handling structural code dependencies. In response, we introduce a novel model that integrates the semantic comprehension power of RoBERTa with the structural learning strength of Graph Neural Networks. The model targets three of the most common categories of programming faults: runtime errors, index errors, and import/module errors. Experimental evaluation demonstrates that the hybrid model, equipped with a proper fusion technique, outperforms other models in both accuracy and robustness, improving test accuracy by 1.75% over a competitive baseline.

1. Introduction

Code error detection is a key component of software development: it helps developers quickly identify and fix errors, thereby improving software quality and efficiency. However, conventional approaches cannot keep up with the scale and diversity of large codebases. Static code analysis and dynamic debugging are both subject to severe limitations. Static analysis checks syntax and semantics but cannot catch runtime faults such as memory leaks or concurrency bugs. Dynamic debugging, on the other hand, identifies runtime faults more reliably but is time-consuming and inefficient, and therefore not feasible for large projects. Consequently, Machine Learning (ML) and Deep Learning (DL) approaches have been proposed as viable alternatives for improving fault detection. In particular, Natural Language Processing (NLP) models such as BERT [1] and GPT [2], as well as other pre-trained models, can understand the semantics of code much as they understand human language. Through training on large-scale code repositories, these models learn to recognize patterns, relationships, and structures in code, thereby identifying both simple and complex errors. While promising, they still struggle with long-range dependencies and the dynamic behavior of large-scale systems. We therefore need novel methods that bring together multiple modalities to better tackle code fault detection in real-world software engineering.
In response to the limitations of traditional techniques, numerous studies have explored the use of deep learning in code analysis tasks. Early approaches borrowed architectures from NLP, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), to learn semantic features and contextual patterns from source code [3,4]. Models like DeepBugs [5] and BugBuster [6] demonstrated that neural networks can detect common programming mistakes and even binary-level errors by analyzing large-scale code data.
More recently, the rise of pre-trained Large Language Models (LLMs) such as BERT [1], RoBERTa [7], and CodeBERT [8] has significantly improved performance on code understanding and error detection tasks. These models learn syntactic and semantic information through large-scale self-supervised training. To further enhance model effectiveness, recent studies have shown that using LLMs and Graph Neural Networks (GNNs) in conjunction is highly effective in tasks such as text classification, leveraging the synergy of semantic and structural reasoning.
This effectiveness carries over to code analysis, where semantic context and structural patterns likewise play a decisive role. For example, the WitheredLeaf framework [9] leverages GPT-4 [2] alongside static analysis tools, underscoring the importance of extracting rich structural and semantic information from code for tasks such as entity inconsistency detection.
Building on this foundation, this paper proposes a novel model that combines RoBERTa [7] with the Graph Isomorphism Network (GIN) for code error detection. By combining multiple levels of structural and semantic information, the model can capture global code dependencies as well as local context: GIN models the structural relationships among code snippets, while RoBERTa encodes token-level semantic representations. With these features fused, the model gains the ability to detect a wide range of code issues, including semantic inconsistencies, logical errors, and potential security vulnerabilities.
To summarize, the main contributions of this work are as follows:
  • We propose a novel framework that integrates an LLM and a GNN for processing code and detecting multiple types of errors. This method fully utilizes the powerful semantic understanding of the LLM and the structural reasoning of the GNN, simultaneously capturing the semantic context and structural patterns in code and achieving accurate detection of diverse error types.
  • We systematically study different fusion strategies for GNNs and LLMs and analyze in depth how each fusion method affects the model's structural representation ability. The results indicate that an effective fusion mechanism is key to exploiting the advantages of semantic and structural information and to improving code analysis.
  • We introduce the PyTraceBugs dataset for code error classification and verify through extensive experiments that the proposed cross-modal fusion model outperforms other models, including one that relies solely on RoBERTa. The experimental results show that a reasonable fusion strategy improves the model's understanding of code information.
Experimental results demonstrate that the model with an appropriate fusion strategy not only outperforms traditional models but also provides a scalable and robust solution for error detection in complex, real-world code scenarios.
The remainder of the paper is organized as follows. Section 2 reviews related work on code error detection, LLMs, GNNs, and their integration. Section 3 details the proposed RoBERTa-with-GNN model. Section 4 presents the experimental settings, results, and analysis. Section 5 discusses findings and limitations, and Section 6 concludes the paper.

2. Related Work

This section reviews prior work related to code error detection, focusing on approaches utilizing LLMs, GNNs, and hybrid methods that integrate both semantic and structural representations.

2.1. Code Error Detection

In the field of Natural Language Processing, machine learning, especially deep learning, has achieved remarkable success. Traditional text processing techniques, including rule-based models and shallow machine learning methods, can manage simple tasks but struggle with complex semantics and long-distance dependencies. Machine learning has made significant advances in tasks such as text classification, sentiment analysis, machine translation, and question answering. For example, CNNs have shown outstanding performance in text classification by effectively capturing local features within sentences [10]. In sentiment analysis, RNNs have been applied to process syntactic structures and successfully identify sentiment [11]. In machine translation, the introduction of attention mechanisms greatly improved the quality of neural machine translation models, particularly in aligning source and target languages [12]. In addition, deep learning has enabled question-answering systems to learn from large knowledge bases and effectively answer open-domain questions [13].
Traditional methods, such as static analysis and dynamic debugging, face clear limitations. Static analysis inspects code without executing it and detects syntax and semantic errors, but fails to catch runtime faults such as memory leaks or concurrency problems [14]. Dynamic debugging is inefficient and expensive, particularly in large-scale projects [15]. Both often fail to detect complex logical errors such as deadlocks or race conditions [16]. Deep learning methods, particularly neural network-based models, can automatically learn complex code patterns and semantic information from large-scale code data. For example, studies have shown that Deep Neural Networks (DNNs) [17], CNNs [18,19], and RNNs [20,21] can be used for code error detection, enhancing error recognition through large-scale code analysis [3,4]. In recent years, an increasing number of deep learning models have been applied to code error detection tasks. For instance, DeepBugs [5] identifies common errors in code, such as null pointer exceptions and array out-of-bounds errors, by analyzing patterns in code variable names with neural networks. Meanwhile, BugBuster [6] leverages neural networks to learn deep semantic information from code, effectively detecting binary-level errors and improving automated error recognition.

2.2. LLMs in Code Error Detection

In recent years, LLMs have become widely popular for their ability to process and generate human-like text across various domains. Prominent examples include GPT-4 [2], which excels at natural language generation and reasoning; Claude [22], developed by Anthropic with an emphasis on safety and contextual understanding; PaLM [23], a powerful Google model geared toward a variety of natural language tasks; and LLaMA [24], an efficient open-source model optimized for research and practical applications. As a result, LLMs have made significant progress in code understanding and error detection. Pre-trained models based on the Transformer architecture [25], such as BERT [1], RoBERTa [7], and CodeBERT [8], capture both syntactic and semantic features of code through large-scale pretraining, enabling their effective application to code error detection, fault localization, and other tasks. CodeBERT [8] considers both the language features of code and natural language understanding during pretraining, demonstrating excellent performance in error detection tasks across multiple programming languages.
RoBERTa [7], an improved version of BERT, enhances performance in code error detection by leveraging larger datasets and optimized pretraining strategies. Research by Alrashedy [26] further demonstrates that LLMs outperform traditional methods in binary error classification tasks. Despite their strengths in capturing semantic information and contextual representation, LLMs remain challenged by complex reasoning over structural dependencies and by modeling intricate relationships such as function call and control flow dependencies.

2.3. GNNs in Code Error Detection

GNNs [27] model code based on graph structures [28] and learn structural dependencies such as function call relationships, data dependencies, and control flow, which play critical roles in the detection of complex errors. Compared to traditional sequence-based models (e.g., LSTMs [21] and Transformers [25]), GNNs effectively represent the structural information of code through nodes (code elements, such as variables and functions) and edges (representing relationships such as data dependencies and control flow). This enables GNNs to demonstrate significant strengths in handling complex code logic, particularly in vulnerability detection, error localization, and code optimization.
The Devign method proposed by Zhou et al. [29], which combines GCN [30] with Abstract Syntax Trees (ASTs) [31,32], effectively identifies vulnerabilities in programs. This line of research shows that GNNs can learn automatically from code structure and improve error detection accuracy through the propagation mechanism in graphs. Related studies include Briem et al. [33], who proposed using Program Dependency Graphs (PDGs) [34] for advanced bug detection, and Lam et al. [35], who combined Control Flow Graphs (CFGs) [36] to improve bug localization in programs. These works indicate that graph neural networks enable models to handle the complex structure and logic of code, thereby improving the efficiency and accuracy of error detection.

2.4. Integration of LLMs and GNNs

While LLMs and GNNs each have particular strengths in code error detection, each also has its own weaknesses. LLMs are strong at representing semantic information but poor at encoding complex structural relationships, whereas GNNs represent structural information well but fall short in capturing the contextual and semantic nature of code. Consequently, recent studies have examined combining LLMs and GNNs to leverage the strengths of both, thereby improving accuracy and comprehensiveness in error detection.
For example, combining LLMs and GNNs has shown remarkable performance in text processing tasks by successfully applying semantic comprehension and structural reasoning together. Likewise, works such as the WitheredLeaf framework proposed by Chen et al. [9] emphasize the importance of extracting structured information during code analysis. WitheredLeaf leverages GPT-4 alongside static analysis tools and lightweight code completion models to detect entity inconsistency bugs, highlighting how modular design and structural insights enhance error detection. These works illustrate that integrating LLMs and GNNs for code error classification is promising, since the interaction of semantic and structural information overcomes the limitations of single models and provides more comprehensive solutions for codebase error evaluation and detection.

3. Methodology

3.1. Network Overview

To classify code errors effectively, this study proposes a model that combines the semantic information extracted by the RoBERTa model with the structural information extracted by a GNN. Using both kinds of features allows the model to learn about both the structure and the content of the code. The architecture of the model is shown in Figure 1.
Figure 1. The architecture of the model. The LLM-driven syntax feature extractor is shown in gray, the GNN-driven structure feature extractor in light green, and the cross-modal fusion module in light orange.

3.1.1. Semantic Feature Extraction with RoBERTa

RoBERTa, a pre-trained language model based on the Transformer architecture, is employed to extract semantic features from code snippets. A given code snippet $x_i$ is first tokenized with RoBERTa's Byte Pair Encoding (BPE) tokenizer to form an input token sequence. This sequence is then processed by the RoBERTa model to produce a semantic embedding $v_i \in \mathbb{R}^d$, where $d$ is the embedding dimension:

$$v_i = \mathrm{RoBERTa}(x_i).$$

This semantic vector $v_i$ captures contextual relationships within the code snippet, enabling the model to identify errors based on subtle differences in syntax and semantics. The semantic representation of the entire snippet is obtained by aggregating the model's understanding of each token, forming an abstract expression of the snippet's overall meaning.
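For illustration, the following is a minimal sketch of this semantic branch, assuming the public `roberta-base` checkpoint from HuggingFace Transformers and mean pooling over token states; the paper does not specify the exact checkpoint or pooling strategy, so both are assumptions.

```python
# Minimal sketch of the semantic branch (assumed checkpoint and pooling).
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")  # BPE tokenizer
model = RobertaModel.from_pretrained("roberta-base")
model.eval()

def semantic_embedding(code_snippet: str) -> torch.Tensor:
    """Return the d-dimensional semantic vector v_i for one code snippet."""
    inputs = tokenizer(code_snippet, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Aggregate the per-token hidden states into one snippet-level vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)  # shape (768,)

v_i = semantic_embedding("def first(xs):\n    return xs[0]")
```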

3.1.2. Structural Feature Extraction with GNN

To address the limitations of semantic models in capturing code structure, this study employs a Graph Neural Network, specifically the Graph Isomorphism Network (GIN).
Code snippets are transformed into graph representations $G_i = (V_i, E_i)$, where $V_i$ denotes nodes (e.g., functions, variables, modules) and $E_i$ denotes edges (e.g., dependencies, call relationships).
In this study, the dataset does not provide explicit graph structures. Therefore, to derive a graph representation suitable for GNN processing, we construct a lightweight context-based graph for each batch of code snippets. Specifically, given a batch $(x_1, x_2, \dots, x_m)$, each snippet is treated as a node, and edges are created between nodes that appear within a fixed contextual window:

$$E = \{(k, j) : |k - j| \le 5,\ j \neq k\}.$$

This sliding-window strategy captures local structural and stylistic proximity among code snippets, reflecting the intuition that adjacent fragments often share related patterns. To ensure that each node preserves its original information during message passing, self-loop edges $(k, k)$ are added for all nodes.
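The following sketch builds this batch-level context graph; the edge-index layout follows PyTorch Geometric's (2, num_edges) convention, and the function name is ours.

```python
# Sketch of the batch-level context graph: each snippet in the batch is a
# node, directed edges connect nodes within a window of 5 positions, and
# self-loops are added so each node keeps its own information.
import torch

def build_context_graph(batch_size: int, window: int = 5) -> torch.Tensor:
    src, dst = [], []
    for k in range(batch_size):
        for j in range(batch_size):
            if j != k and abs(k - j) <= window:  # contextual-window edge
                src.append(k)
                dst.append(j)
        src.append(k)                            # self-loop (k, k)
        dst.append(k)
    return torch.tensor([src, dst], dtype=torch.long)

edge_index = build_context_graph(batch_size=16)  # shape (2, num_edges)
```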
The GIN aggregates features from each node and its neighbors through multiple layers of graph convolutions, generating a structural embedding $h_i \in \mathbb{R}^d$:

$$h_i = \mathrm{GIN}(G_i).$$
To enhance the accuracy of GNN-based structural information extraction, this study preprocesses the textual content of the code snippets using the Term Frequency–Inverse Document Frequency (TF-IDF) [37] method. TF-IDF measures the importance of a specific feature (e.g., a word or symbol) within a code snippet. For a code snippet $x_i$, the TF-IDF value of the $j$-th feature is defined as

$$\mathrm{TF\text{-}IDF}(x_i, t_j) = \mathrm{TF}(t_j, x_i) \cdot \log \frac{N}{\mathrm{DF}(t_j)},$$

where $\mathrm{TF}(t_j, x_i)$ is the term frequency of $t_j$ in code snippet $x_i$, $\mathrm{DF}(t_j)$ is the number of code snippets containing $t_j$, and $N$ is the total number of code snippets.
The initial feature vector $\tilde{v}_i$ generated by TF-IDF serves as the attribute of node $V_i$ in the GNN model. The GIN then aggregates these node features with neighbor information, updating each node's representation:

$$h_i^{(l+1)} = \mathrm{MLP}\left((1 + \epsilon) \cdot h_i^{(l)} + \sum_{j \in N(i)} h_j^{(l)}\right),$$

where $h_i^{(l)}$ is the feature of node $i$ at layer $l$, $N(i)$ is the set of neighbors of node $i$, $\epsilon$ is a learnable weight parameter, and $\mathrm{MLP}$ is a Multi-Layer Perceptron for feature mapping.
Finally, the structural embedding $h_i$ generated by GIN encodes the dependency relationships, call structures, and other structural information in the code, providing critical support for the subsequent classification task.
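Putting the two steps together, the sketch below derives TF-IDF node features and passes them through GIN layers, reusing the `build_context_graph` helper above. The two-layer depth, hidden size, and scikit-learn/PyTorch Geometric tooling are our assumptions.

```python
# Sketch of the structural branch: TF-IDF node features fed through GINConv
# layers (PyTorch Geometric); layer count and hidden size are illustrative.
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer
from torch_geometric.nn import GINConv

snippets = ["def first(xs): return xs[0]", "import os", "print(os.name)"]
tfidf = TfidfVectorizer(max_features=768)
x = torch.tensor(tfidf.fit_transform(snippets).toarray(), dtype=torch.float)

class GIN(nn.Module):
    def __init__(self, in_dim: int, hid_dim: int = 768):
        super().__init__()
        mlp1 = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                             nn.Linear(hid_dim, hid_dim))
        mlp2 = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU(),
                             nn.Linear(hid_dim, hid_dim))
        # train_eps=True makes epsilon the learnable weight of the update rule.
        self.conv1 = GINConv(mlp1, train_eps=True)
        self.conv2 = GINConv(mlp2, train_eps=True)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        h = self.conv1(x, edge_index).relu()
        return self.conv2(h, edge_index)  # structural embedding h_i per node

gin = GIN(in_dim=x.size(1))
h = gin(x, build_context_graph(len(snippets)))  # one node per snippet
```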

3.1.3. Feature Fusion and Classification

To fully utilize the semantic and structural information of the code, this study proposes a cross-modal fusion model that synergistically combines RoBERTa and GIN through Transformer-style attention. RoBERTa extracts contextual semantic representations of the source code, while GIN models graph-structural features such as dependencies and call structures.
In this study, cross-modal attention fusion achieves effective interaction between the semantic vectors output by RoBERTa and the structural vectors extracted by GIN by introducing the Multi-Head Attention (MHA) [25] mechanism.
The cross-modal attention is defined as

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,$$

where $Q = v_i$, $K = \tilde{h}_i$, and $V = \tilde{h}_i$. Applying multi-head attention gives

$$\hat{h}_i = \mathrm{MultiHead}(v_i, \tilde{h}_i, \tilde{h}_i),$$

where $x_i$ is the input code snippet, $v_i = \mathrm{RoBERTa}(x_i) \in \mathbb{R}^d$ is the semantic vector extracted by RoBERTa, $h_i = \mathrm{GIN}(x_i) \in \mathbb{R}^d$ is the structural vector extracted by GIN (after TF-IDF initialization), and $\tilde{h}_i = W_h h_i \in \mathbb{R}^d$ is the structural vector projected into the same dimensional space as $v_i$ through a linear layer.
In the implementation, the structural vector $h_i$ is first mapped to the same dimension as the RoBERTa output through a linear projection, yielding $\tilde{h}_i$. The semantic vector $v_i$ extracted by RoBERTa then serves as the Query, while the projected structural vector $\tilde{h}_i$ is fed into the multi-head attention module as both Key and Value. The attention mechanism dynamically assigns weights based on the similarity between Query and Key and computes a weighted sum of the Value vectors, producing the attention of the semantic modality over the structural modality. The resulting output $\hat{h}_i$ can be regarded as structural information filtered and reconstructed by the semantics: it extracts the structural features most relevant to the current semantic context. This mechanism not only breaks down the information barriers between modalities but also dynamically models the correlation between semantics and structure, achieving precise and controllable information fusion.
The final joint representation is obtained by concatenating $v_i$ and $\hat{h}_i$:

$$z_i = [v_i; \hat{h}_i] \in \mathbb{R}^{2d},$$

where $\hat{h}_i$ is the attention-enhanced structural representation aligned to the semantic context.
The concatenated vector is then passed through a fully connected layer followed by a softmax activation to classify the error type:

$$\hat{y}_i = \mathrm{Softmax}(W z_i + b),$$

where $W$ and $b$ are trainable parameters of the classifier.
By integrating semantic and structural information, the model achieves a more comprehensive understanding of the code, enabling it to detect errors with higher accuracy and robustness.
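A minimal sketch of this fusion head follows, assuming $d = 768$ and 8 attention heads (the head count is not given in the paper). The Query is the RoBERTa vector, the Key and Value are the linearly projected GIN vector, and the softmax is left to the loss function.

```python
# Sketch of the cross-modal fusion head: Query = semantic vector v_i,
# Key/Value = projected structural vector; the output is concatenated and
# classified. d = 768 and 8 heads are assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d: int = 768, num_classes: int = 3, heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(d, d)                      # h_i -> h~_i
        self.mha = nn.MultiheadAttention(d, heads, batch_first=True)
        self.classifier = nn.Linear(2 * d, num_classes)  # acts on z_i = [v_i; h^_i]

    def forward(self, v: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        h_tilde = self.proj(h).unsqueeze(1)              # (B, 1, d) Key/Value
        q = v.unsqueeze(1)                               # (B, 1, d) Query
        h_hat, _ = self.mha(q, h_tilde, h_tilde)         # cross-modal attention
        z = torch.cat([v, h_hat.squeeze(1)], dim=-1)     # (B, 2d) joint vector
        return self.classifier(z)                        # logits; softmax in loss

fusion = CrossModalFusion()
logits = fusion(torch.randn(16, 768), torch.randn(16, 768))  # toy batch of 16
```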

3.2. Loss Function and Class Imbalance Handling

To mitigate the class imbalance observed in the dataset, particularly the overrepresentation of index errors, we applied a weighted cross-entropy loss during model training. The weight assigned to each class $c_j$ was computed as

$$w_j = \frac{N}{k \cdot N_j}, \quad j = 1, 2, \dots, k,$$

where $N$ is the total number of training samples, $N_j$ is the number of samples in class $c_j$, and $k$ is the total number of classes. These weights were incorporated into the cross-entropy loss as follows:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} w_{y_i} \cdot \log P(y_i \mid x_i),$$

where $P(y_i \mid x_i)$ denotes the predicted probability of the true label $y_i$. This weighting strategy follows the approach of [38] and encourages the model to allocate greater importance to underrepresented classes during optimization.
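As a sketch, the weighting scheme maps directly onto the standard weighted cross-entropy loss in PyTorch; the class counts below are illustrative, not the actual PyTraceBugs statistics.

```python
# Sketch of the class weights w_j = N / (k * N_j) fed into a weighted
# cross-entropy loss; the per-class counts here are made-up examples.
import torch
import torch.nn as nn

counts = torch.tensor([5200.0, 9800.0, 3000.0])      # N_j per class (example)
N, k = counts.sum(), len(counts)
weights = N / (k * counts)                            # w_j = N / (k * N_j)

criterion = nn.CrossEntropyLoss(weight=weights)       # weighted CE loss
labels = torch.randint(0, 3, (16,))                   # toy true labels
loss = criterion(logits, labels)                      # logits from the head above
```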

4. Experiment

4.1. Experimental Setup

4.1.1. Dataset

This study utilizes the PyTraceBugs [39] dataset, which comprises real-world Python 3 code snippets and their corresponding error-type labels. The dataset is defined as

$$D = \{(x_i, y_i)\}_{i=1}^{N},$$

where $x_i$ is a code snippet and $y_i \in \{c_1, c_2, \dots, c_k\}$ is the associated error-type label. Here $k = 3$, covering runtime errors, index errors, and import/module errors.
To evaluate the model's performance, the dataset was divided into a training set ($D_{\mathrm{train}}$), a validation set ($D_{\mathrm{val}}$), and a test set ($D_{\mathrm{test}}$) satisfying

$$D_{\mathrm{train}} \cup D_{\mathrm{val}} \cup D_{\mathrm{test}} = D, \quad D_{\mathrm{train}} \cap D_{\mathrm{val}} = \emptyset, \quad D_{\mathrm{val}} \cap D_{\mathrm{test}} = \emptyset.$$

The dataset was allocated in proportions of 70% for training, 15% for validation, and 15% for testing:

$$p_{\mathrm{train}} = 0.7, \quad p_{\mathrm{val}} = 0.15, \quad p_{\mathrm{test}} = 0.15.$$
This partitioning ensures that the model can be optimized on the validation set and its generalization ability can be assessed on the test set.
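A minimal split sketch with scikit-learn is shown below; `snippets` and `labels` stand in for the PyTraceBugs samples, and stratification is our assumption to keep error-type proportions comparable across splits.

```python
# Sketch of the 70/15/15 split: first carve off 30%, then halve it into
# validation and test. Stratification by label is an assumption.
from sklearn.model_selection import train_test_split

X_train, X_tmp, y_train, y_tmp = train_test_split(
    snippets, labels, test_size=0.30, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)
# Resulting proportions: 70% train, 15% validation, 15% test.
```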

4.1.2. Implementation Details

In terms of model architecture, this work employs a pre-trained Transformer-based RoBERTa model to retrieve semantic information from code snippets, enabling the model to comprehend the semantic features in the code and perform an initial classification. On top of this, GIN is introduced to capture structural information, including dependency relationships and control flows within the code. The GIN forms a graph representation of the code and embeds these structural attributes into high-dimensional vectors. Finally, the semantic vectors extracted by RoBERTa are fused with the structural vectors produced by GIN through feature concatenation and passed into a fully connected classification layer to complete the prediction. To further enhance the transmission of structural information and address potential issues, a cross-modal fusion technique is integrated into the combined RoBERTa and GIN model, improving overall training stability and model performance.
Several key hyperparameters were tuned, including the learning rate ($1\times10^{-5}$, $2\times10^{-5}$, $3\times10^{-5}$, $5\times10^{-5}$), batch size (8, 16, 32), and weight decay (0.01, 0.05, 0.1). Each configuration was trained for a short validation cycle, and the best-performing values on the validation set were selected for the final model. Training and optimization use the AdamW [40] optimizer with a starting learning rate of $3\times10^{-5}$ and a weight decay of 0.05 to avoid overfitting. The batch size is set to 16 to maintain consistency and stability during both training and evaluation. Training is split into two stages: a warm-up phase of 10,000 steps, during which the learning rate is smoothly scaled up for stable convergence, followed by a regular training phase of at most 100,000 steps. To ensure the controllability and stability of the training process, logs are recorded every 500 steps, and model evaluations and checkpoints are saved every 10,000 steps. Additionally, mixed precision training is employed to improve computational efficiency and reduce memory consumption.
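This optimization setup translates into the sketch below; `model`, `loader`, and `criterion` stand for the full fusion network, the training data loader, and the weighted loss from Section 3.2, and the linear decay after warm-up is our assumption (the paper specifies only the warm-up).

```python
# Sketch of the training setup: AdamW (lr 3e-5, weight decay 0.05),
# 10,000 warm-up steps, 100,000 max steps, mixed precision training.
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=3e-5, weight_decay=0.05)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=100_000)
scaler = torch.cuda.amp.GradScaler()              # mixed precision scaling

for step, (batch_inputs, batch_labels) in enumerate(loader):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # mixed precision forward
        loss = criterion(model(batch_inputs), batch_labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()                              # warm-up then linear decay
```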

4.2. Evaluation Metrics

Model performance is measured with three main tools: accuracy, macro-averaged metrics, and loss function analysis. Accuracy assesses the overall correctness of the classification, whereas macro-averaged metrics measure the model's ability to classify under class imbalance and to detect all error types fairly, particularly minority classes. Loss function analysis is used to monitor optimization progress and convergence during training. Beyond these core evaluations, this study also analyzes the overall performance of the model under different fusion strategies.
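These metrics can be computed as in the following sketch, where `y_true` and `y_pred` are placeholders for the test labels and model predictions.

```python
# Sketch of the metric computation with scikit-learn.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

acc = accuracy_score(y_true, y_pred)                         # overall accuracy
macro_f1 = f1_score(y_true, y_pred, average="macro")         # macro-averaged F1
macro_p = precision_score(y_true, y_pred, average="macro")   # macro precision
macro_r = recall_score(y_true, y_pred, average="macro")      # macro recall
```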

4.3. Experimental Results

As shown in Figure 2 and Table 1, the RoBERTa baseline performs well overall, with a test accuracy of 69.35% and a Macro F1 score of 57.34%. After introducing the GNN, although the model's representational capacity was structurally enhanced, the overall accuracy decreased slightly and the Macro F1 score dropped significantly, by about 5.05%. This degradation may stem from noise introduced when structural features are not fully optimized in conjunction with semantic information.
Figure 2. Precision comparison across three error categories (RuntimeError, left; ModuleImportError, middle; IndexError, right) for four model configurations: RoBERTa, RoBERTa with GNN, RoBERTa with GNN (channel attention), and RoBERTa with GNN (cross-modal fusion). Each group of bars shows the precision achieved by the respective models on each error type. GNN: Graph Neural Network.
Table 1. Performance comparison of the four models.
In comparison, the cross-modal fusion model introduced in this paper combines semantic and structural features through an attention mechanism and surpasses all other models, achieving the best test accuracy of 71.06% while maintaining a Macro F1 score of 57.27%, which shows that cross-modal fusion improves the model's resilience to various kinds of errors.
The confusion matrices in Figure 3 show that the RoBERTa baseline performs well on RuntimeError but exhibits significant confusion between IndexError and ModuleImportError, indicating limited semantic ability to distinguish these two error types. The RoBERTa-with-GNN model severely misjudges IndexError, often predicting it as RuntimeError, suggesting that relying too heavily on structural features can weaken the capture of fine-grained semantic clues. In contrast, the model with the cross-modal fusion technique performs more evenly across error types, lowering the overall misclassification rate: it maintains high accuracy on RuntimeError while improving recognition of IndexError, verifying the effectiveness of fusing semantic and structural information.
Figure 3. Confusion matrices of the four models on the error-type classification task: RoBERTa (upper left), RoBERTa with GNN with direct fusion (upper right), RoBERTa with GNN with channel attention (bottom left), and RoBERTa with GNN with cross-modal fusion (bottom right). The horizontal axis shows the predicted categories and the vertical axis the true categories; the values along the diagonal indicate the number of correctly classified samples. GNN: Graph Neural Network.
Overall, confusion matrix analysis indicates that semantic information plays a key role in distinguishing errors, while structural features can play a beneficial complementary role under the premise of reasonable fusion. The fusion model demonstrates stronger category discrimination ability and lower misclassification rate, further confirming the advantages of cross-modal representation learning.
To further evaluate the discriminative ability of the model, we used the one-vs-rest strategy to compute and compare the Receiver Operating Characteristic (ROC) curves and their corresponding Area Under the Curve (AUC) values for each error category (a minimal computation sketch follows Figure 4). As shown in Figure 4, across all categories the ROC curves of the model with cross-modal fusion are generally closer to the upper left corner of the plot, indicating a better trade-off between true positive rate and false positive rate. For the classification of RuntimeError, the AUC values of all four models are high, consistent with the previously reported F1 scores, suggesting that the models identify such errors with high confidence and consistency. In contrast, the AUC scores for IndexError and ModuleImportError are comparatively lower. Notably, the plain RoBERTa-with-GNN model achieves a much lower AUC on IndexError, consistent with its lower F1 scores. The improved AUC of the cross-modal fusion model in these categories shows that fusing semantic and structural representations enhances the model's discriminative confidence on more challenging classification tasks.
Figure 4. ROC curves for the three error types (IndexError, ModuleImportError, RuntimeError) comparing RoBERTa (light blue line), RoBERTa with GNN with direct fusion (light orange line), RoBERTa with GNN with channel attention (light purple line), and RoBERTa with GNN with cross-modal fusion (light green line). The cross-modal model shows the best or equal AUC across all categories. ROC: Receiver Operating Characteristic; AUC: Area Under the Curve; GNN: Graph Neural Network.
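The one-vs-rest computation referenced above can be sketched as follows, where `y_true` holds the test labels and `probs` the softmax scores of one model, with columns ordered by class index.

```python
# Sketch of the one-vs-rest ROC/AUC computation for each error category.
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

classes = [0, 1, 2]  # e.g., IndexError, ModuleImportError, RuntimeError
y_bin = label_binarize(y_true, classes=classes)        # one column per class
for c in classes:
    fpr, tpr, _ = roc_curve(y_bin[:, c], probs[:, c])  # this class vs. rest
    print(f"class {c}: AUC = {auc(fpr, tpr):.3f}")
```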
These findings suggest that semantic features play a dominant role in identifying easily recognizable errors such as RuntimeError, while structural features play a complementary role in classifying more challenging categories.

5. Discussion

In this study, the proposed cross-modal fusion model demonstrates clear advantages in Python error classification. Quantitatively, the model achieves 71.06% accuracy, outperforming the best baseline that uses only the LLM. Transformer-based approaches excel at semantic feature extraction but struggle with structural dependencies, while GNN-based approaches model structural relations effectively but cannot fully capture the natural language patterns embedded in error messages. The proposed cross-modal fusion mechanism leverages the complementary strengths of both modalities, reducing systematic misclassification. The confusion matrix analysis verifies this improvement: the model with cross-modal fusion reduces misclassification across all classes compared with models using other fusion techniques.
This study has several limitations that point toward promising directions for future research. First, the proposed model is trained and evaluated exclusively on the PyTraceBugs dataset. Since the model's performance is closely tied to the characteristics of this dataset, its adaptability to other error classification scenarios or other program analysis tasks remains uncertain. Future work should apply or fine-tune the model on additional datasets and related classification problems to better assess its robustness and generalization across diverse code environments. Second, the model relies heavily on high-quality labeled data for training. This dependence limits scalability and may hinder performance in low-resource settings. Future research could explore semi-supervised techniques to reduce the reliance on labeled data and improve generalization under limited supervision. Finally, future work may examine different graph neural network architectures on this dataset to better understand their structural modeling capabilities and suitability for code error classification, and may further investigate fusion techniques between GNNs and LLMs so that models can more effectively leverage complementary semantic and structural information.

6. Conclusions

In this article, we introduce a cross-modal fusion model for classifying errors in code. The model fuses the semantic features extracted by the RoBERTa model with the code structure information modeled by GNNs, aiming to capture both the natural language patterns and the structural dependencies in code and thereby improve the identification of error types.
We conducted extensive experiments on Python error datasets to validate the effectiveness of the proposed model. The results show that it achieves the highest performance among all evaluated baselines, reaching 71.06% test accuracy and a 57.27% macro-F1 score; it outperforms the best single-modality model by 1.75% and surpasses all other fusion-based approaches. The model demonstrates higher discriminative ability, especially on semantically similar and difficult-to-distinguish error types such as IndexError and ModuleImportError. Analysis of the confusion matrices and ROC-AUC curves shows that cross-modal fusion not only improves the model's ability to distinguish various code errors but also reduces systematic misclassification, indicating that the fusion of semantic and structural information has a positive impact on robustness and generalization.
This study suggests that a well-designed fusion mechanism is crucial for achieving synergistic gains between LLMs and GNNs: only with specific, appropriate fusion methods does the combined model gain an advantage in code error detection. The design of the fusion strategy should fully account for the differences in structure, semantics, and information transmission between the two types of models, promoting complementary and adaptive features.

Author Contributions

Conceptualization, Y.Z. and J.M.; methodology, Y.Z.; software, Y.Z., W.L., and J.C.; validation, Y.Z., W.L., F.J., J.M. and J.C.; formal analysis, F.J.; data curation, Y.Z., F.J. and J.M.; Writing—original draft, Y.Z.; Writing—review and editing, Y.Z., W.L., F.J., J.M. and J.C.; Supervision, J.M.; Funding acquisition, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This study received funding from Changchun Changguang Orion Optoelectronic Technology Co., Ltd. Author Jingtai Cao from this company guided the research direction, supervised the completion of the manuscript, and participated in the revision and finalization of the article.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy and confidentiality restrictions of the laboratory.

Conflicts of Interest

Jingtai Cao is an employee of Changchun Changguang Orion Optoelectronic Technology Co., Ltd. The paper reflects the views of the scientists and not the company. The authors declare no conflicts of interest.

References

  1. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  2. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  3. Ray, B.; Posnett, D.; Filkov, V.; Devanbu, P. A Large Scale Study of Programming Languages and Code Quality in GitHub. In Proceedings of the 22nd ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE), Seattle, WA, USA, 13–18 November 2016; Volume 22, pp. 1–12. [Google Scholar]
  4. White, M.; Vendome, C.; Linares-Vásquez, M.; Poshyvanyk, D. Toward Deep Learning Software Repositories. In Proceedings of the 12th Working Conference on Mining Software Repositories (MSR), Florence, Italy, 16–17 May 2015; Volume 12, pp. 1–10. [Google Scholar]
  5. Pradel, M.; Sen, K. DeepBugs: A Learning Approach to Name-Based Bug Detection. Proc. ACM Program. Lang. 2018, 2, 1–25. [Google Scholar] [CrossRef]
  6. Sun, C.; Huang, X.; Lo, D.; Liu, X. Bugbuster: Identifying and Fixing Software Bugs Using Semantic Learning. In Proceedings of the 43rd International Conference on Software Engineering (ICSE), Virtually, 23–29 May 2021; Volume 43, pp. 1–15. [Google Scholar]
  7. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  8. Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 1536–1547. [Google Scholar]
  9. Chen, H.; Zhang, Y.; Han, X.; Rong, H.; Zhang, Y.; Mao, T.; Zhang, H.; Wang, X.; Xing, L.; Chen, X. WitheredLeaf: Finding Entity-Inconsistency Bugs with LLMs. arXiv 2024, arXiv:2405.01668. [Google Scholar]
  10. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Volume 1, pp. 1746–1751. [Google Scholar]
  11. Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.Y.; Potts, C. Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), Seattle, WA, USA, 18–21 October 2013; Volume 2, pp. 1631–1642. [Google Scholar]
  12. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; Volume 3, pp. 1–15. [Google Scholar]
  13. Chen, D.; Fisch, A.; Weston, J.; Bordes, A.; Buchwalter, W.; Weston, R.; Gardner, M. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, BC, Canada, 30 July–4 August 2017; Volume 55, pp. 1870–1879. [Google Scholar]
  14. Austin, T.; Smith, X.; Brown, Y. Static Analysis for Software Bug Detection: A Review. ACM Comput. Surv. 2018, 51, 1–45. [Google Scholar]
  15. Zhang, Y.; Zhao, J. Dynamic Debugging Techniques for Error Detection. Int. J. Softw. Eng. Appl. 2017, 8, 15–25. [Google Scholar]
  16. Jiang, L.; Zhou, Y. A Survey of Static Analysis Techniques for Software Bug Detection. ACM Comput. Surv. 2011, 43, 1–37. [Google Scholar]
  17. Hinton, G.E.; Osindero, S.; Teh, Y.W. A Fast Learning Algorithm for Deep Belief Nets. Neural Comput. 2006, 18, 1527–1554. [Google Scholar] [CrossRef]
  18. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
  19. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  20. Elman, J.L. Finding Structure in Time. Cogn. Sci. 1990, 14, 179–211. [Google Scholar] [CrossRef]
  21. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  22. Anthropic. Claude: An AI Assistant for Thoughtful Conversations. In Anthropic AI Blog; Anthropic: San Francisco, CA, USA, 2023. [Google Scholar]
  23. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Modeling with Pathways. arXiv 2022, arXiv:2204.02311. [Google Scholar] [CrossRef]
  24. Touvron, H.; Martin, J.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, R.; Bhosale, S.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  25. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  26. Alrashedy, K. Language Models are Better Bug Detectors Through Code-Pair Classification. arXiv 2023, arXiv:2311.07957. [Google Scholar]
  27. Scarselli, F.; Gori, M.; Tsoi, A.; Hagenbuchner, M.; Monfardini, G. The Graph Neural Network Model. IEEE Trans. Neural Netw. 2009, 20, 61–80. [Google Scholar] [CrossRef]
  28. Xu, K.; Hu, W.; Leskovec, J.; Jegelka, S. How Powerful are Graph Neural Networks? arXiv 2018, arXiv:1810.00826. [Google Scholar]
  29. Zhou, Y.; Liu, S.; Siow, J.; Du, X.; Liu, Y. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  30. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  31. Fraser, G.; Arcuri, A. EvoSuite: A Search-Based Unit Test Generation Tool for Java. In Proceedings of the 2012 ACM International Symposium on Software Testing and Analysis (ISSTA), Minneapolis, MN, USA, 15–20 July 2012; pp. 234–244. [Google Scholar]
  32. Aho, A.V.; Lam, M.S.; Sethi, R.; Ullman, J.D. Compilers: Principles, Techniques, and Tools, 2nd ed.; Addison-Wesley: Boston, MA, USA, 2006. [Google Scholar]
  33. Briem, J.A.; Smit, J.; Sellik, H.; Rapoport, P. Using Distributed Representation of Code for Bug Detection. arXiv 2019, arXiv:1911.12863. [Google Scholar] [CrossRef]
  34. Ferrante, J.; Ottenstein, K.J.; Warren, J.D. The Program Dependence Graph and Its Use in Optimization. ACM Trans. Program. Lang. Syst. 1987, 9, 319–349. [Google Scholar] [CrossRef]
  35. Lam, A.N.; Nguyen, A.T.; Nguyen, H.A.; Nguyen, T.N. Bug Localization with Combination of Deep Learning and Information Retrieval. In Proceedings of the 2017 IEEE/ACM 25th International Conference on Program Comprehension (ICPC), Buenos Aires, Argentina, 22–23 May 2017; pp. 218–229. [Google Scholar]
  36. Brockington, M.A. Control Flow Graphs for Flow of Control Analysis. ACM SIGPLAN Not. 1970, 5, 88–97. [Google Scholar]
  37. Salton, G.; Buckley, C. Term-Weighting Approaches in Automatic Text Retrieval. In Information Processing & Management; Elsevier: Amsterdam, The Netherlands, 1988; pp. 171–180. [Google Scholar]
  38. Zhang, C.; Yang, Q. A Generalized Cross-Entropy Loss for Imbalanced Classification. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 17–20 November 2018; pp. 800–809. [Google Scholar]
  39. Akimova, E.N.; Bersenev, A.Y.; Deikov, A.A.; Kobylkin, K.S.; Konygin, A.V.; Mezentsev, I.P.; Misilov, V.E. PyTraceBugs: A Large Python Code Dataset for Supervised Machine Learning in Software Defect Prediction. In Proceedings of the 2021 28th Asia-Pacific Software Engineering Conference (APSEC), Taipei, Taiwan, 6–9 December 2021; pp. 141–151. [Google Scholar]
  40. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
