Early-Stage Graph Fusion with Refined Graph Neural Networks for Semantic Code Search
Abstract
1. Introduction
1. We propose an early fusion strategy that integrates the AST, DDG, and CFG of code statements to construct a functional program graph, enhancing the representation of code features.
2. For code representation, we design IMAGNN, which constructs metapath-associated subgraphs to mitigate information loss and redundant aggregation, while substituting mean pooling for attention to reduce computational overhead without compromising representational fidelity on heterogeneous graphs.
3. Extensive empirical evaluation on the publicly available CodeSearchNet benchmark demonstrates that FPGraphCS attains statistically significant and consistent gains over state-of-the-art baselines, substantiating its superior semantic precision and robustness in code search.
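The mean-pooling substitution named in contribution 2 can be illustrated with a minimal sketch. The tensor layout here (metapath instances stacked along the first axis) is an assumption for illustration, not the authors' IMAGNN implementation: instead of learning attention weights over metapath instances, node features are simply averaged.

```python
import numpy as np

def mean_pool_metapath(instance_feats):
    """Aggregate features along a set of metapath instances by mean pooling.

    instance_feats: array of shape (num_instances, path_len, dim) -- a
    hypothetical layout; IMAGNN's actual data structures may differ.
    Returns a single vector of shape (dim,) for the target node.
    """
    # Average the node features within each metapath instance...
    per_instance = instance_feats.mean(axis=1)   # (num_instances, dim)
    # ...then average across instances, replacing learned attention weights.
    return per_instance.mean(axis=0)             # (dim,)

feats = np.ones((4, 3, 8))  # 4 metapath instances, path length 3, dim 8
vec = mean_pool_metapath(feats)  # -> array of shape (8,), all ones
```

Because the pooled result is a plain average, the aggregation costs no learned parameters, which is the source of the computational savings claimed over attention.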
2. Related Work
2.1. Text-Based Code Representation
2.2. Structural Feature-Based Code Representation
2.3. Heterogeneous Graph Representation Learning Methods
3. Methodology
3.1. Model Framework
3.2. Functional Program Graph
- V is the set of nodes, V = {v_1, v_2, ..., v_n}, with n = |V|. Each node represents a statement in the code.
- E is the set of directed edges, where each edge is represented as a triple (v_i, v_j, r), with v_i, v_j arbitrary nodes in V and r ∈ {c, d, s} denoting the type of the edge.
- r = c indicates that the edge type is a control relationship; r = d indicates that the edge type is a data dependency; r = s indicates that the edge type is an abstract syntax structure.
- If (v_i, v_j) is an edge of the CFG, then r = c, and the edge represents the CFG.
- If (v_i, v_j) is an edge of the DDG and not of the CFG, then r = d, and the edge represents the DDG.
- If (v_i, v_j) is an edge of the AST and of neither the CFG nor the DDG, then r = s, and the edge represents the AST.
Algorithm 1: Building the functional program graph.
Require: CFG, DDG, AST of the code snippet.
Ensure: Functional program graph G = (V, E).
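Algorithm 1 can be read as merging the three edge sets into a single typed edge list. The following is a minimal sketch under assumptions: the inputs are plain sets of (src, dst) statement-node pairs rather than the paper's actual Progex output, and a CFG > DDG > AST priority rule is assumed for pairs that appear in more than one graph; the authors' exact procedure may differ.

```python
def build_functional_program_graph(cfg_edges, ddg_edges, ast_edges):
    """Merge CFG, DDG, and AST edge sets into one typed edge list.

    Each input is a set of (src, dst) statement-node pairs. Later writes
    override earlier ones, so CFG edges take priority over DDG edges,
    which take priority over AST edges -- an assumed tie-breaking rule,
    not the authors' published algorithm.
    """
    fused = {}
    for edge in ast_edges:
        fused[edge] = "s"  # abstract syntax structure
    for edge in ddg_edges:
        fused[edge] = "d"  # data dependency overrides syntax
    for edge in cfg_edges:
        fused[edge] = "c"  # control relationship has top priority
    return [(src, dst, r) for (src, dst), r in fused.items()]

edges = build_functional_program_graph(
    cfg_edges={(0, 1)}, ddg_edges={(0, 1), (1, 2)}, ast_edges={(2, 3)}
)
# (0, 1) keeps type "c"; (1, 2) is typed "d"; (2, 3) is typed "s"
```

Keying the dictionary on the node pair guarantees each pair carries exactly one edge type in the fused graph.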
3.3. Code Feature Extraction
3.3.1. Node Statement Feature Extraction
3.3.2. Node Statement Context Feature Extraction
3.3.3. Feature Fusion
3.4. Model Training
4. Experiments
4.1. Dataset
1. Code snippets for which the Abstract Syntax Tree, Control Flow Graph, or Data Dependency Graph could not be extracted were removed. Such failures typically arise from syntax errors that impede the parsing tools.
2. Lambda expressions in Java code were excluded because the extraction tool Progex does not support language features introduced in Java versions after 1.7.
3. Samples with node counts of zero or exceeding 600 were discarded. A node count of zero indicates an empty function body, whereas the upper threshold of 600 keeps algorithm runtime manageable.
4. Superfluous characters and symbols (e.g., \n, \t, []), as well as numeric literals with no semantic significance in the code, were removed because their presence adversely impacts matching accuracy.
5. Functions containing only comments or non-executable code were excluded. Such functions, often used as placeholders or documentation, do not contribute to meaningful code retrieval and would degrade the dataset's quality.
6. Functions, variables, and statements following camelCase or snake_case naming conventions were tokenized into individual units. This segmentation reduces vocabulary size and improves semantic clarity, enabling the model to process each component of an identifier more efficiently.
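The identifier-splitting step in the preprocessing list above can be sketched as follows. This is an illustrative implementation only (the regular expression and lowercasing are assumptions; the paper's exact tokenization rules may differ):

```python
import re

def tokenize_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase units.

    Illustrative preprocessing sketch, not the authors' exact pipeline.
    """
    tokens = []
    for part in name.split("_"):  # snake_case boundaries
        # Split camelCase/PascalCase boundaries, keeping acronyms and digits:
        # "HTTPResponse" -> ["HTTP", "Response"], "v2" -> ["v", "2"]
        tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [t.lower() for t in tokens if t]

print(tokenize_identifier("getHTTPResponse_code"))
# -> ['get', 'http', 'response', 'code']
```

Splitting on both conventions maps `getHTTPResponse` and `get_http_response` to the same token sequence, which is what shrinks the vocabulary and aligns code tokens with natural-language query words.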
4.2. Evaluation Metrics
4.3. Implementation Details
4.4. Results and Analysis
4.4.1. Performance Study
4.4.2. Baseline Rationale
4.4.3. Ablation Study
4.4.4. Sensitivity Analysis
4.4.5. Time Analysis
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Mao, Y.; Wan, C.; Jiang, Y.; Gu, X. Self-Supervised Query Reformulation for Code Search. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, San Francisco, CA, USA, 3–9 December 2023; pp. 363–374.
- Lv, F.; Zhang, H.; Lou, J.-G.; Wang, S.; Zhang, D.; Zhao, J. CodeHow: Effective Code Search Based on API Understanding and Extended Boolean Model. In Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering, Lincoln, NE, USA, 9–13 November 2015; pp. 260–270.
- Nie, L.; Jiang, H.; Ren, Z.; Sun, Z.; Li, X. Query Expansion Based on Crowd Knowledge for Code Search. IEEE Trans. Serv. Comput. 2016, 9, 771–783.
- Wang, C.; Nong, Z.; Gao, C.; Li, Z.; Zeng, J.; Xing, Z.; Liu, Y. Enriching Query Semantics for Code Search with Reinforcement Learning. arXiv 2021, arXiv:2105.09630.
- Huang, J.; Tang, D.; Shou, L.; Gong, M.; Xu, K.; Jiang, D.; Zhou, M.; Duan, N. CoSQA: 20,000+ Web Queries for Code Search and Question Answering. arXiv 2021, arXiv:2105.13239.
- Xie, Y.; Lin, J.; Dong, H.; Zhang, L.; Wu, Z. Survey of Code Search Based on Deep Learning. ACM Trans. Softw. Eng. Methodol. 2023, 33, 1–42.
- Gao, X.; Jiang, X.; Wu, Q.; Wang, X.; Lyu, C.; Lyu, L. GT-SimNet: Improving Code Automatic Summarization via Multi-Modal Similarity Networks. J. Syst. Softw. 2022, 194, 111495.
- Niu, C.; Li, C.; Ng, V.; Ge, J.; Huang, L.; Luo, B. SPT-Code: Sequence-to-Sequence Pre-training for Learning Source Code Representations. In Proceedings of the 44th International Conference on Software Engineering, Pittsburgh, PA, USA, 22–27 May 2022; pp. 2006–2018.
- Gu, W.; Li, Z.; Gao, C.; Wang, C.; Zhang, H.; Xu, Z.; Lyu, M. CRaDLe: Deep Code Retrieval Based on Semantic Dependency Learning. Neural Netw. 2021, 141, 385–394.
- Fu, X.; Zhang, J.; Meng, Z.; King, I. MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding. In Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020; pp. 2331–2341.
- Husain, H.; Wu, H.-H.; Gazit, T.; Allamanis, M.; Brockschmidt, M. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. arXiv 2019, arXiv:1909.09436.
- Sun, W.; Fang, C.; Ge, Y.; Hu, Y.; Chen, Y.; Zhang, Q.; Ge, X.; Liu, Y.; Chen, Z. A Survey of Source Code Search: A 3-Dimensional Perspective. ACM Trans. Softw. Eng. Methodol. 2024, 33, 166.
- Salza, P.; Schwizer, C.; Gu, J.; Gall, H.C. On the Effectiveness of Transfer Learning for Code Search. IEEE Trans. Softw. Eng. 2023, 49, 1804–1822.
- Chen, J.; Hu, X.; Li, Z.; Gao, C.; Xia, X.; Lo, D. Code Search Is All You Need? Improving Code Suggestions with Code Search. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, Portugal, 14–20 April 2024; pp. 73:1–73:13.
- Gu, X.; Zhang, H.; Kim, S. Deep Code Search. In Proceedings of the 40th International Conference on Software Engineering, Gothenburg, Sweden, 27 May–3 June 2018; pp. 933–944.
- Sachdev, S.; Li, H.; Luan, S.; Kim, S.; Sen, K.; Chandra, S. Retrieval on Source Code: A Neural Code Search. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, Philadelphia, PA, USA, 18 June 2018; pp. 31–41.
- Xu, L.; Yang, H.; Liu, C.; Shuai, J.; Yan, M.; Lei, Y.; Xu, Z. Two-Stage Attention-Based Model for Code Search with Textual and Structural Features. In Proceedings of the 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Luxembourg, 9–12 March 2021; pp. 342–353.
- Hu, Y.; Cai, B.; Yu, Y. CSSAM: Code Search via Attention Matching of Code Semantics and Structures. arXiv 2022, arXiv:2208.03922.
- Guo, D.; Ren, S.; Lu, S.; Feng, Z.; Tang, D.; Liu, S.; Zhou, L.; Duan, N.; Svyatkovskiy, A.; Fu, S.; et al. GraphCodeBERT: Pre-training Code Representations with Data Flow. arXiv 2020, arXiv:2009.08366.
- Zhao, W.; Liu, Y. Utilising Edge Attention in Graph-Based Code Search. In Proceedings of the 34th International Conference on Software Engineering and Knowledge Engineering, SEKE 2022, Pittsburgh, PA, USA, 1–10 July 2022; pp. 60–66.
- Wan, Y.; Shu, J.; Sui, Y.; Xu, G.; Zhao, Z.; Wu, J.; Yu, P.S. Multi-Modal Attention Network Learning for Semantic Source Code Retrieval. In Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering, San Diego, CA, USA, 11–15 November 2019; pp. 13–25.
- Tai, K.S.; Socher, R.; Manning, C.D. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, 26 June–1 July 2015; pp. 1556–1566.
- Gu, J.; Chen, Z.; Monperrus, M. Multimodal Representation for Neural Code Search. arXiv 2021, arXiv:2107.00992.
- Zeng, C.; Yu, Y.; Li, S.; Xia, X.; Wang, Z.; Geng, M.; Bai, L.; Dong, W.; Liao, X. deGraphCS: Embedding Variable-based Flow Graph for Neural Code Search. ACM Trans. Softw. Eng. Methodol. 2023, 32, 34.
- Liu, S.; Xie, X.; Siow, J.; Ma, L.; Meng, G.; Liu, Y. GraphSearchNet: Enhancing GNNs via Capturing Global Dependencies for Semantic Code Search. IEEE Trans. Softw. Eng. 2023, 49, 2839–2855.
- Yang, X.; Yan, M.; Pan, S.; Ye, X.; Fan, D. Simple and Efficient Heterogeneous Graph Neural Network. arXiv 2023, arXiv:2207.02547.
- Fu, X.; King, I. MECCH: Metapath Context Convolution-based Heterogeneous Graph Neural Networks. Neural Netw. 2024, 170, 266–275.
- Zhou, W.; Huang, H.; Shi, R.; Yin, K.; Jin, H. An Efficient Subgraph-Inferring Framework for Large-Scale Heterogeneous Graphs. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI 2024), Vancouver, BC, Canada, 20–27 February 2024; pp. 9431–9439.
- Ling, X.; Wu, L.; Wang, S.; Pan, G.; Ma, T.; Xu, F.; Liu, A.X.; Wu, C.; Ji, S. Deep Graph Matching and Searching for Semantic Code Retrieval. ACM Trans. Knowl. Discov. Data 2021, 15, 88.
- Li, J.; Peng, H.; Cao, Y.; Dou, Y.; Zhang, H.; Yu, P.S.; He, L. Higher-Order Attribute-Enhancing Heterogeneous Graph Neural Networks. IEEE Trans. Knowl. Data Eng. 2023, 35, 560–574.
- Schlichtkrull, M.; Kipf, T.N.; Bloem, P.; van den Berg, R.; Titov, I.; Welling, M. Modeling Relational Data with Graph Convolutional Networks. arXiv 2017, arXiv:1703.06103.
- Wang, X.; Ji, H.; Shi, C.; Wang, B.; Cui, P.; Yu, P.S.; Ye, Y. Heterogeneous Graph Attention Network. arXiv 2021, arXiv:1903.07293.
- Hu, F.; Wang, Y.; Du, L.; Li, X.; Zhang, H.; Han, S.; Zhang, D. Revisiting Code Search in a Two-Stage Paradigm. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, Singapore, 27 February–3 March 2023; pp. 994–1002.
- Zhu, Q.; Sun, Z.; Liang, X.; Xiong, Y.; Zhang, L. OCoR: An Overlapping-Aware Code Retriever. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, Melbourne, Australia, 21–25 September 2020; pp. 883–894.
- Ferrante, J.; Ottenstein, K.J.; Warren, J.D. The Program Dependence Graph and Its Use in Optimization. ACM Trans. Program. Lang. Syst. 1987, 9, 319–349.
- Deng, Z.; Xu, L.; Liu, C.; Yan, M.; Xu, Z.; Lei, Y. Fine-Grained Co-Attentive Representation Learning for Semantic Code Search. In Proceedings of the 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Honolulu, HI, USA, 15–18 March 2022; pp. 396–407.
- Shuai, J.; Xu, L.; Liu, C.; Yan, M.; Xia, X.; Lei, Y. Improving Code Search with Co-Attentive Representation Learning. In Proceedings of the 28th International Conference on Program Comprehension, Seoul, Republic of Korea, 5–11 October 2020; pp. 196–207.
- Gu, W.; Wang, Y.; Du, L.; Zhang, H.; Han, S.; Zhang, D.; Lyu, M. Accelerating Code Search with Deep Hashing and Code Classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 2534–2544.
- Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. arXiv 2020, arXiv:2002.08155.
- Guo, D.; Lu, S.; Duan, N.; Wang, Y.; Zhou, M.; Yin, J. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. arXiv 2022, arXiv:2203.03850.






| Model | DBLP (macro-F1) | DBLP (micro-F1) | IMDB (macro-F1) | IMDB (micro-F1) |
|---|---|---|---|---|
| MAGNN | 93.61 | 94.07 | 60.79 | 60.93 |
| MAGNN * | 93.82 | 94.23 | 61.21 | 61.26 |
| MAGNN † | 93.27 | 93.61 | 60.12 | 60.30 |
| Language | w/Documentation | All |
|---|---|---|
| Go | 347,789 | 726,768 |
| Java | 542,991 | 1,569,889 |
| JavaScript | 157,988 | 1,857,835 |
| PHP | 717,313 | 977,821 |
| Python | 503,502 | 1,156,085 |
| Ruby | 57,393 | 164,048 |
| All | 2,326,976 | 6,452,446 |
| Dataset | Sample Size (Percentage) |
|---|---|
| Training set | 393,008 (91.64%) |
| Validation set | 12,608 (2.94%) |
| Test set | 23,251 (5.42%) |
| Dataset | Nodes | Edges | Node Types |
|---|---|---|---|
| DBLP | 26,128 | 119,783 | 4 |
| IMDB | 11,616 | 17,106 | 3 |
| Method | MRR | ACC@1 | ACC@5 | ACC@10 |
|---|---|---|---|---|
| NCS | 0.367 | 0.288 | 0.454 | 0.455 |
| DeepCS | 0.461 | 0.358 | 0.579 | 0.659 |
| NBoW | 0.544 | 0.447 | 0.660 | 0.725 |
| MMAN | 0.494 | 0.381 | 0.630 | 0.719 |
| DGMS | 0.465 | 0.339 | 0.613 | 0.706 |
| MRNCS | 0.596 | 0.503 | 0.706 | 0.768 |
| FPGraphCS | 0.651 | 0.549 | 0.775 | 0.842 |
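The MRR and ACC@k values in the tables follow their standard definitions over the 1-based rank of the correct code snippet for each query. A minimal sketch (standard formulas, not the authors' evaluation script):

```python
def mrr(ranks):
    """Mean Reciprocal Rank: average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def acc_at_k(ranks, k):
    """ACC@k: fraction of queries whose correct result is in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

ranks = [1, 3, 2, 11]        # hypothetical 1-based ranks for four queries
score = mrr(ranks)           # (1 + 1/3 + 1/2 + 1/11) / 4
top5 = acc_at_k(ranks, 5)    # 3 of 4 queries ranked within top 5 -> 0.75
```

Because ACC@k only counts hits within the cutoff, ACC@10 is always at least ACC@5, which is a quick sanity check on result tables of this kind.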
| Method | MRR | ACC@1 | ACC@5 | ACC@10 |
|---|---|---|---|---|
| FPGraphCS–CFG | 0.624 | 0.522 | 0.750 | 0.821 |
| FPGraphCS–AST | 0.629 | 0.521 | 0.575 | 0.789 |
| FPGraphCS–DDG | 0.529 | 0.481 | 0.719 | 0.790 |
| FPGraphCS | 0.642 | 0.539 | 0.767 | 0.834 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Ao, L.; Qi, R. Early-Stage Graph Fusion with Refined Graph Neural Networks for Semantic Code Search. Appl. Sci. 2026, 16, 12. https://doi.org/10.3390/app16010012

