Review

Neural Methods for Programming: A Comprehensive Survey and Future Directions

1 Department of Computer Science, Kangwon National University, Chuncheon 24341, Republic of Korea
2 Department of Data Science, Kangwon National University, Chuncheon 24341, Republic of Korea
3 Theory of Computation Laboratory, Yonsei University, Seoul 03722, Republic of Korea
4 Department of Artificial Intelligence, University of Seoul, Seoul 02504, Republic of Korea
5 Department of Computer Science and Engineering, Kangwon National University, Chuncheon 24341, Republic of Korea
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(22), 12150; https://doi.org/10.3390/app152212150
Submission received: 10 October 2025 / Revised: 8 November 2025 / Accepted: 11 November 2025 / Published: 16 November 2025
(This article belongs to the Special Issue Artificial Intelligence in Software Engineering)

Abstract

The advancement of neural-based models has driven significant progress in modern code intelligence, accelerating the development of intelligent programming tools such as code assistants and automated software engineering systems. This study presents a comprehensive and systematic survey of neural methods for programming tasks within the broader context of software development. Guided by six research questions, this study synthesizes insights from more than 250 scientific papers, the majority of which were published between 2015 and 2025, with earlier foundational works (dating back to the late 1990s) included for historical context. The analysis spans 18 major programming tasks, including code generation, code translation, code clone detection, code classification, and vulnerability detection. The survey methodically examines the development and evolution of neural approaches, the datasets employed, and the performance evaluation metrics adopted in this field. It traces the progress in neural techniques from early code modeling approaches to advanced Code-specific Large Language Models (Code LLMs), emphasizing their advantages over traditional rule-based and statistical methods. A taxonomy of evaluation metrics and a categorized summary of datasets and benchmarks reveal both progress and persistent limitations in data coverage and evaluation practices. The review further distinguishes between neural models designed for natural language processing and those designed for programming languages, highlighting the structural and functional characteristics that influence model performance. Finally, the study discusses emerging trends, unresolved challenges, and potential research directions, underscoring the transformative role of neural-based architectures, particularly Code LLMs, in enhancing programming and software design activities and shaping the future of AI-driven software development.

1. Introduction

Artificial intelligence (AI) and its subfields, such as machine learning (ML), deep learning (DL), and generative AI, have become integral to modern activities by improving efficiency and quality in daily tasks. DL-based natural language processing (NLP) enables automation beyond human capabilities [1], including language-related tasks such as autocompletion, long-form text predictions [2], and multilingual translation. Neural networks are the fundamental building blocks of these advancements and can therefore be considered key drivers of AI-based automation, enhancing various aspects of daily life. Applications of neural network-based language models, such as ChatGPT [3], showcase improvements in NLP. However, programming, which involves structured instructions to solve computational problems, remains a challenging domain due to its strict syntax, logical complexity, and the demands of semantic analysis. For decades, tasks like writing new code, translating or refactoring existing code, and debugging have been performed manually or with rigid tools that often struggle to keep up with dynamic and evolving project requirements. This complexity has driven interest in neural approaches for programming tasks, as data-driven models can learn intricate patterns and generalize more effectively to novel problems. Recently, state-of-the-art (SOTA) NLP models—such as OpenAI’s Generative Pre-trained Transformer (GPT) models [3]—have been adapted for programming language (PL) tasks [4], and Code-specific Large Language Models (Code LLMs) have been tailored for code-related tasks [5], thereby reducing traditionally time-consuming manual processes such as code editing, comment writing, summarization, and debugging while enabling automation in a range of programming tasks, including source code translation [6,7], code generation [5], summarization [8], automatic code editing [9], decompilation [10], and code similarity detection [11], among others.
Despite the active utilization of neural-based NLP techniques in code-related tasks, there remains a noticeable gap in comprehensive analyses examining the rapid and evolving integration of neural methods into programming-focused applications. Although previous surveys have examined scientific research on neural methods for programming tasks, they often suffer from a limited scope—either focusing on specific tasks [12,13], covering only a narrow subset of neural network-based techniques used in these tasks [14], emphasizing general AI/ML methods with minimal attention to neural techniques [15], or concentrating on broader software engineering (SE) applications with insufficient focus on programming-centric tasks. Some reviews [16,17,18] classify the applications of neural models on code into three categories—code generation models, representational models, and pattern mining (code understanding) models—with comparisons of their usage in NLP and programming tasks, whereas others analyze the role of neural models in code understanding by distinguishing between sequence-based and graph-based approaches. However, many [19,20,21] omit discussions on the role of neural methods specifically tailored to programming tasks, overlooking the challenges these methods face in capturing the complex syntactic and semantic structures of code and the logical and idiomatic constraints unique to each programming language.
Given the limitations of previous surveys [22,23,24,25,26]—such as their task-specific focus (e.g., code summarization and program repair) and the omission of critical discussions [27,28]—researchers may struggle to identify gaps for further exploration in research themes targeting neural networks applied to programming tasks. To address this, our survey consolidates previous surveys and provides a comprehensive review of existing studies, bridging research gaps and guiding future investigations. In summary, this survey outlines the following concepts and contributions:
  • Review existing survey papers on neural methods for programming tasks.
  • Compare neural methods applied to NLs and PLs.
  • Compare rule-based, statistical, and neural methods for code-related tasks.
  • Review 18 programming tasks and outline future research directions for each.
  • Examine datasets and benchmarks used in these tasks.
  • Analyze performance evaluation metrics for neural models in the reviewed papers.
  • Summarize key findings from the reviewed papers on neural models applied to source code.
  • Discuss major challenges in applying neural-based models for programming tasks.
  • Explore the role of LLMs in programming.
  • Identify research gaps and propose future research directions.
This survey serves as a roadmap for researchers interested in the applications of neural networks to programming languages and programming tasks, highlighting the evolution of neural networks from early single-layer models [29] to modern large-scale architectures [30], as depicted in Figure 1. By providing this historical perspective, we contextualize recent achievements and outline directions for future research. To the best of our knowledge, no existing survey is solely dedicated to neural methods with a comprehensive review tracing their application to programming tasks, from early developments to the latest advancements in Code LLMs trained exclusively on code. To maintain consistency and enhance clarity throughout our analysis, we use the term programming tasks to refer to the various code-related activities discussed in this study. While the literature employs a range of terms, such as SE tasks [14], coding tasks [31], code-related tasks, and programming language tasks [32], to describe similar or overlapping concepts, we adopt a unified terminology to reduce ambiguity for readers. However, given the breadth and interdisciplinary nature of these tasks, often involving source code, bug reports, and natural language artifacts, we occasionally use the term SE tasks when appropriate.
Our research scope concentrates on neural methods applied specifically to programming-related SE, as many other SE activities are already well covered by general-purpose NLP and deep learning surveys. Consequently, this work represents the first comprehensive and systematic review dedicated exclusively to examining how neural approaches have been applied within programming-centric SE domains. To guide this study, we propose the following research questions:
  • RQ1: How do neural approaches compare to rule-based and statistical methods in the context of programming tasks?
  • RQ2: What is the current landscape of datasets and benchmarks for neural methods in programming tasks, and what are the critical gaps?
  • RQ3: Which evaluation metrics best capture model performance on code, both syntactically and functionally, and where do standard NLP metrics fall short?
  • RQ4: What roles do LLMs (e.g., GPT-4, LLaMA, Claude) play in programming tasks?
  • RQ5: What are the main bottlenecks in scaling and deploying neural-based programming solutions to real-world codebases, and how can they be addressed?
  • RQ6: How have neural methods for programming evolved, and which model- and system-level advances have driven this progression?
Finally, we emphasize that our survey follows established systematic review practices—we defined a set of search queries and inclusion/exclusion criteria, and applied consistent paper collection procedures, as noted in Section 3. Our methodology is based on standard guidelines for software engineering systematic literature reviews (e.g., Brereton et al. [33] and Kitchenham and Charters [34]). We extend these practices by including quantitative and qualitative trend analysis to enhance comprehensiveness and rigor, ensuring that our review is reproducible.
To facilitate the discussion, the rest of the paper is structured as follows. Section 2 reviews existing surveys. Section 3 outlines the paper selection process. Section 4 compares the neural models used for NLs and PLs. Section 5 presents neural methods for 18 programming tasks. Section 6 reviews the datasets and evaluation metrics. Section 7 compares neural methods with traditional methods such as rule-based and statistical methods. Section 8 presents answers to the research questions and outlines future research directions, while Section 9 concludes.

2. Related Works

This section reviews prior surveys examining AI approaches in programming and SE, particularly the role of neural methods in programming-related tasks. While some works broadly address AI in SE, they often overlook the substantial gains and critical technical challenges specific to neural methods in code-focused applications. Al-Hossami and Shaikh [27] present a taxonomy of AI methods applied to source code and conversational SE systems, spanning traditional neural models such as Multilayer Perceptron, Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM) to Transformers. They categorize tasks into open-domain and task-oriented dialogue systems, referencing datasets like CoNaLa, CodeSearchNet, and other GitHub-derived corpora. Similarly, Watson et al. [35] review DL utilization across SE tasks including defect prediction, program comprehension, and bug fixing. They identify key neural architectures such as autoencoders, Siamese Networks, encoder–decoder models, and CNNs, and common metrics like accuracy, F1, Mean Reciprocal Rank (MRR), Recall@k, and Bilingual Evaluation Understudy (BLEU), while noting limitations such as weak reproducibility, preprocessing inconsistencies, and overfitting risks across the reviewed works.
Cao et al. [36] survey explainable AI in SE, classifying methods by portability (model-agnostic/specific), timing (ante hoc/post hoc), and scope (local/global). Their dataset coverage includes Devign (Deep Vulnerability Identification via Graph Neural Networks), National Vulnerability Database (NVD), StackOverflow, and CodeSearchNet. They highlight explainable AI’s stronger presence in security and defect prediction, its underutilization in early SE phases, and concerns over inconsistent evaluations and lack of standard baselines. The review in [37] emphasizes encoder–decoder architectures in code modeling and generation, exploring a wide array of code representations. Similarly, Samoaa et al. [17] systematically map how DL techniques model code using token, tree, graph, and hybrid formats across classification, clone detection, and summarization tasks. They advocate for standardized benchmarks with industry-level evaluation. Hou et al. [38] discuss the dominance of decoder-only and encoder–decoder architectures (e.g., GPT-3/4, Codex, and CodeT5) with a review on publicly available datasets like HumanEval and Mostly Basic Programming Problems (MBPP). They emphasize the role of prompt engineering and parameter-efficient tuning techniques like Low-Rank Adaptation, prefix-tuning, and adapters, though most models are validated only on academic datasets. Similarly emphasizing LLMs, Fan et al. [39] raise concerns about hallucinations, weak evaluation frameworks, and a lack of rigorous metrics for assessing generated code and other artifacts.
More narrowly scoped reviews target specific programming tasks. Akimova et al. [24] review studies on software defect prediction that utilize RNNs, LSTMs, Tree-LSTMs, GNNs, and Transformer-based models. While neural approaches outperform classical methods, challenges remain in benchmark standardization, class imbalance, and metric selection. Wu et al. [40] analyze vulnerability detection methods, comparing RNN-based and GNN-based models. They stress the need for better code embeddings, representation standards, and interpretability. Xie et al. [25] classify research on code search into three phases: query semantic modeling, code semantic modeling, and semantic matching. Their findings show that pre-trained Transformers like CodeBERT and multi-view inputs significantly improve retrieval, while noting the high training cost and blurred task boundaries with clone detection. Dou et al. [41] empirically review LLMs for code clone detection, comparing open-source (e.g., LLaMA, Vicuna, and Falcon) and proprietary models (e.g., GPT-3.5 and GPT-4) against traditional tools like SourcererCC and NiCad. Using benchmarks such as BigCloneBench and CodeNet (Java, C++, and Python), they examine clone types (1–4) through zero-shot prompts, chain-of-thought prompting, and embedding-based methods (e.g., text-embedding-ada-002 and CodeBERT). LLMs notably outperform classical tools for Type-3/4 clones, especially for Python. GPT-4 demonstrates robustness under multi-step chain-of-thought prompting. Yet, limitations include prompt design constraints, lack of few-shot settings, potential training/evaluation overlap, and computational resource restrictions.
Zhong et al. [42] offer a taxonomy of neural program repair, emphasizing the review of automatic program repair (APR) techniques that leverage various data representation approaches, such as abstract syntax tree (AST) and control flow graph (CFG), to extract context, and employ different neural-based methods, such as encoder–decoder, tree-based, and graph-based architectures, to generate correct or plausible bug-fixing patches. Similarly, Huang et al. [43] trace the APR evolution from search-based to constraint-based, template-based, and learning-based techniques, highlighting models from RNNs and Transformers to static-analysis-integrated encodings, tested on various benchmarks like Defects4J and QuixBugs. Zhang et al. [22] also review learning-based APR tools that frame bug fixing as neural translation, identifying gaps such as limited multi-hunk fix support and inadequate benchmark standardization. Wang et al. [44] provide a broad survey of LLMs in software testing and repair, noting that decoder-only and encoder–decoder models such as GPT-3, Codex, Text-to-Text Transfer Transformer (T5), CodeT5, BART, and LLaMA often outperform traditional techniques. However, they raise caution about benchmark data leakage, privacy risks, and the computational cost of real-world deployment.
Le et al. [28] classify data-driven vulnerability assessment into five themes and subthemes, examining various AI techniques: traditional ML like Support Vector Machine (SVM) and random forest, deep learning like Siamese Neural Networks and GNNs, and knowledge graphs. They also explore data sources like ExploitDB and the Common Vulnerability Scoring System (CVSS), recommending robust validation strategies and evaluation metrics suited for imbalanced data such as Matthews Correlation Coefficient, Mean Absolute Error, and Root Mean Square Error. With a similar research focus, Xiaomeng et al. [23] review ML-based static code analysis for vulnerability detection. Their work spans approaches from traditional models like SVMs to deep learning (CNNs, RNNs, and LSTMs), emphasizing reduced manual effort but also highlighting issues like class imbalance and poor generalizability across projects and languages.
Zakeri-Nasrabadi et al. [26] provide a taxonomy of code similarity techniques, covering token-based, tree-based, graph-based, metric-based, and learning-based approaches. They assess benchmarks such as BigCloneBench and metrics like precision, recall, and F1, noting the dominance of languages like Java and C++ and the reproducibility limitations caused by tools that are not openly accessible. Katsogiannis et al. [45] survey neural approaches for Text-to-SQL translation, reviewing architectures such as seq2seq models, grammar-guided decoders, and GNNs for schema linking. While pre-training-enhanced language models show strong accuracy on benchmarks like WikiSQL and Spider, challenges remain in scalability and generalization to new domains. Grazia and Pradel [46] present a comprehensive review of code search research spanning three decades. They categorize query mechanisms—ranging from NL and code snippets to formal patterns and input/output examples—and discuss indexing strategies, retrieval methods, and ranking techniques. Their survey addresses real-world search behaviors and emphasizes open challenges, such as supporting version history, cross-language search, and more robust evaluation frameworks.
Zheng et al. [47] review 149 studies, comparing LLMs trained or fine-tuned on code, such as Codex, CodeLLaMA, and AlphaCode, against general LLMs using HumanEval and Automated Programming Progress Standard (APPS) benchmarks. Their findings show that code-specific models typically outperform general models using metrics like Pass@k and BLEU. Zan et al. [48] evaluate 27 LLMs for NL-to-code tasks, identifying three key performance drivers: model size, access to premium datasets, and expert fine-tuning. However, they note an overemphasis on short Python snippets, limiting generalization. Wong et al. [49] survey the impact of LLMs trained on “Big Code” in integrated development environments (IDEs), examining their ability to support tasks like code autocompletion and semantic comprehension. They also caution about drawbacks such as model bias, security concerns, and latency that may hinder real-time applicability. Similarly, Zheng et al. [50] focus on Transformer-based pre-trained language models and LLMs with parameter sizes greater than or equal to 0.8B, across seven core tasks such as test generation, defect detection, and code repair. These models perform well in syntax-aware tasks but struggle with semantic understanding. Their review also underscores fragmented benchmarks, limited error analysis, and insufficient attention to ethical and cost-related concerns. Xu and Zhu [51] offer a deep dive into pre-trained language models like CodeBERT, GraphCodeBERT, and CodeT5, detailing their distinct learning objectives and use of input structures such as AST and data-flow graph (DFG). However, they note issues including scarce multilingual datasets and underexplored graph-based techniques.
Some surveys have also explored the application of AI and its variants across the SE life-cycle. Crawford et al. [21] review the integration of AI into software project management, tracing its progression from early expert systems and rule-based planners to contemporary neural network applications. Their findings highlight AI’s value in improving estimation accuracy, risk management, and analytics for agile workflows. However, persistent challenges remain, including data privacy concerns and the lack of interpretability in AI models, both of which hinder stakeholder trust and limit industry adoption. Batarseh et al. [52] map AI methods to five SE phases, finding active prototyping in testing and requirements engineering, yet limited industrial validation and evaluation standardization. Durrani et al. [19] survey a broad range of domains across SE phases. However, their treatment of SE tasks and overreliance on quantitative summaries mined from AI tools (e.g., dimensions.ai) detract from the unique advances in neural methods specific to programming. Their focus on older models like random forests overlooks the transformative capacity of modern neural models that now even support complex program generation and logical reasoning.
In contrast, our survey concentrates on programming-related SE tasks. While some documentation or design tasks may benefit from lightly fine-tuned general NLP models, we focus on how neural-based architectures are designed to understand, generate, and interact with source code. The limitations of prior surveys—such as narrow benchmarks, limited real-world validation, and fragmented methodologies—underscore the need for a unified, code-centric view of modern neural-based applications in SE. Table 1 compares existing survey and review articles. It outlines each prior work’s main focus, key strengths, identified limitations, and basis of validation, allowing readers to quickly compare how existing approaches relate to ours. Based on our review of previous survey studies, we observe that prior works on the application of neural methods and general AI/ML techniques for software engineering and programming tasks can be broadly categorized into four groups as summarized in Table 2.
Our survey distinguishes itself from prior reviews in several key ways. Unlike earlier works that narrowly focus on specific deep learning techniques or isolated SE subdomains, our survey systematically covers almost all types of neural methods across a broad range of programming-centric SE tasks. Additionally, we compare the application of neural methods in both PL and NL, highlighting representative models used in each context. Furthermore, our work stands out by explicitly comparing rule-based, statistical, and neural approaches, offering a holistic understanding of how neural models outperform earlier paradigms as detailed in Section 7. Overall, our survey adopts a broader perspective by integrating both qualitative synthesis and quantitative analysis, examining over 250 papers across 18 distinct programming task domains. We also provide trend-based insights on publication growth, dataset and benchmark usage, and model performance metrics as described in Section 6.

3. Scientific Paper Selection Process

Our paper selection process is visually summarized in Figure 2. In this study, we have collected and analyzed more than 250 research papers, exploring their primary objectives, contributions, and future research directions. Most of these works were published between 2015 and 2025, covering a decade of intensive research on neural methods for programming tasks. A few earlier foundational studies, dating back to the late 1990s, were also included to provide historical continuity and contextualize recent advances as depicted in Figure 3.
Our investigation delves into various applications of neural methods in programming tasks, including advanced code representation techniques that enhance automation in software development [59]. Table 3 provides an overview of the digital sources in which these papers are published, while Figure 3 highlights that recent years have witnessed a significant increase in publications within this domain.
The methodology for selecting research papers in this survey aligns with established approaches, such as those employed in [60]. Specifically, we manually defined key phrases and search terms related to various programming tasks to retrieve relevant articles from online digital libraries. Our approach to identifying relevant literature was methodical and targeted. We focused on search phrases involving neural approaches within the context of programming and programming-centric SE. Specifically, we frequently used keyword combinations that included terms like “neural methods”, “neural models”, and “neural networks”, paired with common prepositions followed by the phrases “programming tasks”, “programming”, and “software engineering”, as well as specific code-related task names. The primary search patterns are shown below:
  • (“neural methods” | “neural models” | “neural networks”) (“in” | “for”) (“programming tasks” | “programming” | “software engineering”)
  • (“neural methods” | “neural models” | “neural networks”) (“in” | “for” | “on” | “and” ) [specific task]
where “|” stands for “or”, and [specific task] is a placeholder for areas such as code translation, code generation, code clone detection, code summarization, program synthesis, and other programming-related tasks.
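To make the retrieval procedure concrete, the following minimal Python sketch (our own illustrative helper, not part of any cited tool) expands the two patterns above into concrete query strings; the task list shown is only a placeholder subset of the 18 tasks considered.

    from itertools import product

    # Illustrative expansion of the search patterns described above.
    methods = ["neural methods", "neural models", "neural networks"]
    connectors = ["in", "for"]
    scopes = ["programming tasks", "programming", "software engineering"]
    tasks = ["code translation", "code generation", "code clone detection"]  # placeholder subset

    general_queries = [f'"{m}" {c} "{s}"' for m, c, s in product(methods, connectors, scopes)]
    task_queries = [f'"{m}" {c} "{t}"' for m, c, t in product(methods, ["in", "for", "on", "and"], tasks)]

    for query in general_queries + task_queries:
        print(query)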
We also applied inclusion and exclusion criteria to refine our selection. Only studies written in English were considered. Furthermore, we specifically included studies that explore the application of neural methods within programming-centric software engineering tasks. Conversely, studies focusing on the reverse, such as applying software engineering techniques to improve AI systems or those discussing neural methods without a clear application to software or programming tasks, were excluded from our review. After the initial collection, we refined the selection by filtering articles based on specific attributes to ensure alignment with our research objectives and focus. The paper selection criteria predominantly include the following aspects:
  • Breakthrough research and SOTA approaches: Most selected articles focus on SOTA neural network architectures, including Transformer-based neural networks, encoder–decoder models, and other deep learning frameworks [61]. These studies provide cutting-edge advancements in neural-based code processing.
  • Recent publications and citation-based filtering: A significant portion of the papers have been published recently. To ensure that we include impactful research articles, we adopt a citation score-based filtering approach as in [60]. However, for studies published between 2022 and 2025, we prioritize the relevance of the topic without referring to citation counts.
To ensure the inclusion of impactful but recently published studies, we intentionally excluded citation counts as a criterion when selecting papers from 2022 to 2025. Many papers from this period were initially released as preprints on platforms like arXiv and had not accumulated substantial citations, even though they were later accepted by peer-reviewed venues in 2024 and 2025. Relying on citation metrics could have led to the under-representation of valuable studies that were still in the process of gaining visibility. Notably, key contributions—such as OpenAI’s GPT-4 and various iterations of Meta’s LLaMA models—were disseminated via arXiv rather than formal publication channels [62,63,64]. This reflects a broader trend toward early sharing in the AI community. To fairly represent the most relevant and innovative research from this evolving landscape, we prioritized the relevance and impact of contributions over citation numbers. During the filtering process, we first reviewed each paper’s abstract for relevance. If clarity was insufficient, we further examined the introduction, model architecture and research approach, experimental setup, and discussion sections.
To further illustrate the thematic relationships within the reviewed literature, Figure 4 presents a keyword co-occurrence network generated from the bibliometric analysis of the collected papers. The network highlights how keywords such as code, source code, learning, neural, and language models form the core of the research landscape, demonstrating strong associations with programming tasks including code translation, code generation, code summarization, and others. These interlinked clusters reveal the growing convergence between neural methods, neural networks, code understanding and generation, and software engineering automation.

4. Neural Methods for Natural Languages vs. Programming Languages

Recent developments in large-scale NLP models have revolutionized human language processing [1,65]. Given the structural and lexical similarities between programming and natural languages, researchers have adapted neural-based NLP techniques to PLs [66]. PLs are essential tools for operating computing devices, enabling programmers to develop applications and systems. Unlike NLs, which evolve through speech and cultural influences, PLs are designed by programmers to serve as a medium of communication between humans and machines. While NLs are analyzed through the norms and grammars of their speakers, PLs are based solely on the syntax, semantics, and grammar rules defined by their designers, making their processing a specialized subset of NLP. As such, the application of neural-based methods to PLs has benefited from advances in NLP-driven text processing. However, neural approaches tailored to PLs must account for significant differences from those used for NLs.
In terms of using neural methods for PLs, the following are remarkable opportunities we can exploit:
  • Simplified Linguistic Rules: PLs avoid challenges like pronunciation, accents (e.g., UK vs. US English), writing styles (e.g., formal vs. informal) [67], and text writing direction (e.g., Arabic vs. English), making automation easier than in NLs.
  • Abundant Open-Source Code Corpora: Repositories like GitHub and coding competitions offer large collections of code written in individual PLs, providing abundant monolingual data that enhance neural model performance on language-specific programming tasks.
On the other hand, there are also some pitfalls in using neural methods for PLs as shown below:
  • Limited Learning Modes: Unlike NLs, PLs are learned solely through reading and writing, preventing neural-based automation from leveraging voice or speech processing [68].
  • Grammar Sensitivity: PLs are highly syntax dependent, where minor errors, like missing whitespace in Python or misplaced semicolons in compiled languages, can cause the entire program to fail.
Therefore, the notable distinction between PLs and NLs lies in syntax and ambiguity: PLs are inherently unambiguous and structured to be deterministically parsed, whereas NLs often exhibit ambiguity due to diverse dialects and informal usage. This syntactic rigidity implies that code tokens follow strict formal rules, such as matching braces and consistent indentation that are not typical considerations in NLP tasks. As a result, code tends to be more predictable than human language, given that developers frequently reuse established idioms and avoid non-standard constructs. Consequently, neural models for code often integrate structural representations, such as ASTs or program graphs, rather than processing flat token sequences alone. For instance, GraphCodeBERT [69] enhances a Transformer-based architecture by incorporating DFGs alongside the sequence of tokens to capture variable dependencies and value propagation within the code.
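To give a concrete, simplified picture of what pairing token sequences with data-flow information means in practice, the sketch below (our own illustration, far coarser than GraphCodeBERT’s actual DFG construction) uses Python’s standard ast module to record, for each variable read, the line of its most recent assignment.

    import ast

    code = "x = 1\ny = x + 2\nx = y * 3"

    # Walk assignments in order and record, for each variable read,
    # which earlier assignment its value most recently comes from.
    last_write = {}
    edges = []
    for node in ast.parse(code).body:
        if isinstance(node, ast.Assign):
            for name in ast.walk(node.value):
                if isinstance(name, ast.Name) and name.id in last_write:
                    edges.append((name.id, last_write[name.id], node.lineno))
            for target in node.targets:
                if isinstance(target, ast.Name):
                    last_write[target.id] = node.lineno

    print(edges)  # [('x', 1, 2), ('y', 2, 3)]

Such (variable, definition line, use line) edges are a crude stand-in for the data-flow edges that structure-aware models consume alongside the token sequence.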
While general-purpose neural language models like BERT [70] have been fine-tuned for both code and NL domains, models trained explicitly for code typically include additional architectural components. These may involve embedding ASTs or data-flow structures with structure-aware neural encoders, or using specialized attention masks. Some models also augment input representations with tokens specific to programming constructs, such as language keywords or data types. By contrast, neural models in NLP often emphasize shallow parsing and focus on linguistic challenges such as co-reference resolution and semantic interpretation.
Hence, in this context, we review some prominent NLP models and compare their equivalent models trained for PLs. For example, Salesforce’s CodeT5 model [71] was developed from Google’s T5 model. Both employ Byte Pair Encoding tokenization and follow an encoder–decoder architecture. However, while T5 was trained on the C4 corpus, CodeT5 was pre-trained on CodeSearchNet and fine-tuned using CodeXGLUE, enabling it to process both NL and source code. T5 has since evolved into the t5x and seqio frameworks. Similarly, CodeT5 has been extended into CodeT5+, incorporating objectives such as instruction tuning, text-code matching, span denoising, causal language model pre-training, and contrastive learning. Inspired by such developments, large multilingual corpora combining natural and programming language tokens are being introduced in the field [72].
Additionally, models derived from BERT have demonstrated strong performance in both NLP and programming tasks. For instance, cuBERT [73] utilizes standard Python tokenization with Word2Vec-based fine-tuning for code classification. CoCLUBERT [74] enhances code clustering through objectives like Deep Robust Clustering, CoCLUBERT-Triplet, and CoCLUBERT-Unsupervised, outperforming its cuBERT baseline. Microsoft’s CodeBERT [75] employs bimodal Masked Language Modeling with NL and code, along with unimodal (replaced token detection) objectives. Several CodeBERT variants, such as GraphCodeBERT [69], UniXcoder [76], and LongCoder [77], address additional programming-specific challenges.
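As a minimal sketch of how such a bimodal model is queried in practice (assuming the Hugging Face transformers library and the publicly released microsoft/codebert-base checkpoint), the following example encodes a natural language description together with a code snippet and extracts a joint representation.

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
    model = AutoModel.from_pretrained("microsoft/codebert-base")

    nl = "return the maximum of two numbers"
    code = "def max2(a, b): return a if a > b else b"

    # Bimodal NL-code pair input; the tokenizer inserts the separator tokens.
    inputs = tokenizer(nl, code, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # First-token embedding as a joint NL-code representation.
    joint_vec = outputs.last_hidden_state[:, 0, :]
    print(joint_vec.shape)  # torch.Size([1, 768])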
Cross-lingual pre-training and fine-tuning techniques [78] have further enhanced the adaptability of PL models. A key example is the unsupervised neural machine translation (NMT) approach introduced in [65], which employs cross-lingual masked language pre-training, Denoising Autoencoding (DAE), and back-translation techniques. This approach forms the backbone of Facebook’s (now Meta) TransCoder models, which were designed for code translation [79].
Furthermore, the code summarization research in [8] has introduced mechanisms such as the “copy attention mechanism”, improving the quality of code summarization. Unlike its equivalent NLP-based text summarization model presented in [80], which relies on vanilla Transformers and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores, the code summarization model integrates multiple performance metrics, including BLEU, Metric for Evaluation of Translation with Explicit ORdering (METEOR), and ROUGE.
A key distinction between neural models for NLs and PLs lies in their training data. Unlike NL corpora, source code cannot be treated as plain text. Therefore, adapting an NLP method for PLs requires specialized preprocessing, tokenization, and model architecture. These differences help explain why directly transferring NLP models to programming tasks frequently leads to suboptimal results. Neural architectures tailored for programming tasks, through the incorporation of structural features such as ASTs, control and data flow analysis, and type information, tend to outperform their general-purpose counterparts on tasks like code completion, summarization, and bug detection.
To complement the conceptual discussions above, Table 4 provides a comparative summary of neural models designed for natural and programming languages. It distinguishes their representative tasks, data sources, and evaluation metrics. The comparison highlights how PL models (e.g., CodeBERT and CodeT5) extend NL counterparts (e.g., BERT and T5) by incorporating structural information such as ASTs and Data Flow Graphs (DFGs), which enable improved performance in code-related tasks while introducing new challenges in cross-language generalization and semantic correctness.

5. Neural Methods for Programming and Code-Centric SE Tasks

In this section, we explore neural methods for programming-centric SE tasks in fine-grained detail, covering the research focus, outcomes, and future directions for each task. Table 5 shows an overall summary of the 18 programming tasks, including their main objectives related to utilizing neural methods, along with their open research issues. In selecting these tasks, we prioritized those most frequently addressed in the large volume of literature, capturing widely studied areas with consistent terminology. Tasks mentioned under variant names (e.g., code recommendation vs. code completion; vulnerability prediction vs. vulnerability detection; and static/dynamic analysis vs. code analysis) were mapped to our taxonomy. Unlike some prior survey papers, for instance the review by Durrani et al. [19], whose analysis relies on automated indexing tools such as Dimensions.ai, our review uses rigorous manual literature collection and researcher-driven analysis. Our approach, which encompasses 18 distinct programming tasks, highlights the multidimensional contributions of neural methods to the field of programming. We emphasize the evolving capabilities of neural models that form the backbone of code assistant tools such as GitHub Copilot, which have not yet been sufficiently acknowledged by recent survey work such as [19].

5.1. Source Code Translation

Source code translation can be viewed in two ways. First, it refers to the migration of outdated versions of source code to a newer version within the same PL [81]. Second, it involves translating source code from one PL to another [7,79]. In a broader sense, code translation can encompass shifts between different programming paradigms (e.g., declarative to imperative or procedural to object-oriented). It may also involve porting Application Programming Interfaces (APIs) from one language to another [82]. Although early traditional tools for code translation have some limitations, they are cost effective compared to manual translation and provide benefits to businesses across various domains. Source code translation has applications across multiple fields, including high-tech software production, cyber-security, and more. Application areas of code translation include (but are not limited to) the following:
  • Platform- or application-specific optimizations for high-performance computing applications [83], especially transitioning software from outdated to modern languages.
  • Upgrading software to secure, up-to-date, and well-documented PLs to enhance long-term maintainability and compliance.
  • Facilitating the innovation of new PLs by simplifying legacy code migration.
  • Advanced program analysis and verification [84].
Existing source code translation methods can be broadly classified into three categories: rule-based, statistical, and neural translation approaches. Additionally, for cases involving a small number of lines of code, manual translation can be a viable option. Similar categorical approaches are also found in other programming tasks, such as code generation, code search, code summarization, etc.
Table 5. An overall summary of programming tasks using neural models, their objectives, and corresponding open research issues.
Task | Objective | Open Research Issues
Automatic Code Edit | Auto-modification of code. | Stable benchmark datasets for evaluation.
Code Analysis | Evaluates structural and runtime behavior of source code. | Need for multiple representations for efficiency.
Code Authorship and Identification | Utilizes neural models to attribute code to developers. | Challenges in coding style variability, multi-language authorship, and AI-generated code identification.
Code Change Detection | Tracks commit updates in large-scale software development. | Potential for IDE integration with push notifications.
Code Classification | Categorizes code based on syntax, semantics, and algorithms. | Demands of expanding classification criteria.
Code Clone Detection | Identifies duplicate or near-duplicate code snippets to maintain software quality. | Demands of distinguishing code clone detection from code similarity detection.
Code Completion | Enhances productivity by auto-completing blocks of code. | Privacy and security concerns.
Code Generation | Automates code creation using code assistants. | Ensuring correctness and execution reliability.
Code Modeling and Representation | Uses various representations for better source code understanding. | Designing optimal representation strategies remains an open issue.
Code Search and Retrieval | Enables retrieval of relevant code snippets. | Safety and security of AI-searched code remain a concern.
Code Similarity Detection | Identifies duplicate code using neural networks. | Challenges in dataset reliability.
Code Summarization | Generates human-readable descriptions for code. | Evaluation is highly dependent on NLP metrics.
Code Vulnerability Detection | Uses various neural methods to detect security flaws. | False positives persist; need to integrate neural methods with static/dynamic analysis.
Comment Generation | Generates descriptive comments for code. | Need for multilingual support.
Decompilation | Converts machine-level code into high-level source code. | Need for improving generalization across languages.
Program Repair and Bug Fix | Improves program debugging and efficiency. | Integrating neural methods with program verification systems for runtime bugs.
Program Synthesis | Converts NL instructions into code using neural models. | Evaluation benchmarks remain an issue.
Source Code Translation | Migrates code across different versions of the same language or translates it between different languages [85]. | Difficulties in ensuring semantic and functional equivalence between the source and translated programs.
Manual code translation involves an expert translating code between languages, ensuring both source and target languages meet functional requirements. While effective for small functions or short code snippets, this approach is impractical for larger codebases, as it is time-consuming, error-prone, and inefficient.
The rule-based source code translations [86,87], also known as conventional methods, are mainly dependent on the AST of the program’s source code [7]. However, there are also variations in the implementation of these code translation approaches.
In [83], the source code’s AST is converted to an eXtensible Markup Language (XML), with user-defined rules applied to optimize code translations based on platform-specific features. This research developed the Xevolver tool, built on top of the ROSE compiler infrastructure, which uses XML ASTs to enable code modifications between the source and target languages. Another example is the OP2-Clang tool [88], which utilizes the Clang/LLVM AST matcher to optimize parallel code generation. In [89], a rule-based technique using ASTs is also applied in security and code optimization for Java programs, resulting in an Eclipse plug-in tool.
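To convey the general flavor of such rule-based, AST-driven rewriting (independent of Xevolver or OP2-Clang, and using Python’s standard ast module purely for demonstration), the sketch below applies a single hand-written rule that renames a hypothetical deprecated API call.

    import ast

    class RenameCall(ast.NodeTransformer):
        """A single hand-written rule: rewrite calls to old_api() as new_api()."""
        def visit_Call(self, node):
            self.generic_visit(node)
            if isinstance(node.func, ast.Name) and node.func.id == "old_api":
                node.func.id = "new_api"  # hypothetical identifiers, for illustration only
            return node

    source = "result = old_api(1, 2)"
    tree = RenameCall().visit(ast.parse(source))
    print(ast.unparse(tree))  # result = new_api(1, 2)   (requires Python 3.9+)

Real rule-based translators maintain large catalogs of such rewrite rules and target a different language or dialect rather than the same one.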
Statistical methods in source code translation draw from techniques used in statistical machine translation (SMT) for NLP [90]. These approaches comprehend the source code as lexical token sequences, applying statistical language models to map these tokens between source and target languages. For instance, in [90], the source code is modeled using SMT, achieving high BLEU scores when translating Java to C#. Similarly, in [91], a phrase-based language model is applied to translate C# to Java, using a parallel corpus of 20,499 method pairs. In [92], the Semantic LAnguage Model for Source Code (SLAMC) incorporates semantic information, such as token roles, data types, scopes, and dependencies, to improve SMT-based code translation quality.
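A drastically simplified sketch of the phrase-based idea follows: a hand-built phrase table maps a few common Java fragments to rough C# counterparts, whereas real SMT systems such as those in [90,91] learn phrase pairs and their probabilities from parallel corpora.

    # Toy phrase table; real SMT learns phrase pairs and probabilities from parallel code.
    phrase_table = {
        "System.out.println": "Console.WriteLine",
        "boolean": "bool",
    }

    def translate(java_line: str) -> str:
        csharp_line = java_line
        for java_phrase, csharp_phrase in phrase_table.items():
            csharp_line = csharp_line.replace(java_phrase, csharp_phrase)
        return csharp_line

    print(translate('System.out.println("hello");'))  # Console.WriteLine("hello");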
Neural networks, particularly neural language models, have greatly advanced source code translation. Numerous neural language models leverage fine-tuning techniques to excel in code-related tasks. Models like CodeBERT and CodeT5, fine-tuned for such tasks, have achieved SOTA performance [71,75].
Facebook’s TransCoder [79] employs unsupervised learning to translate between Python, Java, and C++, using monolingual word embeddings, DAE, and back-translation (BT). Its self-trained extension, TransCoder-ST [93], incorporates automated unit test generation to preserve semantics during code translation. Unit tests play a crucial role not only in code translation but also in a wide range of programming tasks. As such, they are essential for self-supervised neural models to validate the functional correctness of their predictions [94]. PLBART [95] is another versatile model excelling in multiple code-related tasks, including code translation. It outperforms models like RoBERTa and CodeBERT in Java-to-C# translation, as measured by BLEU and CodeBLEU [96].
Despite this commendable progress, challenges remain. Traditional NLP metrics like BLEU fail to capture semantic correctness in PL code, as variations in coding style can yield low scores despite identical functionality. Effective evaluation must consider both the syntax and semantics of PL code. Additionally, code translation models are often treated as a supplement to other tasks. A dedicated data pipeline and specialized neural models are needed to refine the defects that arise during code translation and to address the unique complexities of source code translation.
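This limitation is easy to demonstrate (a sketch assuming NLTK’s sentence-level BLEU implementation): two functionally identical snippets written in different styles receive a low BLEU score, even though an execution-based check would treat them as equivalent.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = "def add(a, b): return a + b".split()
    candidate = "def add(x, y):\n    result = x + y\n    return result".split()

    # Token-overlap score is low although both functions compute the same result.
    score = sentence_bleu([reference], candidate,
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU: {score:.3f}")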

5.2. Code Generation

Automated code generation is crucial for various programming tasks, with diverse downstream demands building on the generated code [97]. Some methods use code documentation text for code generation, such as doc-strings of Python functions [98], while others integrate back-translation heuristics.
CodeT5 [71], a code model built on top of Google’s T5 model [99], incorporates NL and PL tokens, enabling bimodal (NL-PL) and PL-only training for task-specific code generation and understanding.
ASTs play a key role in structured code generation. Abstract Syntax Description Language-based semantic parsing [100], Neural Attribute Machines [101], and other research works have leveraged AST-based structural representations of a program for code generation. Another approach [102] performs API-driven Java program generation by leveraging combinatorial techniques.
CoDET, short for CODE generation with generated Tests [103], evaluates correctness via test-case based code coverage, although high coverage alone does not guarantee correctness. Execution-based testing such as mutation score evaluation [104] and others offer a more reliable alternative to ensure the correctness of the model-generated codes.
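The execution-based idea can be sketched as follows (an illustrative harness only, with hypothetical test cases; production settings sandbox untrusted model output rather than calling exec directly): the generated candidate is run against unit tests, and the fraction of passing tests serves as a functional-correctness signal.

    candidate = """
    def is_even(n):
        return n % 2 == 0
    """

    tests = [
        ("is_even(4)", True),
        ("is_even(7)", False),
        ("is_even(0)", True),
    ]

    namespace = {}
    exec(candidate, namespace)  # WARNING: run model output only inside a sandbox in practice

    passed = sum(eval(expr, namespace) == expected for expr, expected in tests)
    print(f"{passed}/{len(tests)} unit tests passed")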
Although improvements have been made over time, generated code often requires manual modifications, and ensuring error-free execution remains an ongoing challenge [105]. Code LLMs efficiently enhance the automation of code generation [106,107]; however, executability issues persist.

5.3. Comment Generation

Code comments in general-purpose programming languages provide a high-level understanding of source code written by developers. Neural models can generate descriptive comment texts for a given program [108].
An empirical study in [109] evaluates T5 and n-gram models on code comment completion tasks. The experiments in [110] train a model that learns source code together with its comments written in Japanese, distinguishing between code and comments through procedure learning. In this work, an LSTM is used for comment generation, leveraging problem statements written in natural language to improve program understanding. While the research work in [111] attempts automatic comment generation using NMT models, the performance is suboptimal.
In [108], the Deep Code Comment Understanding and Assessment (DComment) model is proposed to understand and classify generated comments based on their quality. Some studies explore code-to-comment translation [112] as an alternative, while evaluating the generated comments’ quality using NLP metrics. However, performance evaluation issues are still concerning as highlighted in Section 6.2.
Most comment generation models produce English-based comments, which can be a barrier for non-English-speaking programmers. Developing models that can generate comments in a multilingual fashion could enhance collaboration in software development.

5.4. Decompilation

Decompilation reverses compilation, converting machine-level code (binary or assembly code) into high-level source code. Traditional rule-based methods are costly with limited performance, prompting exploration of neural methods as alternatives. The research in [10] trains RNNs on machine code compiled from C, which is later extended into a two-phase approach [113]: generating code snippet templates and then populating them with appropriate identifier/variable values.
Neutron [114] applies attention-based NMT techniques for decompilation, while BTC (Beyond the C) [115] offers a platform-independent neural decompiler supporting multiple languages, including Go, Fortran, OCaml, and C. As reported in [116], researchers have explored intermediate representations (IRs) of programs with seq2seq Transformers for decompilation.
Most existing tools remain rule-based and limited to specific languages, restricting their effectiveness. While a one-fit-for-all decompiler may be infeasible, future research should focus on leveraging efficient neural models to enhance the generalization of this task across diverse languages.

5.5. Code Search and Retrieval

Developing software from scratch is time-consuming, which often leads software engineers to integrate open-source code. However, effective code retrieval requires robust search tools. While traditional search engines excel at NL queries, they struggle with accurate code snippet search. Neural methods offer a promising alternative.
Code-Description Embedding Neural Network (CODEnn) [117] is a deep learning-based code search model that leverages RNN-based sequence embeddings for joint source code and description representations. CARLCS-CNN, short for Co-Attentive Representation Learning Code Search-CNN [118], enhances code search with co-attention mechanism and CNNs to improve search accuracy. The Code-to-Code Search Across Languages (COSAL) model proposed in [119] enables cross-lingual code search using static and dynamic analysis, while Deep Graph Matching and Searching (DGMS) [120] employs Relational Graph Convolutional Networks (RGCNs) and attention mechanisms for unified text-code graph matching.
Traditional information retrieval techniques have also been enhanced with neural approaches. For instance, the study in [121] integrates Word2Vec into retrieval tasks. The Multi-Modal Attention Network (MMAN) in [122] utilizes LSTM, Tree-LSTM, and Gated Graph Neural Networks (GGNNs) to capture syntactic and semantic information. In [123], the authors combine CNNs and joint embeddings for Stack Overflow queries. The Multi-Programming Language Code Search (MPLCS) model proposed in [124] extends retrieval across multiple PLs, benefiting low-resource PLs.
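The retrieval step shared by these embedding-based approaches reduces to ranking candidates by vector similarity, as in the sketch below; a TF-IDF vectorizer stands in for a learned joint NL-code encoder (e.g., a CodeBERT-style bi-encoder) so that the ranking logic remains visible.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Placeholder encoder: TF-IDF stands in for a trained joint NL-code embedding model.
    corpus = [
        "def read_json(path): import json; return json.load(open(path))",
        "def bubble_sort(items): ...",
        "def download_file(url, dest): ...",
    ]
    query = "load a json file from disk"

    vectorizer = TfidfVectorizer()
    code_vecs = vectorizer.fit_transform(corpus)     # one vector per code snippet
    query_vec = vectorizer.transform([query])        # vector for the NL query

    # TF-IDF vectors are L2-normalized, so the dot product equals cosine similarity.
    scores = (code_vecs @ query_vec.T).toarray().ravel()
    for idx in np.argsort(-scores):
        print(f"{scores[idx]:.3f}  {corpus[idx]}")

In a neural setting, the same loop is run over dense embeddings produced by the trained encoder, typically backed by an approximate nearest-neighbor index for scale.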
AI-driven tools like Microsoft Copilot and ChatGPT [3] have revolutionized code search, retrieving relevant snippets via NL prompts. Unlike traditional engines, ChatGPT bypasses irrelevant search results but may still return non-executable code. Future improvements could integrate SE tools with neural models to retrieve code that is ready to execute without errors, enhancing usability and reliability in software development.

5.6. Code Completion

IDEs with auto-completion features significantly enhance software development productivity [125]. Various neural methods have been explored for code completion and related tasks. In [126], LSTMs with attention mechanisms were used for code completion and a group of related coding tasks. The study in [66] proposes a pre-trained Transformer-based model with multitask learning to predict code tokens and their types, which is finally fine-tuned for efficient code completion.
Microsoft’s IntelliCode Compose [127] is a cloud-based tool that suggests entire lines of code with correct syntax. This model provides monolingual embeddings while leveraging multilingual knowledge, benefiting low-resource programming languages. Additionally, it enhances privacy by preventing exposure of sensitive information, addressing a key limitation in prior auto-completion tools.
Meta’s research work in [128] demonstrates the effectiveness of transfer learning for code auto-completion. Their approach is built on top of the GPT-2 and BART Transformer-based models, and applies auto-regressive and DAE objectives to improve predictions.
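The core inference loop behind such completion engines can be sketched with the Hugging Face transformers API; the generic gpt2 checkpoint is used here purely as a stand-in for a code-specific autoregressive model.

    from transformers import AutoTokenizer, AutoModelForCausalLM

    # gpt2 is only a placeholder; production assistants use models pre-trained on code.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prefix = "def fibonacci(n):\n    "
    inputs = tokenizer(prefix, return_tensors="pt")

    # Autoregressive decoding: the model extends the code prefix token by token.
    outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))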
A critical concern in code completion is privacy and security as noted in [127]. Furthermore, since prompts used for code completion tasks often have similar structures, there is a risk that generated programs will share similar designs, potentially introducing common vulnerabilities. Future research should address these concerns by enhancing model diversity and privacy safeguards in automated code completion. Additionally, most current code auto-completion assistants struggle to retain the long-context of earlier parts of the code written by the programmer. As a result, they often suggest generic and repetitive line completions. To enable more intelligent and context-aware assistance, it is essential to develop code completion systems capable of capturing long-range dependencies, potentially across an entire file or even multiple files within a project. Such capabilities are critical to providing seamless and relevant code suggestions.

5.7. Automatic Code Edit

Editing and fixing awkward parts of code structures are routine yet time-consuming tasks for programmers. These repetitive activities consume significant effort, but recent neural models aim to automate them [129].
CodeEditor [130] is a pre-trained code editing model whose results demonstrate improved performance and generalization capabilities in code-editing tasks. It consists of three stages: collecting code snippets from repositories like GitHub, generating inferior versions (versions somewhat different from the ground truth) of the code for pre-training, and evaluating the pre-trained model under three settings, namely fine-tuning, few-shot, and zero-shot.
The multi-modal NMT-based code editing engine referred as MODIT in [9] processes code in three steps: preprocessing, token representation via an encoder–decoder attention mechanism, and output generation. The preprocessing phase integrates three input modalities: code location, contextual information, and the commit messages, which guide the editing process.
A major challenge in this field is the lack of stable benchmark datasets for source code editing. The exponential growth of demands on software development further compounds this issue, as modern applications span huge numbers of lines of code, making manual code editing tiresome. For automatic code-editing models to be effective and scalable in complex coding environments, they should be trained on diverse and inclusive source code corpora. Addressing this gap is crucial for advancing automated code editing techniques.

5.8. Code Summarization

Programmers frequently read source code written by others, which requires a detailed understanding of the program at hand. Automating this process through code summarization, that is, generating human-readable descriptions for code snippets, can significantly enhance efficiency. Neural methods have shown great success in this area.
M2TS, short for Multi-Scale Multi-Modal Approach Based on Transformer for Source Code Summarization [131], combines AST structures and token semantics through a cross-modality fusion approach for improved code summarization. Another model in [132] applies reinforcement learning with triplet code representations, namely CFGs, ASTs, and plain text, to enhance code summary generation.
In [133], a hybrid model combines token sequences (encoded using c-encoder) with semantic graphs (encoded using g-encoder) for richer context. Structural information from code snippets has inspired research such as AST-Trans [134], which transforms AST graphs into sequences using traversal and linearization techniques, thereby reducing computational costs. In [135], the model predicts action words from code blocks, identifying their intent and job class to enhance summarization.
Although significant improvements have been achieved in the research area of code summarization, evaluating the performance of code summarization models remains challenging. NLP-based metrics fail to capture source code semantics, often leading to misleading results. Research into specialized evaluation metrics tailored for source code is essential to ensure accurate and meaningful assessments of code summarization models.

5.9. Code Change Detection

Tracking updates in large-scale software projects is challenging, especially when development is distributed across multiple teams, each producing frequent commits and updates. Neural methods have been introduced to assist programmers in efficiently managing code changes.
CORE, short for COde REview engine [136], employs LSTM models with multi-embedding and multi-attention mechanisms to review code changes, effectively capturing semantic modifications. Another model, CC2Vec [137], represents code changes by analyzing modifications across updated source code files. It preprocesses change patches by tokenizing modified code blocks and constructing a code vocabulary. Structural information from added and removed code is then processed via a Hierarchical Attention Network for better representation.
IABLSTM (short for Impact Analysis with Attention Bidirectional Long Short-Term Memory) [138] utilizes a Bidirectional Long Short-Term Memory (Bi-LSTM) with an attention mechanism to detect source code changes. It identifies differences between the original and modified code using cosine similarity, leveraging AST paths and vectorized representations. Another approach in [139] applies NMT techniques to track meaningful code modifications in pull requests, aiding collaborative development.
Along with the previous efforts, code change detection remains an active research area. A promising direction is integrating these models with push notification systems within IDEs, providing real-time alerts about modifications. This would enhance efficiency in team-based software development, ensuring developers stay informed seamlessly.

5.10. Code Similarity Detection

The rising demand for software applications has led to increased source code duplication, where the same code is reused across projects. This issue undermines creativity, infringes upon intellectual property rights, and raises privacy concerns. Neural methods have emerged as promising solutions for detecting code similarity.
The model in [11] applies NLP techniques for code similarity detection. It preprocesses source code through stemming, segmentation, embedding, and feature extraction, generating vectorized representations of code pairs. Cosine distance is then used to compute similarity scores. Another approach in [140], focusing on Scratch, a visual PL, utilizes a Siamese Bi-LSTM model to capture syntactic and semantic code similarities, and the Manhattan distance metric is used to evaluate similarity between Scratch files.
Cross-language code similarity detection is another active research area. The approach in [141] transforms source code of multiple PLs into control flow charts, applying graph similarity detection techniques to compare them.
A major challenge in code similarity detection research is the reliability of datasets. Open-source repositories may contain default IDE-generated templates, leading to unintentional similarities. Future research should focus on differentiating manually written and machine-generated code to enhance code similarity detection. Addressing these challenges is crucial for maintaining ethical and professional standards in software development.

5.11. Program Synthesis

Ideally, computer-literate end users should be able to interact with machines using clear NL instructions. However, for this to be possible, intelligent systems are needed to convert NL instructions into executable code. Program synthesis [142,143] addresses this challenge by developing models that automatically generate programs based on user-defined instructions.
PATOIS [144] is a program synthesis model which incorporates code idioms via structural components called probabilistic Tree Substitution Grammars (pTSG). Its encoder embeds NL specifications, while its decoder generates token representations of ASTs. LaSynth [145] is an input–output-based program synthesis model, focused on compiled languages like C, by integrating latent execution representations. During training, its loss function combines latent executor loss and token prediction loss. The model also generates input–output pairs to improve supervised program synthesis approaches.
CodeGen [146] is a Code LLM from Salesforce that employs an auto-regressive approach similar to GPT models, predicting code tokens based on prior prompts. Its multi-turn programming benchmark is designed to scale problem sets according to the size of the model and dataset.
However, program synthesis model evaluation benchmarks remain an open challenge. Some studies propose synthetic datasets [147], while others generate domain-specific input–output examples [148]. Further research is needed to establish standardized and versatile evaluation benchmarks, ensure the executability of source code generated by program synthesis, and develop optimal performance evaluation metrics.

5.12. Code Modeling and Representation

This part of our survey presents existing approaches to code representation and modeling in neural-based programming tasks [149]. Effective programming tasks require robust code representation and modeling techniques [150], as these representations directly influence how neural models interpret and process source code.
Several studies have explored different code representation techniques. In [151], source code is represented as pair-wise AST node paths, offering a generalizable approach. Another work [152] employs IR and contextual flow graphs (XFG) to enhance semantic code representation. Graph-based methods have also gained attention in code representation research. Message passing and grammar-based methods in [153] represent the semantics of code structurally. GraphCodeBERT [69] leverages data flow structures to model relationships between variables and computational flow. The Open-vocabulary neural language model (Open Vocab NLM) in [154] integrates Gated Recurrent Units (GRUs) and sub-word tokenization, enabling it to process billions of tokens dynamically. AST-based approaches are advanced in [155], where ST-trees mitigate long-range dependency issues using bidirectional GRUs. Flow2Vec [156] converts code to low-dimensional vector representations using high-order proximity embeddings. CodeDisen [157] leverages a variational autoencoder to disentangle syntax and semantics across PLs. SPT-code [158] incorporates code sequences, ASTs, and NL descriptions for tasks like summarization, completion, bug-fixing, translation, and code search. Another model in [116] aligns embeddings of different PLs augmented with their IR representations and actual source code tokens for the code translation task.
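To make the idea of AST-based code representation concrete, the following minimal Python sketch (our illustrative example, not a reproduction of any surveyed model's pipeline) uses the standard ast module to linearize a snippet's syntax tree into a sequence of node-type tokens, one common starting point for structural code representations:

```python
import ast

def ast_token_sequence(source: str) -> list:
    """Linearize a Python snippet's AST into a sequence of node-type names.

    This simple traversal is a stand-in for the richer AST paths,
    data-flow graphs, or IR-based views discussed above.
    """
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]

snippet = """
def add(a, b):
    return a + b
"""
print(ast_token_sequence(snippet))
# e.g., ['Module', 'FunctionDef', 'arguments', ..., 'Return', 'BinOp', ...]
```

Such linearized structural sequences can then be fed to sequence models, or replaced by graph encoders when edge information (e.g., data flow) must be preserved.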
This survey highlights a wide range of representations, from raw source code tokens, ASTs, and CFGs to IR and DFG. While combining various code representations often improves comprehension, systematic research is needed to determine the optimal combination of these code representation options. Enhancing neural-based code modeling relies on robust, task-specific representation strategies. Hence, exploring the effective augmentation of various code representation alternatives could significantly broaden the research in this field.

5.13. Code Classification

Understanding and identifying source code across multiple PLs requires intuitively grasping the objectives and distinguishing features of programs. Traditional methods, such as manual cross-checking or using predefined rules, are effective but impractical for large-scale projects, especially for developers unfamiliar with multiple PLs. Code classification offers an efficient solution to these challenges by categorizing source code based on syntax, semantics, or other defined characteristics.
In [159], Multinomial Naive Bayes (MNB) is used as the classifier, and Term Frequency-Inverse Document Frequency (TF-IDF) is employed for feature extraction. This method classifies code snippets from 21 PLs using the Stack Overflow corpus and identifies specific language versions. Deep learning has further improved classification accuracy. For example, Reyes et al. [160] employ an LSTM with word embeddings and dropout regularization, outperforming traditional classifiers. The research in [161] uses CNNs to classify source code across 60 PLs, achieving high F1 scores. A CNN-based approach in [162] predicts source code categories based on algorithmic structures rather than keywords, using an online judge system as its corpus. In [163], the authors leverage topic modeling and JavaParser preprocessing to identify code block functionalities for Java code classification. Beyond standard code classification tasks, Barr et al. [164] combine deep learning with combinatorial graph analysis to detect and classify code vulnerabilities, using code2vec and LSTM embeddings. The research work in [165] classifies semantic units of machine learning code using LSTM-GRU with attention mechanisms, applying MixUp augmentation to overcome data scarcity.
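As a minimal illustration of a TF-IDF plus Multinomial Naive Bayes pipeline for language classification (a toy sketch with hypothetical snippets, not the exact setup of [159]), the following Python example uses scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training corpus: snippet text paired with its language label.
snippets = [
    "def main():\n    print('hello')",          # Python
    "public static void main(String[] args)",   # Java
    "console.log('hello');",                    # JavaScript
    "printf(\"hello\\n\");",                    # C
]
labels = ["python", "java", "javascript", "c"]

# Character n-grams tend to be more robust than word tokens for source code.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    MultinomialNB(),
)
model.fit(snippets, labels)

# With a realistic training corpus, this should predict 'java'.
print(model.predict(['System.out.println("hi");']))
```

In practice, such pipelines are trained on far larger corpora; the sketch only shows how the feature extraction and classifier components fit together.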
The code classification methods reviewed so far fall into four broad categories, namely algorithm or problem-domain based, language based, code-functionality based, and vulnerability based, as shown in Figure 5. Additional criteria, such as coding paradigms (imperative, declarative, functional, and object-oriented), could further supplement code classification models. Future research should focus on expanding classification taxonomies, ensuring a comprehensive categorization basis for an all-inclusive code classification task.

5.14. Code Vulnerability Detection

Source code vulnerabilities arise when errors and security flaws remain unpatched, making automated detection essential for mitigating risks. In [166], CNN and RNN deep learning models are utilized to extract feature representations and detect vulnerabilities in C and C++ code.
The Automated Vulnerability Detection framework based on Hierarchical Representation and Attention Mechanism (AVDHRAM) proposed in [167] improves vulnerability detection by structuring source code into five hierarchical levels: program, function, slice, statement, and token. However, the primary focus of this model is statement-level analysis enhanced by attention mechanisms. Another approach by Mao et al. [168] introduces a Bi-LSTM model that serializes ASTs while incorporating attention mechanisms to improve accuracy in classifying vulnerable functions. Transformer-based vulnerability detection presented in [169] uses fine-grained code slices for analyzing function calls, pointers, and expressions. In [170], three deep learning models combined with multiple code representations improve vulnerability detection.
Graph-based models like the one in [171] employ GGNN and inductive learning, using Common Vulnerability and Exposure (CVE) functions to create code slice graphs. In this work, the attention layer effectively identifies relationships among graph nodes, enhancing vulnerability detection. Wang et al. [172] combine gated graph networks and control flow analysis to distinguish benign from buggy code, addressing data scarcity by leveraging training data from open-source repositories.
Despite remarkable progress in this area, vulnerability detection models still face challenges such as false positives and limited generalization to unseen flaws. Identifying vulnerabilities alone is insufficient; thus, classification according to severity, type, and potential impact is crucial for comprehensive vulnerability management. Integrating neural models with static and dynamic program analysis or formal verification [173] could improve robustness. Future research should explore these integrations experimentally, advancing comprehensive vulnerability detection methodologies.

5.15. Code Analysis

Source code analysis involves multiple stages, serving various purposes and applications [174]. Prior research has explored different attributes for code quality assessment. In [175], static and dynamic properties of CFGs are utilized for code analysis. Ramadan et al. [176] perform code analysis by evaluating the execution speed of a particular program based on AST structures, generating code pairs to identify the snippet with faster execution speed.
A dynamic program analysis algorithm proposed in [177] leverages AST and CFG representations, storing contextual information in Translation Units. In [178], a hybrid analysis tool combines historical and structural schemes, featuring a Java-based parser, a Python-based Code History Miner, and an interactive interface. GENESISP, introduced in [179], integrates source and binary code data for Debian-based environments, employing 13 analysis tools. Though its neural basis is unclear, its user interface facilitates database collection and analysis of open-source software.
Although the current research in source code analysis is commendable, most neural models rely heavily on AST representations. While ASTs capture structural aspects, they fail to convey comprehensive functional and semantic information about the code. Therefore, integrating multiple representations alongside AST information can enhance the robustness of source code analysis.

5.16. Code Authorship Identification

With growing concerns over intellectual property in software development, accurate code authorship identification is essential for innovation and guaranteeing ownership rights [180]. Several studies have explored neural-based approaches to tackle this problem.
Abuhamad et al. [181] introduce a CNN-based method utilizing word embeddings and TF-IDF representations, demonstrating promising authorship attribution results on the Google Code Jam (GCJ) dataset. Kurtukova et al. [182] propose a hybrid neural network combining CNN-Bidirectional GRU (C-BiGRU), LSTM, Bi-LSTM, and other models, training on vectorized datasets before applying them to anonymized source code for authorship identification. Expanding on this work, the authors of [180] tackle more complex real-world scenarios such as multi-language authorship attribution, coding style variations, and AI-generated code identification. Omi et al. [183] introduce a model capable of identifying multiple contributors to a single codebase by converting code snippets into AST paths and using ensemble classifiers. A CNN-based model, along with a KNN classifier, is presented in [184], emphasizing explainability in authorship attribution.
A key challenge in authorship attribution is the variability in coding styles across different PLs. Programmers often write in multiple styles, complicating the identification task. Moreover, IDEs generate code templates, such as ASP.NET MVC scaffolding in Microsoft Visual Studio, and AI assistants can also produce large amounts of code, making it difficult to distinguish programmer-written from machine-generated code. Although progress has been made, future work should address evolving programming practices, mixed data sources, and auto-generated code to achieve robust authorship attribution and identification.

5.17. Program Repair and Bug Fix

Troubleshooting errors and fixing bugs in source code are daily challenges for every programmer. Manually handling these tasks for large codebases is cumbersome, necessitating automated tools. Ensemble Learning using Convolution Neural Machine Translation for Automatic Program Repair (ENCORE) is proposed in [185], leveraging a seq2seq encoder–decoder architecture with an attention mechanism and ensemble learning. The model follows three stages: input representation (tokenization), training ensemble models, and validating patches. ENCORE successfully repairs code errors in four PLs and shows potential adaptability for more PLs. Liu et al. [186] address lexical and semantic gaps between bugs and their corresponding fixes by employing lexical and semantic correlation matching, combined with focal loss, to tackle data imbalance in buggy and non-buggy classifications.
While most SOTA approaches fine-tune the pre-trained NLP models for programming tasks, Jiang et al. [187] pre-train a pure PL model on large codebases, then fine-tune it for APR. Their approach introduces a code-aware beam search strategy to manage syntax constraints and sub-word tokenization to mitigate out-of-vocabulary (OOV) issues. CIRCLE (Continual Repair aCross Programming LanguagEs) [188] is a cross-language APR model, which extends T5-based pre-trained NLP models with continual learning to repair code across multiple PLs. This approach enhances adaptability and generalization.
In real-world software development, bugs are often fixed through post-failure measures. However, integrating self-healing mechanisms [189] with bug detection and fault localization [190] could revolutionize APR by preventing data loss or system crashes. Despite their promise, LLMs and Code LLMs struggle with runtime bugs unseen during their training stage. Addressing this requires integrating LLMs with external validation systems that provide automatic feedback on invalid patches. Ongoing research explores patch validation and refinement [189], paving the way for end-to-end APR frameworks that enhance the entire software development life cycle.

5.18. Code Clone Detection

Most modern software applications rely heavily on code reuse throughout their development. While this practice accelerates software development and productivity, it can lead to issues such as code bloating and software quality degradation, collectively referred to as code clone problems. Code clone detection models address these challenges. For instance, the research work in [191] proposes a CNN-based model with two convolutional and pooling layers, evaluating it for code clone detection on the BigCloneBench dataset.
Zhang et al. [192] design a clone detection method that aligns similar functionalities between source code snippets, even when their structure is different. This model uses sparse reconstruction and attention-based alignment techniques, relying on similarity scores to detect clones. Meanwhile, Zeng et al. [193] focus on computational efficiency, significantly reducing runtime. They use a weighted Recursive Autoencoder (RAE) and process source code in two phases: feature extraction to generate program vectors and clone detection via Approximate Nearest Neighbor Search (ANNS). They use Euclidean distance as a similarity metric. Functional clone detection, which identifies functionally equivalent but different code implementations, is explored in [194]. This hybrid model combines sequential tokens, syntactical ASTs, and structural CFGs to represent the source code and is evaluated on Java-based clone corpora. Recent research has expanded to cross-language clone detection as demonstrated in [195], which identifies clones across multiple PLs.
While the term code clone detection sometimes overlaps with code similarity detection, the two serve distinct purposes. Although code clone detection can be considered a subset of code similarity detection, many studies use these terms interchangeably, which may confuse readers. Future research should clearly distinguish these two tasks while developing effective clone detection strategies for large-scale projects. Finally, leveraging LLMs or Code LLMs in code clone detection tasks holds promise for enhancing accuracy and scalability. By adopting up-to-date methodologies and expanding cross-language code clone detection capabilities, future research can offer robust solutions and clarify the distinction between code clone detection and code similarity detection, thereby improving software quality and maintainability.
To offer a deeper analytical view, we further grouped neural methods across the programming tasks based on their architectural families and learning paradigms. Broadly, these include sequence-based models (e.g., RNNs, LSTMs, and GRUs), Transformer-based large language models (e.g., CodeBERT, CodeT5, and PLBART), and graph-based neural networks (e.g., GGNN and RGCN). Each group exhibits distinct strengths and limitations. Sequence-based models excel at capturing short-range dependencies but struggle with long-contextual understanding; Transformer-based architectures handle long-range dependencies effectively, but they demand extensive computational resources and training data; graph-based approaches model code structure more explicitly but often sacrifice generalization due to task-specific designs. Table 6 summarizes the overall comparison of these neural methods’ categories in the context of programming-centric SE tasks.

6. Data, Benchmarks, and Evaluation Metrics

6.1. Datasets and Benchmarks

The scientific literature on the applications of neural networks in programming and SE primarily focuses on model designs and architectural advances. Neural networks form the foundation of deep learning and of models with large parameter sizes. The emergence of LLMs and Code LLMs, with a large number of parameters, marks a significant milestone in the field. However, a review of publication trends reveals that datasets and benchmarks often receive less emphasis and are treated as supplementary rather than core components of the research. This treatment neglects the crucial role of data in the advancement of deep learning and AI, particularly in programming tasks, where high-quality training data is essential for model effectiveness.
In this context, we take a chronological deep dive into the major datasets and benchmarks introduced for programming tasks. One such example is the Django dataset introduced in [196], which is a parallel corpus of Python code and corresponding NL text descriptions. Salesforce's encoder–decoder architecture-based CodeT5 model and a retrieval-augmented model, kNN-TRANX, were among the models evaluated on this dataset [197]. However, this dataset is small (18,805 pairs) and narrowly scoped, ultimately supporting only Python-related code generation tasks. Two years later (in 2017), a data corpus consisting of 108,726 Python Code and Doc-string pairs (PCSD) was released; its latest version incorporates class declarations, class methods, module doc-strings, and commit SHAs [98,198]. The M2TS model for code summarization introduced in Section 5.8 is among the models evaluated on this dataset [131]. However, the literature exploring this dataset indicates that the PCSD corpus contains some noise and that its comments (doc-strings) often lack detail.
The CoNaLa dataset is a benchmark comprising approximately 2.4 K curated pairs of English-Python data, extracted from Stack Overflow’s programming-related question–answer discussions and documentation [198]. Several neural models were evaluated on this dataset, such as CodeLLaMA, CodeT5+, and CodeGen among others [199].
Another notable dataset is Devign, presented at NeurIPS 2019 and consisting of 27,652 vulnerable and 31,313 non-vulnerable C language functions [200]. As the name suggests, this dataset is dedicated to identifying vulnerability patterns in C programs. It was manually curated from four popular open source C-language projects. The dataset is well-regarded for its quality in vulnerability identification tasks, as data collection and labeling were performed by a team of security experts, requiring approximately 600 man-hours.
Given that neural methods for programming tasks are still evolving, data processing methodologies are largely derived from NLP. Consequently, most code corpora utilized in research on neural methods in SE are dominated by code generation from NL text [201]. Notable examples include datasets mapping NL to SQL code [202]. In recent related studies, other open-source datasets have been designed and released for both NL-to-code and code-to-code generation tasks [203,204,205]. Semantic parsing is integral to processing code corpora, while static parsing methodologies [206] aid in understanding structural code information.
Recently, numerous large-scale code-specific datasets have been released to boost the neural methods involved in the programming tasks [207,208]. For instance, in [209], a Python class-level code generation dataset containing 842K class skeletons (methods and attributes with doc-strings) has been released. This dataset was extracted from 13,174 real-world GitHub projects.
Online coding competition platforms [210], repositories like GitHub, and technical Q&A forums such as Stack Overflow [198] are key sources of code corpora. However, to date, many of the code corpora released by researchers and practitioners lack standardized dataset pipelines and detailed documentation. In response, commendable efforts have been made to provide well-designed datasets through the HuggingFace Datasets Hub [211].
However, the development of standardized dataset pipelines for programming tasks is still in its early stages. Although many processed code-related datasets are available via the HuggingFace Datasets API, several challenges arise when exploring datasets for the same task. For instance, datasets for tasks like vulnerability detection, code translation, or code generation may come from different contributors. These datasets often suffer from inconsistent data formats, varying feature sets, and undocumented preprocessing steps, making reproducibility difficult for future researchers.
To summarize, Table 7 provides an overview of a significant number of datasets used in programming and SE tasks, arranged in chronological order of their release. For each dataset, it lists the models evaluated on it, the targeted tasks, and the evaluation metrics used to measure model performance. Critical analyses of the various evaluation metrics, along with their mathematical formulations, are provided in Section 6.2 below.

6.2. Evaluation Metrics

The progressive development of neural models for programming tasks, particularly those trained on code, is heavily influenced by NLP techniques. Even with the emergence of LLMs and Code LLMs, capable of interpreting code across various PLs, NLP-based evaluation methods remain widely adopted [49]. While not ideally suited for programming tasks, these metrics serve as a foundational baseline in the absence of standardized, code-specific evaluation frameworks. However, recently emerging code-specific evaluation metrics appear promising for evaluating the rapidly growing range of programming tasks. Based on the evaluation strategies utilized in the major scientific studies reviewed in this survey, the metrics commonly employed to assess model performance in programming tasks can be categorized into five major groups as detailed below.

6.2.1. Text-Match and N-Gram Overlap-Based Metrics

Match-based or n-gram overlapping metrics assess model performance by quantifying the surface-level word matching (n-gram overlap) between model-generated outputs and ground-truth reference outputs. One of these metrics is the Exact Match (EM), which allows no deviation between the generated and reference code. Due to its rigidity, EM is generally not recommended for code generation tasks [215].
A more widely adopted metric is BLEU, originally developed for machine translation [216]. BLEU has been applied to programming tasks such as code translation and summarization. However, it overlooks key syntactic and semantic characteristics inherent in PLs.
ROUGE is another metric in this group, primarily used for evaluating code summarization and comment generation. It relies on matching the longest common subsequence between the generated and reference outputs [217]. However, like BLEU, it fails to account for semantically equivalent yet syntactically different code snippets.
To address such limitations, CodeBLEU incorporates both syntactic and semantic information through AST matching and data-flow matching [96]. Another alternative evaluation metric is CrystalBLEU, which improves upon BLEU and CodeBLEU by ignoring trivial n-gram overlaps that may result in a misleading match between code snippets with different logical purposes. CrystalBLEU has shown efficiency improvements ranging from 1.9 to 4.5 times over BLEU and CodeBLEU and also outperforms other BLEU variants such as Ruby [218,219]. METEOR, another widely used metric in this category, incorporates synonymy and stemming and has been adapted to certain code evaluation tasks [220].
Some of the core formulations for these metrics are provided below.
Assume that $N$ denotes the maximum n-gram length (typically 4); $w_n$, commonly set to $1/N$, represents the weight for each n-gram precision $p_n$; $p_n$ is the modified n-gram precision; and $\mathrm{BP}$ is the brevity penalty. The BLEU score is computed as follows:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)$$

$$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ \exp\left(1 - \dfrac{r}{c}\right) & \text{if } c \le r \end{cases}$$

where $c$ and $r$ denote the lengths of the candidate and reference outputs, respectively.
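The following Python sketch (a simplified illustration with small-constant smoothing, not a reference implementation of [216]) computes sentence-level BLEU using the modified n-gram precision and brevity penalty defined above:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4, eps=1e-9):
    """Sentence-level BLEU for tokenized candidate/reference sequences."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Modified precision: clip candidate n-gram counts by reference counts.
        overlap = sum(min(cnt, ref_counts[g]) for g, cnt in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(overlap / total, eps))  # eps avoids log(0)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    # Uniform weights w_n = 1/N, i.e., the geometric mean of the precisions.
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

tokens = "def add ( a , b ) : return a + b".split()
print(bleu(tokens, tokens))  # identical sequences -> 1.0
```

CodeBLEU extends this token-level score with AST and data-flow matching components, as formulated next.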
The CodeBLEU score integrates semantic and structural signals. Assuming $\mathrm{BLEU}_{\mathrm{weight}}$ denotes the token-wise weighted n-gram precision, $\mathrm{Match}_{\mathrm{ast}}$ and $\mathrm{Match}_{\mathrm{df}}$ compute the syntactic and semantic matches, respectively, and $\alpha$, $\beta$, $\gamma$, $\delta$ are tuning hyperparameters, the CodeBLEU score is computed as follows:

$$\mathrm{CodeBLEU} = \alpha \cdot \mathrm{BLEU} + \beta \cdot \mathrm{BLEU}_{\mathrm{weight}} + \gamma \cdot \mathrm{Match}_{\mathrm{ast}} + \delta \cdot \mathrm{Match}_{\mathrm{df}}$$
Despite improvements, all these metrics primarily evaluate surface-level match and lack insight into the actual execution or correctness of the model-generated code.

6.2.2. Embedding-Based Similarity and Learned Metrics

Embedding-based and learned metrics aim to capture semantic similarity between generated and reference code by comparing their vector representations or contextual embeddings. These metrics are well-suited for tasks such as code summarization [221], clone detection, code generation, and code search.
A foundational metric in this category is Cosine Similarity (Embedding CosSim), which ranges from $-1$ to $+1$ and measures the angular similarity between two vectors (e.g., $P$ and $Q$):

$$\cos(P, Q) = \frac{P \cdot Q}{\|P\| \, \|Q\|}$$

where $P \cdot Q$ denotes the dot product and $\|\cdot\|$ is the vector norm. Cosine similarity has been widely used in assessing models such as CodeT5+ and the GPT-series models; recent evaluations of these models include Verilog code generation and comprehension tasks among the notable examples [222].
CodeBERTScore, built upon cosine similarity, utilizes pre-trained CodeBERT embeddings to compare code/comment pairs and has demonstrated high alignment with human judgment [223]. However, its applicability is restricted to a few supported PLs.
Euclidean distance is another metric, defined for two vectors $P$ and $Q$ as shown below:

$$d(P, Q) = \sqrt{\sum_i (P_i - Q_i)^2}$$

Manhattan distance, also known as the L1 distance, is defined as

$$d(P, Q) = \sum_i |P_i - Q_i|$$
These two distance metrics are often employed in code similarity and clone detection tasks where structural information is less critical.
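A minimal NumPy sketch of the cosine, Euclidean, and Manhattan measures discussed above (with purely illustrative embedding vectors) is shown below:

```python
import numpy as np

def cosine_similarity(p, q):
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

def euclidean_distance(p, q):
    return float(np.linalg.norm(p - q))

def manhattan_distance(p, q):
    return float(np.sum(np.abs(p - q)))

# Hypothetical embeddings of two code snippets (e.g., from a code encoder).
p = np.array([0.12, 0.80, 0.31])
q = np.array([0.10, 0.75, 0.40])

print(cosine_similarity(p, q))   # close to 1.0 -> semantically similar
print(euclidean_distance(p, q))  # small value -> nearby in embedding space
print(manhattan_distance(p, q))
```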
Another important metric is MRR commonly used in code search and retrieval-augmented generation applications. MRR computes the average of the inverse ranks of the first correct result and has been widely adopted in evaluations on CodeSearchNet [224].
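For code search, MRR can be computed as in the following sketch, where each query is associated with the 1-based rank of its first relevant result (the ranks here are hypothetical):

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """first_relevant_ranks: rank of the first correct snippet per query,
    or None if no correct snippet was retrieved."""
    total = sum(1.0 / r for r in first_relevant_ranks if r is not None)
    return total / len(first_relevant_ranks)

# Three queries: correct snippet found at ranks 1 and 3, and not found at all.
print(mean_reciprocal_rank([1, 3, None]))  # (1 + 1/3 + 0) / 3 ≈ 0.444
```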
BLEURT, a learned regression-based metric initially introduced for NLP evaluation, has recently been adapted for code summarization tasks [225]. Other metrics in this category include BARTScore, GPTScore, and NDCG, each tailored for specific code-related tasks.
Despite their advantages, these metrics face the following limitations:
  • They are unsupervised and depend heavily on the quality of the embeddings.
  • They often treat code and comments as plain text, disregarding structural and semantic intricacies such as AST nodes and API hierarchies.
Hence, while embedding-based and learned metrics enhance evaluation by moving beyond surface-level token matching, the growing complexity and diversity of programming tasks suggest that structure-aware, execution-based metrics may provide more accurate and holistic assessments.

6.2.3. Classification Metrics

Classification metrics broadly span from binary classification to multi-class and sequence-level accuracy. They are used to evaluate the correctness of classifiers on specific code-related tasks within the context of this survey. The values of classification evaluation metrics typically range from 0 to 1, or 0 to 100%. These evaluation metrics such as accuracy, precision, recall, F1, and AUC (Area Under the ROC Curve) are widely recognized as standard and reliable classification metrics not only in programming and code-related tasks but also across various other research domains involving classification activities. Several tasks such as code clone detection, code smell detection, source-code classification, and code vulnerability detection depend on such evaluation metrics [166,200,226,227,228,229,230].
Most classification metrics are derived from the following four fundamental attributes of a classification model’s inference results:
  • True Positive (TP): represents the correct prediction of the positive cases.
  • True Negative (TN): represents the correct prediction of the negative cases.
  • False Positive (FP): represents the negative cases that are predicted incorrectly as positive ones.
  • False Negative (FN): represents the positive cases that are predicted incorrectly as negative ones.
The accuracy, recall, precision, and F1 score evaluation metrics are computed as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{F1\ score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Different code-related tasks utilize evaluation metrics based on various scenarios. The accuracy metric is generally recommended when the data classification labels are balanced across classes. The precision metric is best suited for evaluating tasks where minimizing false positives is critical, such as in bug detection since frequently flagging bugs can be costly in such cases. In contrast, the recall metric is essential in tasks where minimizing false negatives is important, such as in assessing overall security coverage, where it is crucial to ensure that all security vulnerabilities are detected. In general, the F1 score and AUC metrics are considered effective in situations where data imbalance is a concern [231].
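The following sketch computes these four metrics directly from the confusion-matrix counts defined above (the counts are illustrative, e.g., for a hypothetical vulnerability-detection run):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical results: 80 TP, 900 TN, 40 FP, 20 FN.
print(classification_metrics(tp=80, tn=900, fp=40, fn=20))
```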

6.2.4. Security-Specific and Threat-Coverage Metrics

The evaluation metrics in this category are commonly used to assess neural models involved in program security analysis and threat prevention tasks. While a variety of metrics can be applied to such tasks, these particular metrics are distinguished by their specific areas of focus, which are often evident from their name and their computation terminologies. Threat Coverage (also known as vulnerability class coverage) serves as a primary example of this group of metrics. It measures the number of known categories of threats or vulnerabilities that a neural model can address or exploit during bug-hunting tasks. The core idea behind this evaluation metric is to quantify the diversity of threats a particular neural model can handle. As recently noted by Google DeepMind, analyzing security threat chains should be a fundamental building block of cyber-security systems, particularly in response to attacks initiated by AI-based incidents [232].
The Vulnerability Detection Score is a pairwise evaluation metric in this category [233]. It assesses a model’s ability to distinguish between vulnerable and non-vulnerable code, with a focus on minimizing the FN rate while keeping the FP rate below a particular fixed threshold. The Vulnerability Occurrence Rate is another example of a security analysis metric. It represents the fraction of code snippet samples in a dataset that contain at least one vulnerability. The study in [234] also explores how other metrics correlate with the vulnerability occurrence rate. Another evaluation metric in this category is the Attack/Exploit Success Rate, defined as the number of successful attacks divided by the total number of attack attempts [235]. Additionally, there are other criteria for evaluating the performance of neural models, such as plausibility, fixed bugs count, and correctness of fixed bugs. Most of these require manual intervention to make a final judgment.

6.2.5. Execution-Based and Functional Correctness Metrics

Execution-based evaluation metrics are among the most suitable metrics for programming-centric tasks that produce code snippets as model outputs. The top-k metrics, such as Pass@k, which are popular program-execution-based evaluation methods, assess a model's performance by determining whether any of the k generated programs pass the given tests. In essence, Pass@k checks whether at least one of the top k model-generated programs passes all test cases, considering the problem solved if any of the k candidate code snippets is correct.
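A minimal sketch of the commonly used unbiased estimator of Pass@k (popularized by the HumanEval evaluation) is shown below, assuming n generated samples per problem of which c pass all tests:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of Pass@k given n samples with c correct ones."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    # 1 - C(n-c, k) / C(n, k), computed in a numerically stable product form.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 samples per problem, 3 of them pass the tests.
print(pass_at_k(n=20, c=3, k=1))   # 0.15
print(pass_at_k(n=20, c=3, k=10))  # substantially higher (~0.89)
```

In practice, this per-problem estimate is averaged over all problems in the benchmark.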
Similarly, the research work in [236] introduces three metrics—Success@k, Build@k, and Average Pass Rate—for evaluating code translation performance across entire repositories. Given M repositories in repository-level code translation benchmarks, we discuss these three metrics below along with their mathematical formulations.
Success@k (Test Pass Rate across Multiple Trials) measures the proportion of repositories that successfully pass all test cases in at least one of the selected $k$ experimental rounds. Given a set $T_j$ of repositories that pass all test cases in the $j$-th round, the cardinality $|T_j|$ indicates the number of repositories that succeeded in that round. The Success@k metric is computed as shown below, where $\binom{m}{k}$ denotes the binomial coefficient, representing the number of ways to choose $k$ rounds out of $m$ total experimental rounds, and the union captures all repositories that pass in at least one of the selected trials:

$$\mathrm{Success@}k = \frac{1}{\binom{m}{k} \times M} \sum_{1 \le j_1 < j_2 < \cdots < j_k \le m} \left| T_{j_1} \cup T_{j_2} \cup \cdots \cup T_{j_k} \right|$$
Build@k (Build Success Rate across Multiple Trials), similar to Success@k, indicates the proportion of repositories that successfully compile in at least one of the $k$ selected rounds. Let $B_j$ denote the set of repositories that were successfully built in the $j$-th trial; then

$$\mathrm{Build@}k = \frac{1}{\binom{m}{k} \times M} \sum_{1 \le j_1 < j_2 < \cdots < j_k \le m} \left| B_{j_1} \cup B_{j_2} \cup \cdots \cup B_{j_k} \right|$$
Average Pass Rate, also known as the average test case pass rate, measures the mean proportion of test cases passed across all repositories and all experimental rounds. Assuming that $M_{pt}(r, s)$ denotes the number of passed test cases for the $r$-th repository in the $s$-th round, and $M_{at}^{r}$ denotes the total number of test cases for the $r$-th repository, the Average Pass Rate is computed as follows:

$$\mathrm{Average\ Pass\ Rate} = \frac{1}{m} \sum_{s=1}^{m} \frac{1}{M} \sum_{r=1}^{M} \frac{M_{pt}(r, s)}{M_{at}^{r}}$$
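The following Python sketch implements the Success@k and Average Pass Rate definitions above for a small benchmark (the per-round results are hypothetical):

```python
from itertools import combinations
from math import comb

def success_at_k(pass_sets, M, k):
    """pass_sets[j]: set of repository ids passing all tests in round j."""
    m = len(pass_sets)
    total = sum(len(set().union(*(pass_sets[j] for j in rounds)))
                for rounds in combinations(range(m), k))
    return total / (comb(m, k) * M)

def average_pass_rate(passed, total_tests):
    """passed[s][r]: test cases passed by repository r in round s;
    total_tests[r]: total test cases of repository r."""
    m, M = len(passed), len(total_tests)
    return sum(passed[s][r] / total_tests[r]
               for s in range(m) for r in range(M)) / (m * M)

# Hypothetical benchmark: M = 4 repositories, m = 3 experimental rounds.
rounds = [{0, 2}, {0, 1}, {2}]          # repos passing all tests per round
print(success_at_k(rounds, M=4, k=2))   # proportion succeeding in >= 1 of 2 rounds

passed = [[5, 8, 10, 0], [6, 8, 9, 1], [5, 7, 10, 2]]
total_tests = [10, 8, 10, 4]
print(average_pass_rate(passed, total_tests))
```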
Another notable example is the Computational Accuracy (CA) metric, introduced in 2020 by Meta AI and adapted by other researchers [79,237], which is defined as follows:
$$\mathrm{CA} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{CA}(y_i, \hat{y}_i)$$

$$\mathrm{CA}(y_i, \hat{y}_i) = \begin{cases} 1 & \text{if } \mathrm{Exec}_i(y_i) = \mathrm{Exec}_i(\hat{y}_i) \\ 0 & \text{otherwise} \end{cases}$$

where $N$ denotes the total number of samples, while $\mathrm{Exec}_i(y_i)$ and $\mathrm{Exec}_i(\hat{y}_i)$ represent the execution results of the ground-truth sample $y_i$ and the model-predicted sample $\hat{y}_i$, respectively. Given $k$ model-generated candidates, CA@k then evaluates whether at least one of them produces the correct output for all test inputs.
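As a simplified, single-language illustration of CA-style checking (whereas CA was introduced for cross-language translation, where the two programs run in different runtimes), the sketch below executes a reference and a predicted Python program on the same input and compares their outputs:

```python
import subprocess
import sys

def exec_output(program: str, stdin_data: str) -> str:
    """Run a Python program in a subprocess and capture its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", program],
        input=stdin_data, capture_output=True, text=True, timeout=10,
    )
    return result.stdout.strip()

def computational_accuracy(pairs, test_input=""):
    """pairs: list of (reference_program, predicted_program) source strings."""
    hits = sum(
        exec_output(ref, test_input) == exec_output(pred, test_input)
        for ref, pred in pairs
    )
    return hits / len(pairs)

reference = "x = int(input()); print(x * x)"
prediction = "n = int(input()); print(n ** 2)"  # different code, same behavior
print(computational_accuracy([(reference, prediction)], test_input="7"))  # 1.0
```

A production-grade harness would additionally sandbox execution, iterate over multiple test inputs per sample, and handle compilation and runtime errors explicitly.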
In addition, the Test Case Average (TCA@k) [238] quantifies the average number of test cases passed by k generated code fixes per problem. DSR@k proposed in [239] and adapted in [240] defines the proportion of samples whose output is correct within k debugging steps. Repair@k [241] measures the percentage of tasks successfully fixed after k rounds of feedback-driven code repair. Related metrics like DSI and Compilation@k or Repairable Rate (RR) [240] also contribute to the evaluation of a model’s iterative fix and recovery abilities. Another related metric, though more common in retrieval-based tasks, is Recall@k [242] which measures the fraction of relevant or ground-truth-equivalent code snippets retrieved within the top-k candidates of the returned results.
The release of HumanEval, a benchmark dataset from the experimental research in [243], marks a shift toward execution-based evaluation. This approach assesses generated code by testing its functionality, addressing the limitations of match based, embedding based, and other metrics that are involved in evaluating model-generated code. Ongoing research continues to refine execution-based metrics, aiming for more reliable evaluation of AI-generated code [238].
While execution-based metrics provide a fine-grained assessment of code correctness, their practical implementation poses significant challenges. These metrics rely heavily on high-quality test cases that are compatible with each generated code sample. For example, the CA metric, originally introduced for evaluating code translation, may encounter limitations in more complex scenarios. In multilingual settings like the research work explored in [239], which involves translation between diverse PLs, applying CA becomes less feasible. The core issue lies in the difficulty of establishing a unified execution framework capable of supporting a broad range of languages and their associated libraries, runtime engines, and dependencies.
Therefore, the adoption and effectiveness of execution-based evaluation could be significantly improved if future research develops standardized and accessible unified frameworks. Such frameworks should provide consistent runtime environments across multiple PLs to facilitate reproducibility and enable more rigorous evaluation in programming tasks.

7. Neural Methods vs. Traditional Approaches in Programming Tasks

This section explores the key advantages of neural networks for programming tasks and examines the challenges of neural-based approaches. It also provides a comparative analysis of the neural methods and other existing methods.

7.1. Rule-Based Approach

In general, rule-based methods rely primarily on predefined “if–then” rules and follow established procedures to guide the knowledge-discovery process to accomplish certain tasks [244,245,246]. They are powerful for defining structured rules, making them useful as supplementary components in statistical and neural models. However, these approaches suffer from maintainability issues. The programming ecosystem frequently evolves due to business strategy changes and user feedback, requiring continuous updates. Rule-based methods struggle to adapt to these evolving requirements, making them inflexible for long-term scalability.

7.2. Statistical Approach

Several SMT algorithms can serve as benchmarks for adapting NLP-based techniques to programming tasks. SMT approaches generally outperform rule-based methods by incorporating syntactic and semantic representations. For instance, Nguyen et al. [215] utilize phrase-based SMT for program translation in three phases: (1) treating source code as syntactic sequences between source and target languages, (2) integrating semantic information for API migrations, and (3) mapping sememes (semantic units) between PLs. Similarly, other studies have explored SMT methods for generating pseudo-code from the source code of certain languages [53,196]. Such enhancements make SMT approaches superior to rule-based methods, which lack flexibility.
Despite their advantages, statistical models also face challenges. For example, the SLAMC model in [92] does not account for inheritance in object-oriented languages, leading to OOV issues. The mppSMT model in [215], which adopts a divide-and-conquer strategy for code migration, also struggles with dependency learning, API misalignment, and OOV problems. Furthermore, its applicability is primarily limited to Java and C#.

7.3. The Novelty of Neural Methods

Neural networks, inspired by the human brain, offer a novel and promising approach to programming tasks. These models excel in understanding the logic and structure of source code, which itself reflects human cognitive-based solutions. This synergy of analyzing human-created code with brain-inspired neural network algorithms creates an exciting research intersection [247].
Neural methods leverage large-scale neural networks, especially deep neural networks (DNNs) and LLMs, allowing models with billions of parameters to enhance code understanding and generation tasks. While optimal solutions remain an ongoing research challenge, the use of large pre-trained models trained on both NL and PL datasets has significantly improved performance. However, despite their advantages, neural models face several challenges, particularly in the availability of well-structured parallel benchmark datasets.
High-quality training data is essential for the effectiveness of any neural network, but constructing such datasets is resource intensive. Hence, we can say that neural-based approaches provide significant improvements over rule-based and statistical methods, yet overcoming high-quality parallel data limitations and other challenges remains a key area for future research. Table 8 summarizes the pros and cons of these three approaches.

8. Discussion and Future Work

In this section, we revisit the research questions that guided our study and highlight potential directions for future work.

8.1. Answers to the Research Questions

8.1.1. RQ1: How Do Neural Approaches Compare to Rule-Based and Statistical Methods in the Context of Programming Tasks?

Neural methods represent the most recent paradigm in applying machine learning to programming tasks, such as source code translation, by leveraging DNNs and large-scale pre-training. In contrast, rule-based approaches rely on manually crafted transformation rules (often operating over ASTs), and statistical methods adapt classical SMT techniques to code (treating programs as literal text sequences).
Neural methods mark a significant leap forward compared to rule-based and statistical approaches for programming tasks, providing greater adaptability, improved semantic accuracy, and scalable performance on large datasets. However, they demand extensive computational resources and robust parallel or monolingual corpora, and they face ongoing challenges in evaluation and interpretability. Rule-based and statistical techniques are still relevant in resource-constrained or safety-critical settings, but for most modern software engineering needs, where languages, libraries, and idiomatic styles rapidly evolve, neural approaches have emerged as the state of the art. As the field progresses, hybrid solutions that integrate symbolic constraints into neural pipelines, alongside richer benchmarks and evaluation frameworks, will likely narrow the remaining gaps between the quality of automated programming and human expert performance.

8.1.2. RQ2: What Is the Current Landscape of Datasets and Benchmarks for Neural Methods in Programming Tasks, and What Are the Critical Gaps?

The landscape of datasets and benchmarks for the application of neural methods to programming tasks has expanded dramatically over the past decade, evolving from small, narrowly scoped corpora to large-scale, multi-task benchmarks. Early efforts (around 2015–2017) focused primarily on mapping NL to code in a single PL, most notably Python, while more recent initiatives (2020–2025) have sought to cover a broader array of tasks (e.g., vulnerability detection, code translation, and code summarization) and multiple PLs (including Java, C/C++, Go, Rust, and other multilingual settings). Despite this progress, several critical gaps remain: most datasets still suffer from limited scope (both in terms of size and language diversity), inconsistent preprocessing and documentation, noisy annotations (particularly in NL descriptions or task-specific labels), and a lack of standardized evaluation pipelines.
The field of programming has benefited from a proliferation of datasets and benchmarks spanning several software engineering and programming tasks. Early released code corpora (e.g., Django, PCSD, and CoNaLa) laid the groundwork but often suffered from limited scale and scope. Subsequent benchmarks (e.g., TransCoder, CodeXGLUE, CodeNet, and XLCoST) addressed multi-language and multi-task needs but sometimes at the cost of noisy annotations or inconsistent data format across similar tasks. Most recently, specialized datasets such as [209,242,248,249] have pushed toward class-level generation and real-world repository-level code intelligence tasks along with execution-based evaluation practice. However, critical gaps, such as the lack of standardized preprocessing, detailed documentation, under-representation of many PLs and domains, insufficient benchmarks for multi-file settings, interactivity, and explainability in program repair, as well as real-world performance, still persist. Addressing these gaps will require community-wide efforts to create unified dataset schemas, enrich annotation quality, broaden coverage, and develop richer evaluation methodologies, ultimately enabling more robust, reproducible, and impactful research in the application of neural methods for programming tasks.

8.1.3. RQ3: Which Evaluation Metrics Best Capture Model Performance on Code, Both Syntactically and Functionally, and Where Do Standard NLP Metrics Fall Short?

Evaluating code generation and code understanding models demands metrics that go beyond surface-level text similarity. While traditional NLP metrics (e.g., BLEU, ROUGE, and METEOR) provide a convenient baseline, especially during early stages of model development, their reliance on n-gram overlap prevents them from fully capturing either the syntactic correctness or the functional (execution) behavior of generated code. Even enhanced variants like CodeBLEU and CrystalBLEU, which incorporate AST and data-flow matching, cannot guarantee that generated snippets compile or pass runtime tests. As a result, they serve as convenient proxies during early model inference trials but offer only limited insight into true code correctness.
The most reliable way to assess code model performance combines syntax checks (e.g., Compilation@k and Build@k) with execution-based metrics such as Pass@k and CA. Compilation@k ensures that generated programs meet language-grammar requirements, while Pass@k and CA verify that at least one of the top-k candidates successfully compiles and produces correct outputs against a comprehensive test suite. Metrics like Success@k, Average Pass Rate, and TCA@k further enrich this evaluation by measuring test pass rates over multiple trials or averaging partial successes, offering a fine-grained view of both build success and functional correctness of output results.
Standard NLP metrics fall short because they neither enforce compilability nor check whether code handles edge cases or real inputs correctly. In other words, they treat code as “flat text”, ignoring compilation requirements and runtime behavior. Embedding-based scores (e.g., CodeBERTScore and Cosine Similarity) improve semantic alignment but still treat code as text and cannot substitute for actual execution checks. Consequently, a robust evaluation framework for code requires prioritizing execution-based measures, backed by reliable test harnesses and multi-language runtime environments, while using text-based and embedding-based metrics only as supplementary signals during early development.

8.1.4. RQ4: What Roles Do LLMs (e.g., GPT-4, LLaMA, and Claude) Play in Programming Tasks?

LLMs originally developed for NLP have been rapidly repurposed to address a wide spectrum of programming tasks, fundamentally altering how code is implemented, reviewed, and maintained. By leveraging the vast corpora of open-source code and documentation, models such as GPT-4, LLaMA, and Claude have demonstrated an ability to generate syntactically correct and often semantically meaningful code snippets from high-level specifications. These models are now routinely employed in interactive, multi-turn dialogues where developers describe functionality or constraints in natural language, and the LLM iteratively refines code to meet the functional objectives of the developer-defined problem. LLMs can also generate test cases and documentation for an entire program. In this capacity, LLMs serve not only as autocompletion engines but also as dynamic programming partners that adapt to evolving requirements and project contexts.
Beyond standalone code generation, LLMs have been integrated into specialized workflows that mirror traditional software-development life-cycle phases [250]. For instance, self-collaborative frameworks distribute subtasks among multiple LLM agents, each responsible for activities such as requirements analysis, architectural design, unit-test creation, and bug-fix proposals; intermediate artifacts generated by one agent become prompts for the next, producing an end-to-end automated pipeline. In parallel, security-oriented studies reveal that LLM outputs can introduce or exacerbate vulnerabilities, prompting the development of hybrid pipelines that combine LLM-driven patch generation with automated test harnesses or static analyzers to validate correctness and security [251]. Furthermore, code-specific LLMs trained and/or fine-tuned on large PL datasets underlie tools such as GitHub Copilot [252], Cursor, Windsurf, Replit, and other programming-specialized AI chatbots; these tools employ LLM-powered AI agents, enabling language-agnostic autocompletion, cross-language translation, and even real-time code-review suggestions within integrated development environments.
While LLM-powered development workflows significantly reduce manual implementations, automate boilerplate coding, accelerate test case synthesis, and facilitate rapid prototyping, they also introduce new challenges related to model hallucinations, and hence prompt a need for rigorous validation. Hallucinated library calls, misaligned data-flow assumptions, or omitted edge-case handling can produce code that compiles but behaves incorrectly or insecurely. The computational footprint of large models further constrains on-premises adoption. Despite these limitations, the integration of LLMs into code generation, completion, self-collaborative workflows, and security validation represents a transformative shift in programming, enabling developers to redirect their expertise toward higher-order design and architectural challenges while entrusting routine, syntax-oriented tasks to sophisticated generative models.

8.1.5. RQ5: What Are the Main Bottlenecks in Scaling and Deploying Neural-Based Programming Solutions to Real-World Codebases, and How Can They Be Addressed?

Neural-based programming solutions face several interrelated bottlenecks that hinder their scalability and deployment in real-world codebases. First, the scarcity of high-quality parallel corpora [253] across diverse PLs remains a critical obstacle. The majority of existing datasets are narrowly focused, covering only a handful of modern PLs and limited patterns such as NL query–code [254,255] or code–comment pairs, while legacy languages (e.g., COBOL and FORTRAN) and domain-specific languages (e.g., hardware description languages and smart contracts [256]) remain under-represented. Moreover, individual dataset pipelines lack standardization: tokenization strategies, comment-handling heuristics, and metadata schemas are often undocumented, hindering reproducibility and fair model comparison. Even when aligned corpora exist (e.g., function-level translations between Java and C#), most are limited to a few widely studied language pairs, leaving many language combinations with insufficient supervised data. Additionally, noisy or incomplete annotations, such as shallow doc-strings or heuristically derived vulnerability labels, further degrade model performance, as downstream fine-tuning depends on precise semantic alignments.
Second, the computational and evaluation challenges of neural approaches pose significant barriers to real-world adoption. Transformer-based models trained on massive code corpora require extensive GPU/TPU resources over prolonged periods, imposing both financial and environmental costs that are prohibitive for many academic and industrial settings. Even the inference phase may require an optimized infrastructure stack to deliver low-latency, high-throughput code-generation or program-understanding services. From an evaluation standpoint, conventional NLP metrics (e.g., BLEU, METEOR, and ROUGE) fail to capture code-specific syntactic and semantic correctness: two semantically equivalent implementations may receive drastically different n-gram scores, while superficially similar, functionally incorrect code can be rewarded [219]. Although specialized metrics such as CodeBLEU [96] have integrated AST-based comparisons, they do not reflect the runtime behavior of the evaluated program. Dynamic evaluation measures such as Pass@k, DSR@k, and CA incorporate test-based validation [79,257]. However, comprehensive, end-to-end evaluation pipelines remain scarce, particularly for multi-file projects, which demand integration tests, readability and style checks, and assessments of each programming language's coding idioms. Additionally, the black-box nature of neural models poses significant challenges for interpretability and debugging. Understanding why a model omits a critical null check or mismatches API calls is far more complex than tracing crashes in traditional programs. Therefore, greater emphasis must be placed on human-in-the-loop evaluation strategies [258,259].
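For reference, the Pass@k family of dynamic metrics is typically computed with an unbiased combinatorial estimator over n sampled candidates per problem, of which c pass all tests. The short Python sketch below illustrates the standard calculation; the benchmark harness that produces the (n, c) counts is omitted and assumed to exist.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased Pass@k estimate for a single problem: n generated candidates,
        c of which pass all tests, evaluated at a sampling budget of k (k <= n)."""
        if n - c < k:  # every size-k subset contains at least one passing candidate
            return 1.0
        # 1 - probability that a random size-k subset contains no passing candidate
        return 1.0 - comb(n - c, k) / comb(n, k)

    def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
        """Average Pass@k over a benchmark; results holds one (n, c) pair per problem."""
        return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

    # Example: three problems with 20 samples each and 5, 0, and 2 passing candidates.
    print(mean_pass_at_k([(20, 5), (20, 0), (20, 2)], k=1))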
To overcome these bottlenecks, a multifaceted strategy is needed. First, efforts should focus on building standardized, end-to-end dataset pipelines that span both modern and legacy PLs, provide rich metadata (e.g., ASTs, DFGs, doc-strings, test cases, and error logs), and automate reproducible preprocessing steps (e.g., tokenization, de-duplication, and splitting). Semi-supervised and synthetic data generation techniques, leveraging unsupervised back-translation, template-based augmentation, and cross-lingual transfer, can help mitigate parallel data scarcity. Second, efficiency-focused model architectures (e.g., distillation, quantization, and parameter-efficient tuning) combined with mixed-precision training can reduce the compute footprint, thereby democratizing access to large-parameter models for resource-constrained researchers. Third, robust evaluation frameworks should integrate both static (e.g., Compilation@k) and dynamic (e.g., Pass@k and CA) metrics, alongside novel benchmarks for multi-file, long-form code generation, idiomatic constraints, and security vulnerability severity [234]. Finally, enhancing model interpretability through attention visualization, rationale generation, and the integration of symbolically grounded validators or test-based debugging loops will foster greater trust and facilitate the deployment of neural-based solutions in production codebases.
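To make the template-based augmentation idea concrete, the following minimal Python sketch creates additional, semantically identical training samples by renaming function arguments and local variables; the helper names and the renaming scheme are illustrative assumptions rather than an established pipeline.

    import ast

    class _Renamer(ast.NodeTransformer):
        """Renames function arguments and locally assigned variables according to
        a precomputed mapping, yielding a semantically equivalent variant."""
        def __init__(self, mapping: dict[str, str]):
            self.mapping = mapping

        def visit_Name(self, node: ast.Name) -> ast.AST:
            if node.id in self.mapping:
                node.id = self.mapping[node.id]
            return node

        def visit_arg(self, node: ast.arg) -> ast.AST:
            if node.arg in self.mapping:
                node.arg = self.mapping[node.arg]
            return node

    def augment_by_renaming(source: str) -> str:
        """Returns a variant of the snippet with argument/variable names replaced."""
        tree = ast.parse(source)
        names = sorted({n.arg for n in ast.walk(tree) if isinstance(n, ast.arg)} |
                       {n.id for n in ast.walk(tree)
                        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)})
        mapping = {name: f"var_{i}" for i, name in enumerate(names)}
        return ast.unparse(_Renamer(mapping).visit(tree))

    print(augment_by_renaming("def add(total, step):\n    result = total + step\n    return result"))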

8.1.6. RQ6: How Have Neural Methods for Programming Evolved, and Which Model- and System-Level Advances Have Driven This Progression?

The use of neural methods in programming tasks has progressed rapidly. Examining the adoption of automatic tools in programming and software-development methodologies, we observe a transition from rule-based to statistical methods and, ultimately, to neural-based approaches. This shift has significantly improved quality, performance, adaptability, and novelty. Neural-based methods, in particular, have experienced rapid growth, with many models achieving SOTA performance. As shown in Figure 6, in the earliest stage (1943–late 1980s), neural methods in programming were largely theoretical: the perceptron, basic neuron models, and backpropagation laid the mathematical foundation. The contribution of that period was purely conceptual, advancing the understanding of learning algorithms rather than delivering tangible gains in code analysis or generation.
During the shallow neural network era (1990s–mid-2000s), multilayer perceptrons and early convolutional neural networks began to show incremental improvements over statistical baselines in tasks such as code classification, effort estimation, and defect prediction. From the mid-2000s until around 2016, deep learning models, such as Word2Vec, RNNs, LSTMs, GRUs, and other Seq2Seq models augmented with attention, marked a further advance in neural sequence modeling. This gave rise to experimental tools for comment generation, code-completion suggestion, code summarization, and automated patch recommendation. Although these systems captured syntactic patterns, they lacked semantic awareness: variable scopes, control flows, and data dependencies across larger contexts remained difficult to represent, limiting their reliability for complex codebases.
Between 2017 and 2021, two parallel streams of innovation converged: graph-based models that embedded ASTs and CFGs into graph neural networks (e.g., GGNN and RGCN), and the Transformer architecture, which revolutionized sequence modeling by enabling parallel processing of long sequences with long-range dependencies and marked the shift toward pre-training neural models on massive code corpora (e.g., CodeBERT and CodeT5). The former produced richer, semantically grounded embeddings for tasks such as bug localization and code summarization; the latter endowed foundation models with deep contextual understanding and transfer learning capabilities. Systems integrating these models began to automate tasks such as code search, code translation, and program repair with far greater precision and generality than previous systems.
Since around 2022, large-scale LLMs and Code LLMs (e.g., Codex, CodeGen, CodeLLaMA, and CodeT5+) have advanced the use of neural methods into interactive AI pair programming and AI-powered software engineering workflows. Embedded in IDE code assistants, such as GitHub Copilot, Visual Studio IntelliCode, and CodeWhisperer, these models provide real-time, context-aware code completions, automated reviews, and patch suggestions. Crucially, modern neural-based systems often validate model outputs by running tests or static analyzers before presenting results, ensuring higher correctness. Emerging multi-agent systems further point toward the autonomous orchestration of programming tasks, from requirement parsing to the development of functionally correct program code, transforming neural models from simple code assistants into intelligent co-developers. Hence, the advancement of neural methods in programming and SE constitutes a rapid, ongoing transformation that continues to reshape efficiency and effectiveness across a wide range of tasks, models, and the programming ecosystem as a whole.

8.2. Future Work and Open Issues

Although neural methods, particularly recent variants such as LLMs, have significantly advanced day-to-day programming tasks, several research challenges remain. We categorize these challenges into five thematic areas, providing concise discussions while preserving the full scope of open research issues investigated during our review.

8.2.1. Evaluation Metric Issues

Code Semantics and Runtime-Behavior-Aware Metrics
Existing NLP metrics (e.g., BLEU and ROUGE) often fail to capture functional equivalence. Syntactic measures such as CodeBLEU, embedding-based metrics, and AST-based distances serve as preliminary filters but require supplementation with dynamic, execution-driven evaluation. Future work should implement end-to-end pipelines that first ensure Compilation@k and then validate runtime behavior via extensive test suites with Pass@k, CA, or DSR@k.
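A minimal sketch of such a two-stage pipeline for Python candidates is shown below: a static compilation gate is applied first, and only compilable candidates are executed against their test suite in a separate process. The file layout and time-out value are illustrative assumptions.

    import subprocess, sys, tempfile, textwrap
    from pathlib import Path

    def compiles(candidate: str) -> bool:
        """Static gate: does the generated snippet at least compile?"""
        try:
            compile(candidate, "<candidate>", "exec")
            return True
        except SyntaxError:
            return False

    def passes_tests(candidate: str, test_code: str, timeout_s: int = 10) -> bool:
        """Dynamic gate: run the candidate together with its unit tests in a
        fresh interpreter process and report whether all assertions pass."""
        with tempfile.TemporaryDirectory() as tmp:
            path = Path(tmp) / "candidate_with_tests.py"
            path.write_text(candidate + "\n\n" + test_code)
            try:
                proc = subprocess.run([sys.executable, str(path)],
                                      capture_output=True, timeout=timeout_s)
            except subprocess.TimeoutExpired:
                return False
            return proc.returncode == 0

    candidate = "def square(x):\n    return x * x\n"
    tests = textwrap.dedent("""
        assert square(3) == 9
        assert square(-2) == 4
    """)
    if compiles(candidate) and passes_tests(candidate, tests):
        print("candidate counts toward both Compilation@k and Pass@k")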
Multi-Objective Scoring
Code quality extends beyond correctness to factors such as readability, maintainability, robustness, and performance. However, no standard metrics exist for code style, maintainability, or security severity. Benchmarks should integrate the following (a minimal aggregation sketch follows the list):
  • Readability/Lint Scores: Combine linters (e.g., PEP8 and Checkstyle) and complexity analyzers into aggregate scores.
  • Security-severity Evaluations: Quantify vulnerability severity (e.g., exploitability and levels of severity) rather than binary detection.
  • Functional Performance Tests: Include runtime and memory benchmarks, especially for large or multi-file projects, to measure efficiency trade-offs.
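As a rough illustration of how such heterogeneous signals could be folded into a single benchmark score, the sketch below aggregates already-computed, normalized sub-scores with explicit weights. The sub-score names and default weights are illustrative assumptions, not an established standard.

    from dataclasses import dataclass

    @dataclass
    class QualityScores:
        """Per-sample sub-scores, each normalized to the range [0, 1]."""
        functional: float    # fraction of tests passed
        readability: float   # e.g., normalized linter/complexity score
        security: float      # e.g., 1 - normalized vulnerability severity
        performance: float   # e.g., normalized runtime/memory score

    def aggregate(scores: QualityScores,
                  weights: dict[str, float] | None = None) -> float:
        """Weighted aggregate of the sub-scores; weights should sum to 1."""
        w = weights or {"functional": 0.5, "readability": 0.2,
                        "security": 0.2, "performance": 0.1}
        return (w["functional"] * scores.functional +
                w["readability"] * scores.readability +
                w["security"] * scores.security +
                w["performance"] * scores.performance)

    print(aggregate(QualityScores(functional=1.0, readability=0.8,
                                  security=0.9, performance=0.7)))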
Interactive and Explainable Evaluation
As interactive code synthesis grows, benchmarks should simulate iterative feedback loops by tracking metrics such as the number of iterations to achieve 100% test pass or the time to convergence. Additionally, when models produce natural-language explanations, new evaluation solutions are required to score logical consistency alongside functional correctness, potentially incorporating human-in-the-loop judgments. Moreover, the rise of LLM-powered AI agents underscores the urgent need for standardized, interactive, real-time evaluation frameworks, an area that remains largely unexplored in the AI research community.

8.2.2. Data Pipelines and Code Representation

Unified Dataset Schema and Code Representation
There is no consensus on a unified representation for code samples (e.g., raw text, raw code, AST, CFG, and IR) and their associated metadata (e.g., NL descriptions, test suites, and error logs). Future datasets should adopt a fine-grained schema for source code (a minimal schema sketch follows the list), specifying the following:
  • Tokenization conventions (e.g., token granularity and comment stripping).
  • Possible combinations of code representation formats (e.g., ASTs, CFGs, and DFGs).
  • Normalization of NL texts aligned with code (e.g., wording, structure, and phrasing).
  • Test-suite inclusion [260], covering unit tests and integration tests.
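A minimal sketch of such a schema, expressed as a Python dataclass with illustrative (non-standard) field names, is shown below.

    from dataclasses import dataclass, field

    @dataclass
    class CodeSample:
        """One entry of a hypothetical unified code-dataset schema; the field
        names are illustrative and do not imply an agreed standard."""
        sample_id: str
        language: str                        # e.g., "python", "java", "cobol"
        source_code: str                     # raw code, normalized per documented rules
        tokens: list[str] = field(default_factory=list)      # tokenizer output
        ast_json: str | None = None          # serialized AST, if extracted
        cfg_json: str | None = None          # serialized CFG/DFG, if extracted
        nl_description: str | None = None    # aligned doc-string or NL intent
        test_suite: list[str] = field(default_factory=list)  # unit/integration tests
        error_logs: list[str] = field(default_factory=list)  # build or runtime logs
        license: str | None = None           # provenance and redistribution terms
        preprocessing: dict[str, str] = field(default_factory=dict)  # tool versions, splits

    sample = CodeSample(sample_id="ex-001", language="python",
                        source_code="def add(a, b):\n    return a + b",
                        nl_description="Return the sum of two numbers.")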
Reproducible, End-to-End Pipelines
Many published corpora omit detailed preprocessing steps, hindering reproducibility. Future efforts should document all transformations (e.g., code normalization, dataset splits, data formats, filtering rules, and the tools used) to allow exact replication of baselines.
Expanding Language and Domain Coverage
Most corpora focus on Python, Java, JavaScript, C/C++, C#, and Go. Under-represented languages (e.g., Rust, Kotlin, Swift, and CUDA) lack large, high-quality parallel datasets. Future research should perform the following:
  • Mine and Curate Cross-Lingual/Domain Corpora: Use automatic alignment (e.g., class-level and project-level) to build parallel datasets for less-common pairs (e.g., Solidity−Java).
  • Incorporate Non-English NL Data: Collect doc-strings and comments in Arabic, Japanese, Korean, etc., to enable truly multilingual AI-powered programming agents (e.g., Korean−Java).
  • Gather Domain-Specific Corpora: Develop benchmarks for Infrastructure-as-Code (e.g., Terraform), hardware design (e.g., Verilog), embedded systems, and financial smart contracts (e.g., Solidity), reflecting their unique syntax and semantics.

8.2.3. Hybrid Neural Methods with Symbolic Architectures

Neuro-Symbolic Integration for Semantic Guarantees
Purely neural code generators risk hallucinations, such as improper handling of errors during the reasoning process. Hybrid pipelines can interleave the following (see the sketch after this list):
  • Neuro-symbolic Validators: Post-generation static analysis (e.g., linters, model checkers, formal program verifiers) to flag or correct errors.
  • Runtime Test Feedback: Compile and run generated code immediately. On failure, invoke a repair module or re-prompt the model.
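The sketch below illustrates how such a pipeline could interleave generation, static checking, and test execution for Python output; the generate_code callable stands in for any LLM backend and is a placeholder assumption.

    import subprocess, sys, tempfile
    from pathlib import Path
    from typing import Callable

    def static_check(code: str) -> str | None:
        """Returns an error message if the snippet does not even compile, else None."""
        try:
            compile(code, "<generated>", "exec")
            return None
        except SyntaxError as e:
            return f"SyntaxError: {e}"

    def run_tests(code: str, tests: str) -> str | None:
        """Runs the candidate with its tests; returns captured stderr on failure."""
        with tempfile.TemporaryDirectory() as tmp:
            path = Path(tmp) / "candidate.py"
            path.write_text(code + "\n\n" + tests)
            proc = subprocess.run([sys.executable, str(path)],
                                  capture_output=True, text=True)
            return None if proc.returncode == 0 else proc.stderr

    def generate_with_validation(task: str, tests: str,
                                 generate_code: Callable[[str], str],
                                 max_rounds: int = 3) -> str | None:
        """Generate -> validate -> re-prompt loop; validator feedback is fed back."""
        prompt = task
        for _ in range(max_rounds):
            code = generate_code(prompt)
            error = static_check(code) or run_tests(code, tests)
            if error is None:
                return code       # validated candidate
            prompt = f"{task}\n# Previous attempt failed with:\n# {error}\n# Please fix it."
        return None               # no validated candidate within the repair budget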
Symbolic Augmentation for Certain Tasks
For vulnerability detection, integrating taint analysis, symbolic execution, or fuzzing can lower false positives and improve generalization. Program repair systems benefit from neural-based patch generation combined with symbolic equivalence checking.

8.2.4. Resource Efficiency and Deployment Constraints

Model Compression and Distillation
Larger LLMs achieve state-of-the-art results but are impractical for on-device or resource-constrained settings. Future work should explore the following (a parameter-efficient fine-tuning sketch follows the list):
  • Knowledge Distillation: Train compact student models to approximate the behavior of larger teacher models.
  • Parameter-Efficient Fine-Tuning (PEFT): Apply low-rank updates and quantization to adapt backbone weights, reducing storage and compute overhead [199].
  • Edge-Optimized Libraries: Develop inference libraries for mobile or embedded hardware that balance latency and accuracy when running neural models on edge devices.
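As one concrete illustration of parameter-efficient fine-tuning, the sketch below attaches LoRA adapters to a causal code LLM using the Hugging Face transformers and peft libraries; the checkpoint name, target module names, and hyper-parameters are placeholders that depend on the chosen model.

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base_model_name = "some-org/small-code-llm"   # placeholder checkpoint name
    model = AutoModelForCausalLM.from_pretrained(base_model_name)

    # Attach low-rank adapters to the attention projections; only these small
    # matrices are trained while the backbone weights remain frozen.
    lora_config = LoraConfig(
        r=8,                                  # adapter rank
        lora_alpha=16,                        # scaling factor
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # module names depend on the architecture
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()        # typically well under 1% of all parameters
    # ...continue with a standard fine-tuning loop or the transformers Trainer...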

8.2.5. Task-Specific Directions

Code Translation and Decompilation
Translation and decompilation should preserve the semantic meaning of code segments in end-to-end migration scenarios across source and target languages. Beyond surface metrics (e.g., BLEU and CodeBLEU), evaluation must incorporate semantic and runtime-behavior comparisons, type-checker coverage, and execution outcomes (a minimal differential-execution sketch follows the list below). Architecturally, future systems should do the following:
  • Use specialized encoders that align parallel corpora at the function, class, or repository level.
  • Employ decoders that enforce the preservation of target-language idioms (e.g., type inference and library-call mapping).
  • For neural decompilers, leverage models that can comprehend control-flow and data-flow structures extracted from binaries, supporting multiple architectures (e.g., x86, ARM, and RISC-V) and handling obfuscation (e.g., packing and control-flow flattening).
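The differential-execution idea mentioned above can be sketched as follows: the original and translated programs are run on identical inputs and their outputs compared. The commands and file names in the usage comment are placeholders.

    import subprocess

    def same_behavior(source_cmd: list[str], translated_cmd: list[str],
                      test_inputs: list[str], timeout_s: int = 10) -> bool:
        """Differential execution: feed identical stdin to the original and the
        translated program and compare exit codes and stdout."""
        for stdin_data in test_inputs:
            runs = [subprocess.run(cmd, input=stdin_data, capture_output=True,
                                   text=True, timeout=timeout_s)
                    for cmd in (source_cmd, translated_cmd)]
            if (runs[0].returncode, runs[0].stdout) != (runs[1].returncode, runs[1].stdout):
                return False
        return True

    # Example usage (commands and file names are placeholders):
    # ok = same_behavior(["java", "Solution"], ["python3", "solution.py"],
    #                    test_inputs=["1 2\n", "5 7\n"])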
Large-Scale Generation, Synthesis, and Automated Program Repair
Current benchmarks focus on single functions, but real-world development often spans entire projects. Future datasets should supply the following:
  • Complete Project Skeletons: Minimal web applications with front-end, back-end, database scripts, and Continuous Integration and Continuous Deployment (CI/CD) configurations to validate multi-file systems.
  • Large, stable program-repair datasets: Curate commit pairs annotated by intent (e.g., bug fix, refactoring, performance optimization, and security patch) and integrate generate→test→patch loops where proposed edits are applied and validated against test suites. Coupling LLM outputs with program verification tools and multi-round repair logic will enable end-to-end automated program repair frameworks that compile and pass correctness checks.
Code Search, Retrieval, and Clone Detection
Future code search systems should ensure that retrieved snippets compile and pass basic program tests by combining static analysis with neural-based techniques, rather than relying on syntax alone. Clone detection should advance beyond same-language heuristics: models should embed joint ASTs, data-flow, and other representations to detect cross-language clones (e.g., C++−Rust). Crucially, systems should differentiate syntactic clones (boilerplate) from intent-based similar programs (semantically equivalent algorithms implemented differently), using semantic-level features (e.g., control-flow and data-flow equivalences) to prioritize intent.
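As a minimal sketch of embedding-based cross-language clone ranking, the snippet below scores candidate snippets against a query by cosine similarity; the embeddings are assumed to come from a structure-aware encoder (e.g., one trained on joint AST/data-flow representations), which is not shown here.

    import math

    def cosine(a: list[float], b: list[float]) -> float:
        """Cosine similarity between two embedding vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def rank_clone_candidates(query_vec: list[float],
                              candidates: dict[str, list[float]],
                              threshold: float = 0.8) -> list[tuple[str, float]]:
        """Rank already-embedded candidate snippets (possibly from another
        language) by similarity to the query embedding."""
        scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
        return sorted([s for s in scored if s[1] >= threshold],
                      key=lambda s: s[1], reverse=True)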
Comment Generation, Code Summarization, and Authorship Attribution
To improve code readability, comment-generation models should support multiple languages: build parallel corpora of code with comments in languages other than English and evaluate multilingual consistency. Code summarization should be semantics-aware, and metrics should verify that generated summaries capture control flow, side effects, and algorithmic complexity by comparing against gold-standard semantic annotations (e.g., CFGs annotated with NL intents). Code authorship and attribution tools should be able to handle code from mixed sources, including human-written code, IDE templates, and AI assistant-generated code. Techniques fusing stylistic features (e.g., code idioms, n-grams, and AST-based style features) with provenance metadata (e.g., commit logs and file metadata) can attribute authorship across PLs.
Code Completion and Privacy
AI-assisted code completion systems should mitigate security and privacy risks. Repeated prompt templates may produce near-identical code segments, propagating flaws. Future systems should introduce controlled stochasticity or template diversification while preserving syntactic and semantic correctness. Training or fine-tuning on proprietary and personal codebases requires confidentiality guarantees; differentially private training algorithms or federated learning frameworks—where model updates are aggregated without exposing raw code elements—are largely unexplored in the programming research domain and should be prioritized to prevent leakage of sensitive source code parts.
By addressing the above-mentioned open issues, such as richer evaluation methods, unified data pipelines, neuro-symbolic integration, high-quality multilingual corpora, and resource-efficient architectures, the research community can advance toward reliable, secure, and accessible AI-driven software development workflows.

8.3. Trends, Risks, and the Road Ahead

While neural-based methods have rapidly advanced automation across many fields, including programming, the field faces subtle risks of trend-driven adoption without rigorous architectural validation and reproducible evaluation. A major challenge lies in the acceleration of model deployment cycles, where experimental prototypes are prematurely promoted to production systems without a robust assessment of long-term reliability, ethical safety, or software quality impacts. This issue is especially critical in programming and software engineering, where neural models directly influence codebases that underpin large-scale industrial and financial infrastructures.
From a statistical perspective, our bibliometric and temporal analyses (Figure 3 and Figure 4) reveal a clear exponential rise in Transformer-based neural models from 2020 onward, with decoder-only architectures (e.g., GPT, LLaMA, and CodeGemma) dominating recent publications. Meanwhile, encoder–decoder and encoder-only designs, which have historically been effective for certain critical tasks such as semantic understanding and code translation, are showing a relative decline in frequency, signaling a paradigm shift that may inadvertently deprioritize tasks better suited to their structures. Our collected publications further indicate that over 62% of the cited works published in 2024–2025 rely on decoder-only architectures, underscoring this evolving imbalance.
Emerging research in AI agents and multi-agent systems, integrated with neural code models, illustrates another forward trajectory. However, such agents often lack a defined alignment with the Software Development Life Cycle (SDLC), leading to ad hoc experimentation rather than systematic engineering integration. The next decade is likely to witness “coding agents” capable of generating enterprise-grade applications through high-level natural language instructions. Yet, realizing this vision safely requires verifiable performance measurement frameworks and ethically aware validation procedures that ensure trust, reproducibility, and resilience in code-generation and AI-augmented system-development practices.
In summary, sustainable progress in neural methods for programming depends on bridging three gaps: (1) establishing standard evaluation pipelines; (2) balancing architectural diversity across model families; and (3) embedding AI-agent development within the established SDLC frameworks. Addressing these will transform speculative enthusiasm into scientifically grounded and socially responsible advancement.

9. Conclusions

In synthesizing the reviewed literature, three major conclusions emerge. First, Transformer-based neural architectures, particularly encoder–decoder models such as CodeT5 and decoder-only LLMs such as GPT, Gemini, CodeLLaMA, and CodeGemma, represent the most promising and versatile paradigms for programming-related tasks. Their capacity to learn unified representations across natural and programming languages has enabled remarkable progress in code generation, code summarization, code translation, and other SE tasks. Second, despite these advances, several persistent challenges remain unresolved: the absence of semantic and functionality-aware evaluation of generated code, the limited integration of graph-based or structure-aware models with Transformers, and the scarcity of multilingual and multi-paradigm datasets that impede robust cross-language generalization. Third, as outlined in Section 8, future progress will depend on unifying architectural innovation with standardized evaluation frameworks and ethically guided deployment. Bridging these dimensions by coupling large-scale models with interpretable and verifiable code representations marks a concrete path toward reliable, safe, and industrially adoptable neural methods for programming tasks.
This survey systematically characterizes the landscape of neural methods in programming and SE, categorizing tasks through an extensive analysis of over 250 scientific papers. It critically examines the objectives, contributions, and limitations of previous surveys while offering a comprehensive comparison of neural methods across natural language and programming language tasks. Furthermore, it contrasts neural approaches with rule-based and statistical paradigms, providing a clear historical and technical account of the field's evolution. The paper also offers an exhaustive synthesis of datasets and evaluation metrics, organizing the performance evaluation metrics into five taxonomies accompanied by their mathematical formulations and benchmark analyses.
The overall findings indicate that recent advances in LLMs and Code LLMs are fundamentally transforming programming practices. This transformation is amplified by the emergence of multi-agent and context-aware systems that embed LLMs into the broader SE workflow. While the survey encompasses a broad spectrum of representative studies, some recently published breakthroughs are excluded for scope balance, particularly those overlapping with general NLP research. Although no empirical experiments were conducted, the conceptual analyses presented here lay a foundation for evidence-based future studies. Empirical investigations into the limitations of evaluation metrics such as BLEU, CodeBLEU, EM, and ROUGE would further substantiate our claims regarding their insufficiency in capturing semantic correctness.
Persistent challenges remain across most neural methods applied to programming and SE. These include the lack of standard, multi-modal code representations, the limited availability of high-quality parallel datasets, and the non-standardized use of evaluation metrics. Many models continue to rely on surface-level metrics that fail to capture semantic and functional fidelity. Addressing these limitations through improved representation learning, unified evaluation frameworks, and robust multilingual datasets will be pivotal for advancing the reliability, interpretability, and real-world applicability of neural methods for programming.

Author Contributions

Conceptualization, G.G.M. and H.I.; methodology: G.G.M., S.J., and H.I.; investigation: G.G.M.; data curation: G.G.M. and S.L.; writing—original draft preparation: G.G.M. and S.J.; writing—review and editing: G.G.M., S.L., S.J., S.-K.K., and H.I.; visualization: G.G.M. and S.L.; supervision: S.-K.K. and H.I.; project administration: H.I.; funding acquisition: H.I.; All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2023-00208094) and by Basic Science Research Program through the NRF funded by the Ministry of Education (RS-2024-00463967 and No. 25411243). This research was also supported by the Regional Innovation System & Education (RISE) program through the Gangwon RISE Center, funded by the Ministry of Education (MOE) and the Gangwon State (G.S.), Republic of Korea (2025-RISE-10-002).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used throughout this manuscript:
AI: Artificial Intelligence
ANNS: Approximate Nearest Neighbor Search
API: Application Programming Interface
APR: Automatic Program Repair
APPS: Automated Programming Progress Standard
AST: Abstract Syntax Tree
AUC: Area Under the ROC Curve
AVDHRAM: Automated Vulnerability Detection framework based on Hierarchical Representation and Attention Mechanism
BERT: Bidirectional Encoder Representations from Transformers
Bi-LSTM: Bidirectional Long Short-Term Memory
BLEU: Bilingual Evaluation Understudy
BT: Back-Translation
BTC: Beyond the C
C2RUST: C to Rust Translator
C-BiGRU: Convolutional Neural Network–Bidirectional Gated Recurrent Unit
CA: Computational Accuracy
CARLCS-CNN: Co-Attentive Representation Learning Code Search—Convolutional Neural Network
CFG: Control Flow Graph
CI/CD: Continuous Integration/Continuous Deployment
CIRCLE: Continual Repair aCross Programming LanguagEs
CNN: Convolutional Neural Network
CODEnn: Code-Description Embedding Neural Network
CODIT: Contextualized Code Editing with Neural Machine Translation
CORE: COde REview Engine
COSAL: Code-to-Code Search Across Languages
CVE: Common Vulnerabilities and Exposures
DAE: Denoising Autoencoding
DFG: Data Flow Graph
DGMS: Deep Graph Matching and Searching
DL: Deep Learning
DNN: Deep Neural Network
DSI: Debugging Success Improvement
DSR@k: Debugging Success Rate at rank k
EM: Exact Match
ENCORE: Ensemble Learning using Convolution Neural Machine Translation for Automatic Program Repair
FP: False Positive
FN: False Negative
GAN: Generative Adversarial Network
GCJ: Google Code Jam
GCN: Graph Convolutional Network
GGNN: Gated Graph Neural Network
GNN: Graph Neural Network
GPT: Generative Pre-trained Transformer
GRU: Gated Recurrent Unit
IABLSTM: Impact Analysis with Attention Bidirectional Long Short-Term Memory
IDE: Integrated Development Environment
IR: Intermediate Representation
LLM: Large Language Model
LSTM: Long Short-Term Memory
MBPP: Mostly Basic Programming Problems
METEOR: Metric for Evaluation of Translation with Explicit ORdering
ML: Machine Learning
MMAN: Multi-Modal Attention Network
MNB: Multinomial Naive Bayes
MPLCS: Multi-Programming Language Code Search
MRR: Mean Reciprocal Rank
NDCG@k: Normalized Discounted Cumulative Gain at rank k
NLM: Neural Language Model
NLP: Natural Language Processing
NMT: Neural Machine Translation
NVD: National Vulnerability Database
OOV: Out-of-Vocabulary
PATOIS: Probabilistic Tree Grammar–Based Program Synthesizer
PCSD: Python Code and Doc-string pairs
PEFT: Parameter-Efficient Fine-Tuning
PL: Programming Language
pTSG: Probabilistic Tree Substitution Grammar
RAE: Recursive Autoencoder
RGCNs: Relational Graph Convolutional Networks
ResNet: Residual Neural Network (Residual Network)
RNN: Recurrent Neural Network
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
RR: Repairable Rate
SBCS: SentenceBERT + Cosine Similarity
SBED: SentenceBERT + Euclidean Distance
SDLC: Software Development Life Cycle
SE: Software Engineering
SLAMC: Semantic Language Model for Source Code
SMT: Statistical Machine Translation
SOTA: State-of-the-Art
SVM: Support Vector Machine
T5: Text-to-Text Transfer Transformer
TCA@k: Test Case Average at rank k
TF-IDF: Term Frequency–Inverse Document Frequency
TN: True Negative
TP: True Positive
TSED: Tree Similarity of Edit Distance
ViT: Vision Transformer
XFG: Contextual Flow Graph
XML: eXtensible Markup Language
XOR: Exclusive OR

References

  1. Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar]
  2. Kang, D.; Hovy, E. Plan ahead: Self-Supervised Text Planning for Paragraph Completion Task. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Webber, B., Cohn, T., He, Y., Liu, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 6533–6543. [Google Scholar] [CrossRef]
  3. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774. [Google Scholar]
  4. Cao, J.; Li, M.; Wen, M.; Cheung, S. A study on prompt design, advantages and limitations of ChatGPT for deep learning program repair. Autom. Softw. Eng. 2025, 32, 30. [Google Scholar] [CrossRef]
  5. Zhao, H.; Hui, J.; Howland, J.; Nguyen, N.; Zuo, S.; Hu, A.; Choquette-Choo, C.A.; Shen, J.; Kelley, J.; Bansal, K.; et al. CodeGemma: Open code models based on Gemma. arXiv 2024, arXiv:2406.11409. [Google Scholar] [CrossRef]
  6. Yang, Z.; Liu, F.; Yu, Z.; Keung, J.W.; Li, J.; Liu, S.; Hong, Y.; Ma, X.; Jin, Z.; Li, G. Exploring and Unleashing the Power of Large Language Models in Automated Code Translation. Proc. ACM Softw. Eng. 2024, 1, 1585–1608. [Google Scholar] [CrossRef]
  7. Ahmad, W.U.; Tushar, M.G.R.; Chakraborty, S.; Chang, K.W. AVATAR: A Parallel Corpus for Java-Python Program Translation. In Findings of the Association for Computational Linguistics: EACL 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 2268–2281. [Google Scholar] [CrossRef]
  8. Ahmad, W.; Chakraborty, S.; Ray, B.; Chang, K.W. A Transformer-based Approach for Source Code Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 4998–5007. [Google Scholar] [CrossRef]
  9. Chakraborty, S.; Ray, B. On multi-modal learning of editing source code. In Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering, Melbourne, Australia, 15–19 November 2021; IEEE Press: New York, NY, USA, 2021; pp. 443–455. [Google Scholar] [CrossRef]
  10. Katz, D.S.; Ruchti, J.; Schulte, E. Using recurrent neural networks for decompilation. In Proceedings of the 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), Campobasso, Italy, 20–23 March 2018; pp. 346–356. [Google Scholar] [CrossRef]
  11. Wu, Y.; Wang, W. Code Similarity Detection Based on Siamese Network. In Proceedings of the 2021 IEEE International Conference on Information Communication and Software Engineering (ICICSE), Chengdu, China, 19–21 March 2021; pp. 47–51. [Google Scholar] [CrossRef]
  12. Zhang, C.; Wang, J.; Zhou, Q.; Xu, T.; Tang, K.; Gui, H.; Liu, F. A Survey of Automatic Source Code Summarization. Symmetry 2022, 14, 471. [Google Scholar] [CrossRef]
  13. Uddin, M.N.; Zhang, Y.; Hei, X. Deep Learning Aided Software Vulnerability Detection: A Survey. arXiv 2025, arXiv:2503.04002. [Google Scholar] [CrossRef]
  14. Yang, Y.; Xia, X.; Lo, D.; Grundy, J. A Survey on Deep Learning for Software Engineering. ACM Comput. Surv. 2022, 54, 1–73. [Google Scholar] [CrossRef]
  15. Fontes, A.; Gay, G. The integration of machine learning into automated test generation: A systematic mapping study. Softw. Test. Verif. Reliab. 2023, 33, e1845. [Google Scholar] [CrossRef]
  16. Allamanis, M.; Barr, E.T.; Devanbu, P.; Sutton, C. A survey of machine learning for big code and naturalness. ACM Comput. Surv. (CSUR) 2018, 51, 1–37. [Google Scholar] [CrossRef]
  17. Samoaa, H.P.; Bayram, F.; Salza, P.; Leitner, P. A systematic mapping study of source code representation for deep learning in software engineering. IET Softw. 2022, 16, 351–385. [Google Scholar] [CrossRef]
  18. Amalfitano, D.; Faralli, S.; Hauck, J.C.R.; Matalonga, S.; Distante, D. Artificial Intelligence Applied to Software Testing: A Tertiary Study. ACM Comput. Surv. 2023, 56, 1–38. [Google Scholar] [CrossRef]
  19. Durrani, U.K.; Akpinar, M.; Adak, M.F.; Kabakus, A.T.; Öztürk, M.M.; Saleh, M. A Decade of Progress: A Systematic Literature Review on the Integration of AI in Software Engineering Phases and Activities (2013–2023). IEEE Access 2024, 12, 171185–171204. [Google Scholar] [CrossRef]
  20. Sofian, H.; Yunus, N.A.M.; Ahmad, R. Systematic Mapping: Artificial Intelligence Techniques in Software Engineering. IEEE Access 2022, 10, 51021–51040. [Google Scholar] [CrossRef]
  21. Crawford, T.; Duong, S.; Fueston, R.; Lawani, A.; Owoade, S.; Uzoka, A.; Parizi, R.M.; Yazdinejad, A. AI in Software Engineering: A Survey on Project Management Applications. arXiv 2023, arXiv:2307.15224. [Google Scholar] [CrossRef]
  22. Zhang, Q.; Fang, C.; Ma, Y.; Sun, W.; Chen, Z. A Survey of Learning-based Automated Program Repair. ACM Trans. Softw. Eng. Methodol. 2024, 33, 1–69. [Google Scholar] [CrossRef]
  23. Xiaomeng, W.; Tao, Z.; Wei, X.; Changyu, H. A survey on source code review using machine learning. In Proceedings of the 2018 3rd International Conference on Information Systems Engineering (ICISE), Shanghai, China, 4–6 May 2018; pp. 56–60. [Google Scholar]
  24. Akimova, E.N.; Bersenev, A.Y.; Deikov, A.A.; Kobylkin, K.S.; Konygin, A.V.; Mezentsev, I.P.; Misilov, V.E. A survey on software defect prediction using deep learning. Mathematics 2021, 9, 1180. [Google Scholar] [CrossRef]
  25. Xie, Y.; Lin, J.; Dong, H.; Zhang, L.; Wu, Z. Survey of Code Search Based on Deep Learning. ACM Trans. Softw. Eng. Methodol. 2023, 33, 1–42. [Google Scholar] [CrossRef]
  26. Zakeri-Nasrabadi, M.; Parsa, S.; Ramezani, M.; Roy, C.; Ekhtiarzadeh, M. A systematic literature review on source code similarity measurement and clone detection: Techniques, applications, and challenges. J. Syst. Softw. 2023, 204, 111796. [Google Scholar] [CrossRef]
  27. Al-Hossami, E.; Shaikh, S. A survey on artificial intelligence for source code: A dialogue systems perspective. arXiv 2022, arXiv:2202.04847. [Google Scholar] [CrossRef]
  28. Le, T.H.; Chen, H.; Babar, M.A. A survey on data-driven software vulnerability assessment and prioritization. ACM Comput. Surv. 2022, 55, 1–39. [Google Scholar] [CrossRef]
  29. Das, S.; Shah, C. Contextual Code Completion Using Machine Learning; Technical Report; Stanford University: Stanford, CA, USA, 2015. [Google Scholar]
  30. Moradi Dakhel, A.; Majdinasab, V.; Nikanjam, A.; Khomh, F.; Desmarais, M.C.; Jiang, Z.M.J. GitHub Copilot AI pair programmer: Asset or Liability? J. Syst. Softw. 2023, 203, 111734. [Google Scholar] [CrossRef]
  31. Ouyang, S.; Zhang, J.; Sun, Z.; Merono Penuela, A. Knowledge-Enhanced Program Repair for Data Science Code. In Proceedings of the 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), Ottawa, ON, Canada, 27 April–3 May 2025; pp. 898–910. [Google Scholar] [CrossRef]
  32. Lu, S.; Guo, D.; Ren, S.; Huang, J.; Svyatkovskiy, A.; Blanco, A.; Clement, C.B.; Drain, D.; Jiang, D.; Tang, D.; et al. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. arXiv 2021, arXiv:2102.04664. [Google Scholar] [CrossRef]
  33. Brereton, P.; Kitchenham, B.A.; Budgen, D.; Turner, M.; Khalil, M. Lessons from applying the systematic literature review process within the software engineering domain. J. Syst. Softw. 2007, 80, 571–583. [Google Scholar] [CrossRef]
  34. Kitchenham, B.; Charters, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering; Version 2.3 EBSE Technical Report; EBSE: Rio de Janeiro, Brazil, 2007. [Google Scholar]
  35. Watson, C.; Cooper, N.; Nader-Palacio, D.; Moran, K.; Poshyvanyk, D. A Systematic Literature Review on the Use of Deep Learning in Software Engineering Research. ACM Trans. Softw. Eng. Methodol. 2022, 31, 1–58. [Google Scholar] [CrossRef]
  36. Cao, S.; Sun, X.; Widyasari, R.; Lo, D.; Wu, X.; Bo, L.; Zhang, J.; Li, B.; Liu, W.; Wu, D.; et al. A Systematic Literature Review on Explainability for Machine/Deep Learning-based Software Engineering Research. arXiv 2025, arXiv:2401.14617. [Google Scholar]
  37. Le, T.H.M.; Chen, H.; Babar, M.A. Deep Learning for Source Code Modeling and Generation: Models, Applications, and Challenges. ACM Comput. Surv. 2020, 53, 1–38. [Google Scholar] [CrossRef]
  38. Hou, X.; Zhao, Y.; Liu, Y.; Yang, Z.; Wang, K.; Li, L.; Luo, X.; Lo, D.; Grundy, J.; Wang, H. Large language models for software engineering: A systematic literature review. ACM Trans. Softw. Eng. Methodol. 2024, 33, 1–79. [Google Scholar] [CrossRef]
  39. Fan, A.; Gokkaya, B.; Harman, M.; Lyubarskiy, M.; Sengupta, S.; Yoo, S.; Zhang, J.M. Large language models for software engineering: Survey and open problems. In Proceedings of the 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), Melbourne, Australia, 14–20 May 2023; pp. 31–53. [Google Scholar]
  40. Wu, B.; Zou, F. Code vulnerability detection based on deep sequence and graph models: A survey. Secur. Commun. Netw. 2022, 2022, 62–73. [Google Scholar] [CrossRef]
  41. Dou, S.; Shan, J.; Jia, H.; Deng, W.; Xi, Z.; He, W.; Wu, Y.; Gui, T.; Liu, Y.; Huang, X. Towards Understanding the Capability of Large Language Models on Code Clone Detection: A Survey. arXiv 2023, arXiv:2308.01191. [Google Scholar] [CrossRef]
  42. Zhong, W.; Li, C.; Ge, J.; Luo, B. Neural program repair: Systems, challenges and solutions. In Proceedings of the 13th Asia-Pacific Symposium on Internetware, Hohhot, China, 11–12 June 2022; pp. 96–106. [Google Scholar]
  43. Huang, K.; Xu, Z.; Yang, S.; Sun, H.; Li, X.; Yan, Z.; Zhang, Y. A Survey on Automated Program Repair Techniques. arXiv 2023, arXiv:2303.18184. [Google Scholar] [CrossRef]
  44. Wang, J.; Huang, Y.; Chen, C.; Liu, Z.; Wang, S.; Wang, Q. Software testing with large language models: Survey, landscape, and vision. IEEE Trans. Softw. Eng. 2024, 50, 911–936. [Google Scholar] [CrossRef]
  45. Katsogiannis-Meimarakis, G.; Koutrika, G. A survey on deep learning approaches for text-to-SQL. VLDB J. 2023, 32, 905–936. [Google Scholar] [CrossRef]
  46. Grazia, L.D.; Pradel, M. Code Search: A Survey of Techniques for Finding Code. ACM Comput. Surv. 2023, 55, 1–31. [Google Scholar] [CrossRef]
  47. Zheng, Z.; Ning, K.; Wang, Y.; Zhang, J.; Zheng, D.; Ye, M.; Chen, J. A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends. arXiv 2024, arXiv:2311.10372. [Google Scholar] [CrossRef]
  48. Zan, D.; Chen, B.; Zhang, F.; Lu, D.; Wu, B.; Guan, B.; Yongji, W.; Lou, J.G. Large language models meet NL2Code: A survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 7443–7464. [Google Scholar]
  49. Wong, M.F.; Guo, S.; Hang, C.N.; Ho, S.W.; Tan, C.W. Natural Language Generation and Understanding of Big Code for AI-Assisted Programming: A Review. Entropy 2023, 25, 888. [Google Scholar] [CrossRef] [PubMed]
  50. Zheng, Z.; Ning, K.; Zhong, Q.; Chen, J.; Chen, W.; Guo, L.; Wang, W.; Wang, Y. Towards an understanding of large language models in software engineering tasks. Empir. Softw. Eng. 2024, 30, 50. [Google Scholar] [CrossRef]
  51. Xu, Y.; Zhu, Y. A Survey on Pretrained Language Models for Neural Code Intelligence. arXiv 2022, arXiv:2212.10079. [Google Scholar] [CrossRef]
  52. Batarseh, F.A.; Mohod, R.; Kumar, A.; Bui, J. The application of artificial intelligence in software engineering: A review challenging conventional wisdom. Data Democr. 2020, 179–232. [Google Scholar] [CrossRef]
  53. Song, X.; Sun, H.; Wang, X.; Yan, J. A Survey of Automatic Generation of Source Code Comments: Algorithms and Techniques. IEEE Access 2019, 7, 111411–111428. [Google Scholar] [CrossRef]
  54. Ahmed, A.; Azab, S.S.; Abdelhamid, Y. Source-Code Generation Using Deep Learning: A Survey. In Progress in Artificial Intelligence—22nd EPIA Conference on Artificial Intelligence, EPIA 2023, Faial Island, Azores, September 5–8, 2023, Proceedings, Part II; Lecture Notes in Computer Science; Moniz, N., Vale, Z., Cascalho, J., Silva, C., Sebastião, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2023; Volume 14116, pp. 467–482. [Google Scholar] [CrossRef]
  55. Chen, X.; Xue, J.; Xie, X.; Liang, C.; Ju, X. A Systematic Literature Review on Neural Code Translation. arXiv 2025, arXiv:2505.07425. [Google Scholar] [CrossRef]
  56. Lei, M.; Li, H.; Li, J.; Aundhkar, N.; Kim, D. Deep learning application on code clone detection: A review of current knowledge. J. Syst. Softw. 2022, 184, 111141. [Google Scholar] [CrossRef]
  57. Chen, X.; Hu, X.; Huang, Y.; Jiang, H.; Ji, W.; Jiang, Y.; Jiang, Y.; Liu, B.; Liu, H.; Li, X.; et al. Deep learning-based software engineering: Progress, challenges, and opportunities. Sci. China Inf. Sci. 2025, 68, 111102. [Google Scholar] [CrossRef]
  58. Jiang, J.; Wang, F.; Shen, J.; Kim, S.; Kim, S. A Survey on Large Language Models for Code Generation. ACM Trans. Softw. Eng. Methodol. 2025, accepted. [Google Scholar] [CrossRef]
  59. Mathai, A.; Sedamaki, K.; Das, D.; Mathews, N.S.; Tamilselvam, S.; Chimalakonda, S.; Kumar, A. CodeSAM: Source Code Representation Learning by Infusing Self-Attention with Multi-Code-View Graphs. arXiv 2024, arXiv:2411.14611. [Google Scholar]
  60. Sharma, T.; Kechagia, M.; Georgiou, S.; Tiwari, R.; Sarro, F. A Survey on Machine Learning Techniques for Source Code Analysis; Elsevier Science Inc.: Amsterdam, The Netherlands, 2024. [Google Scholar]
  61. Xiao, Y.; Zuo, X.; Lu, X.; Dong, J.S.; Cao, X.; Beschastnikh, I. Promises and perils of using Transformer-based models for SE research. Neural Netw. 2024, 184, 107067. [Google Scholar] [CrossRef]
  62. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  63. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  64. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
  65. Conneau, A.; Lample, G. Cross-lingual language model pretraining. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 7059–7069. [Google Scholar]
  66. Liu, F.; Li, G.; Zhao, Y.; Jin, Z. Multi-task learning based pre-trained language model for code completion. In ASE’20: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering; Association for Computing Machinery: New York, NY, USA, 2020; pp. 473–485. [Google Scholar] [CrossRef]
  67. Jin, D.; Jin, Z.; Hu, Z.; Vechtomova, O.; Mihalcea, R. Deep learning for text style transfer: A survey. Comput. Linguist. 2022, 48, 155–205. [Google Scholar] [CrossRef]
  68. Pham, Q.N.; Waibel, A.; Niehues, J. Adaptive multilingual speech recognition with pretrained models. In Proceedings of the Conference of the International Speech Communication Association, Incheon, Republic of Korea, 18–22 September 2022. [Google Scholar]
  69. Guo, D.; Ren, S.; Lu, S.; Feng, Z.; Tang, D.; Shujie, L.; Zhou, L.; Duan, N.; Svyatkovskiy, A.; Fu, S.; et al. GraphCodeBERT: Pre-training Code Representations with Data Flow. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
  70. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 1, pp. 4171–4186. [Google Scholar] [CrossRef]
  71. Wang, Y.; Wang, W.; Joty, S.; Hoi, S.C. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Virtual, 7–11 November 2021; Moens, M.F., Huang, X., Specia, L., Yih, S.W.t., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 8696–8708. [Google Scholar] [CrossRef]
  72. Laurençon, H.; Saulnier, L.; Wang, T.; Akiki, C.; del Moral, A.V.; Scao, T.L.; von Werra, L.; Mou, C.; Ponferrada, E.G.; Nguyen, H.; et al. The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset. Adv. Neural Inf. Process. Syst. 2022, 35, 31809–31826. [Google Scholar] [CrossRef]
  73. Kanade, A.; Maniatis, P.; Balakrishnan, G.; Shi, K. Learning and evaluating contextual embedding of source code. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 5110–5121. [Google Scholar]
  74. Hägglund, M.; Pena, F.J.; Pashami, S.; Al-Shishtawy, A.; Payberah, A.H. COCLUBERT: Clustering Machine Learning Source Code. In Proceedings of the 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), Pasadena, CA, USA, 13–16 December 2021; pp. 151–158. [Google Scholar]
  75. Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020; Cohn, T., He, Y., Liu, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1536–1547. [Google Scholar] [CrossRef]
  76. Guo, D.; Lu, S.; Duan, N.; Wang, Y.; Zhou, M.; Yin, J. UniXcoder: Unified cross-modal pre-training for code representation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022. [Google Scholar]
  77. Guo, D.; Xu, C.; Duan, N.; Yin, J.; McAuley, J. LongCoder: A long-range pre-trained language model for code completion. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 12098–12107. [Google Scholar]
  78. Muennighoff, N.; Wang, T.; Sutawika, L.; Roberts, A.; Biderman, S.; Le Scao, T.; Bari, M.S.; Shen, S.; Yong, Z.X.; Schoelkopf, H.; et al. Crosslingual Generalization through Multitask Finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 15991–16111. [Google Scholar]
  79. Roziere, B.; Lachaux, M.A.; Chanussot, L.; Lample, G. Unsupervised translation of programming languages. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020. [Google Scholar]
  80. Liu, X.; Xv, L. Abstract summarization based on the combination of transformer and LSTM. In Proceedings of the 2019 International Conference on Intelligent Computing, Automation and Systems (ICICAS), Chongqing, China, 6–8 December 2019; pp. 923–927. [Google Scholar]
  81. Aggarwal, K.; Salameh, M.; Hindle, A. Using machine translation for converting Python 2 to Python 3 code. PeerJ 2015, 3, e1459v1. [Google Scholar] [CrossRef]
  82. Gu, X.; Zhang, H.; Zhang, D.; Kim, S. DeepAM: Migrate APIs with multi-modal sequence to sequence learning. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; AAAI Press: Washington, DC, USA, 2017; pp. 3675–3681. [Google Scholar]
  83. Takizawa, H.; Hirasawa, S.; Hayashi, Y.; Egawa, R.; Kobayashi, H. Xevolver: An XML-based code translation framework for supporting HPC application migration. In Proceedings of the 2014 21st International Conference on High Performance Computing (HiPC), Goa, India, 17–20 December 2014; pp. 1–11. [Google Scholar] [CrossRef]
  84. Bhatia, S.; Qiu, J.; Hasabnis, N.; Seshia, S.A.; Cheung, A. Verified Code Transpilation with LLMs. In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, BC, Canada, 9–15 December 2024. [Google Scholar]
  85. Gebreslassie, M.G.; Ji, S.; Roh, M.; Im, H. Leveraging QLoRA on Code Large Language Models for Multilingual Code Translation. KIISE Trans. Comput. Pract. 2025, 31, 152–157. [Google Scholar] [CrossRef]
  86. C2Rust—Migrate C code to Rust. GitHub Repository. 2023. Available online: https://github.com/immunant/c2rust.git (accessed on 9 October 2025).
  87. J2cstranslator—Java to CSharp Translator. SourceForge Project. 2013. Available online: https://sourceforge.net/projects/j2cstranslator/ (accessed on 9 October 2025).
  88. Balogh, G.; Mudalige, G.; Reguly, I.; Antao, S.; Bertolli, C. OP2-Clang: A Source-to-Source Translator Using Clang/LLVM LibTooling. In Proceedings of the 2018 IEEE/ACM 5th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC), Dallas, TX, USA, 12 November 2018; pp. 59–70. [Google Scholar] [CrossRef]
  89. Hung-Cuong, N.; Quyet-Thang, H.; Ba-Vuong, T. Rule-Based Techniques Using Abstract Syntax Tree for Code Optimization and Secure Programming in Java. In Context-Aware Systems and Applications; Vinh, P.C., Alagar, V., Vassev, E., Khare, A., Eds.; Springer: Cham, Switzerland, 2014; pp. 168–177. [Google Scholar]
  90. Nguyen, A.T.; Nguyen, T.T.; Nguyen, T.N. Lexical statistical machine translation for language migration. In ESEC/FSE 2013: Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering; Association for Computing Machinery: New York, NY, USA, 2013; pp. 651–654. [Google Scholar] [CrossRef]
  91. Karaivanov, S.; Raychev, V.; Vechev, M. Phrase-Based Statistical Translation of Programming Languages. In Onward! 2014, Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software; ACM: New York, NY, USA, 2014; pp. 173–184. [Google Scholar] [CrossRef]
  92. Nguyen, T.T.; Nguyen, A.T.; Nguyen, H.A.; Nguyen, T.N. A statistical semantic language model for source code. In ESEC/FSE 2013: Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering; Association for Computing Machinery: New York, NY, USA, 2013; pp. 532–542. [Google Scholar] [CrossRef]
  93. Roziere, B.; Zhang, J.M.; Charton, F.; Harman, M.; Synnaeve, G.; Lample, G. Leveraging Automated Unit Tests for Unsupervised Code Translation. arXiv 2022, arXiv:2110.06773. [Google Scholar] [CrossRef]
  94. Zhang, Q.; Fang, C.; Zheng, Y.; Zhang, Y.; Zhao, Y.; Huang, R.; Zhou, J.; Yang, Y.; Zheng, T.; Chen, Z. Improving Deep Assertion Generation via Fine-Tuning Retrieval-Augmented Pre-trained Language Models. ACM Trans. Softw. Eng. Methodol. 2025, accepted. [Google Scholar] [CrossRef]
  95. Ahmad, W.; Chakraborty, S.; Ray, B.; Chang, K.W. Unified Pre-training for Program Understanding and Generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, online, 6–11 June 2021; Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., Zhou, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 2655–2668. [Google Scholar] [CrossRef]
  96. Ren, S.; Guo, D.; Lu, S.; Zhou, L.; Liu, S.; Tang, D.; Sundaresan, N.; Zhou, M.; Blanco, A.; Ma, S. CodeBLEU: A Method for Automatic Evaluation of Code Synthesis. arXiv 2020, arXiv:2009.10297. [Google Scholar] [CrossRef]
  97. To, H.; Nguyen, M.; Bui, N. Functional Overlap Reranking for Neural Code Generation. In Proceedings of the Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 3686–3704. [Google Scholar] [CrossRef]
  98. Miceli-Barone, A.V.; Sennrich, R. A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Taipei, Taiwan, 27 November 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 314–319. [Google Scholar]
  99. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  100. Rabinovich, M.; Stern, M.; Klein, D. Abstract Syntax Networks for Code Generation and Semantic Parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 1139–1149. [Google Scholar]
  101. Amodio, M.; Chaudhuri, S.; Reps, T.W. Neural Attribute Machines for Program Generation. arXiv 2021, arXiv:1705.09231. [Google Scholar] [CrossRef]
  102. Murali, V.; Qi, L.; Chaudhuri, S.; Jermaine, C. Neural Sketch Learning for Conditional Program Generation. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  103. Chen, B.; Zhang, F.; Nguyen, A.; Zan, D.; Lin, Z.; Lou, J.G.; Chen, W. CodeT: Code Generation with Generated Tests. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  104. Liu, J.; Xia, C.S.; Wang, Y.; Zhang, L. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  105. Storhaug, A.; Li, J.; Hu, T. Efficient avoidance of vulnerabilities in auto-completed smart contract code using vulnerability-constrained decoding. In Proceedings of the 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), Florence, Italy, 9–12 October 2023; pp. 683–693. [Google Scholar]
  106. Guo, D.; Zhu, Q.; Yang, D.; Xie, Z.; Dong, K.; Zhang, W.; Chen, G.; Bi, X.; Wu, Y.; Li, Y.K.; et al. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. arXiv 2024, arXiv:2401.14196. [Google Scholar]
  107. Roziere, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Remez, T.; Rapin, J.; et al. Code Llama: Open foundation models for code. arXiv 2023, arXiv:2308.12950. [Google Scholar]
  108. Wang, D.; Guo, Y.; Dong, W.; Wang, Z.; Liu, H.; Li, S. Deep Code-Comment Understanding and Assessment. IEEE Access 2019, 7, 174200–174209. [Google Scholar] [CrossRef]
  109. Mastropaolo, A.; Aghajani, E.; Pascarella, L.; Bavota, G. An Empirical Study on Code Comment Completion. In Proceedings of the 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME), Bogotá, Colombia, 1–6 October 2021; pp. 159–170. [Google Scholar]
  110. Shiina, H.; Onishi, S.; Takahashi, A.; Kobayashi, N. Automatic Comment Generation for Source Code Using External Information by Neural Networks for Computational Thinking. Int. J. Smart Comput. Artif. Intell. 2020, 4, 39–61. [Google Scholar] [CrossRef]
  111. Haije, T.; Intelligentie, B.O.K.; Gavves, E.; Heuer, H. Automatic comment generation using a neural translation model. Inf. Softw. Technol 2016, 55, 258–268. [Google Scholar]
  112. Gros, D.; Sezhiyan, H.; Devanbu, P.; Yu, Z. Code to comment “translation” data, metrics, baselining & evaluation. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, Melbourne, Australia, 21–25 September 2020; pp. 746–757. [Google Scholar]
  113. Katz, O.; Olshaker, Y.; Goldberg, Y.; Yahav, E. Towards neural decompilation. arXiv 2019, arXiv:1905.08325. [Google Scholar] [CrossRef]
  114. Liang, R.; Cao, Y.; Hu, P.; Chen, K. Neutron: An attention-based neural decompiler. Cybersecurity 2021, 4, 5–17. [Google Scholar] [CrossRef]
  115. Hosseini, I.; Dolan-Gavitt, B. Beyond the C: Retargetable Decompilation using Neural Machine Translation. In Proceedings of the Network and Distributed System Security (NDSS) Symposium, San Diego, CA, USA, 24–28 April 2022. [Google Scholar]
  116. Szafraniec, M.; Roziere, B.; Leather, H.J.; Labatut, P.; Charton, F.; Synnaeve, G. Code Translation with Compiler Representations. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023; pp. 1–20. [Google Scholar]
  117. Gu, X.; Zhang, H.; Kim, S. Deep code search. In Proceedings of the 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), Gothenburg, Sweden, 27 May–3 June 2018; pp. 933–944. [Google Scholar]
  118. Shuai, J.; Xu, L.; Liu, C.; Yan, M.; Xia, X.; Lei, Y. Improving code search with co-attentive representation learning. In Proceedings of the 28th International Conference on Program Comprehension, Seoul, Republic of Korea, 13–15 July 2020; pp. 196–207. [Google Scholar]
  119. Mathew, G.; Stolee, K.T. Cross-language code search using static and dynamic analyses. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, 23–28 August 2021; pp. 205–217. [Google Scholar]
  120. Ling, X.; Wu, L.; Wang, S.; Pan, G.; Ma, T.; Xu, F.; Liu, A.X.; Wu, C.; Ji, S. Deep graph matching and searching for semantic code retrieval. ACM Trans. Knowl. Discov. Data 2021, 15, 1–21. [Google Scholar] [CrossRef]
  121. Van Nguyen, T.; Nguyen, A.T.; Phan, H.D.; Nguyen, T.D.; Nguyen, T.N. Combining Word2Vec with Revised Vector Space Model for Better Code Retrieval. In Proceedings of the 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C), Buenos Aires, Argentina, 20–28 May 2017; pp. 183–185. [Google Scholar] [CrossRef]
  122. Wan, Y.; Shu, J.; Sui, Y.; Xu, G.; Zhao, Z.; Wu, J.; Yu, P. Multi-modal attention network learning for semantic source code retrieval. In Proceedings of the 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), San Diego, CA, USA, 11–15 November 2019; pp. 13–25. [Google Scholar]
  123. de Rezende Martins, M.; Gerosa, M.A. CoNCRA: A Convolutional Neural Networks Code Retrieval Approach. In Proceedings of the 34th Brazilian Symposium on Software Engineering, SBES 2020, Natal, Brazil, 19–23 October 2020; Cavalcante, E., Dantas, F., Batista, T., Eds.; ACM: New York, NY, USA, 2020; pp. 526–531. [Google Scholar] [CrossRef]
  124. Li, W.; Xu, J.; Chen, Q. Knowledge Distillation-Based Multilingual Fusion Code Retrieval. Algorithms 2022, 15, 25. [Google Scholar] [CrossRef]
  125. Zhang, F.; Chen, B.; Zhang, Y.; Liu, J.; Zan, D.; Mao, Y.; Lou, J.G.; Chen, W. RepoCoder: Repository-level code completion through iterative retrieval and generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023. [Google Scholar]
  126. Rahman, M.; Watanobe, Y.; Nakamura, K. A neural network based intelligent support model for program code completion. Sci. Program. 2020, 2020, 1–8. [Google Scholar] [CrossRef]
  127. Svyatkovskiy, A.; Deng, S.K.; Fu, S.; Sundaresan, N. IntelliCode compose: Code generation using transformer. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual, 8–13 November 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1433–1443. [Google Scholar]
  128. Zhou, W.; Kim, S.; Murali, V.; Aye, G.A. Improving code autocompletion with transfer learning. In Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice, Pittsburgh, PA, USA, 8–20 May 2022; pp. 161–162. [Google Scholar]
  129. Jimenez, C.E.; Yang, J.; Wettig, A.; Yao, S.; Pei, K.; Press, O.; Narasimhan, K. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? In Proceedings of the The Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  130. Li, J.; Li, G.; Li, Z.; Jin, Z.; Hu, X.; Zhang, K.; Fu, Z. CodeEditor: Learning to edit source code with pre-trained models. ACM Trans. Softw. Eng. Methodol. 2023, 32, 1–22. [Google Scholar] [CrossRef]
  131. Gao, Y.; Lyu, C. M2TS: Multi-Scale Multi-Modal Approach Based on Transformer for Source Code Summarization. In Proceedings of the 30th International Conference on Program Comprehension, Virtual Event, 16–17 May 2022; Association for Computing Machinery: New York, NY, USA, 2022. [Google Scholar]
  132. Wang, W.; Zhang, Y.; Sui, Y.; Wan, Y.; Zhao, Z.; Wu, J.; Yu, P.S.; Xu, G. Reinforcement-Learning-Guided Source Code Summarization Using Hierarchical Attention. IEEE Trans. Softw. Eng. 2022, 48, 102–119. [Google Scholar] [CrossRef]
  133. Wang, Y.; Dong, Y.; Lu, X.; Zhou, A. GypSum: Learning Hybrid Representations for Code Summarization. In Proceedings of the 2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC), Pittsburgh, PA, USA, 16–17 May 2022; pp. 12–23. [Google Scholar]
  134. Tang, Z.; Shen, X.; Li, C.; Ge, J.; Huang, L.; Zhu, Z.; Luo, B. AST-Trans: Code Summarization with Efficient Tree-Structured Attention. In Proceedings of the 44th International Conference on Software Engineering, Pittsburgh, PA, USA, 25–27 May 2022; pp. 150–162. [Google Scholar]
  135. Haque, S.; Bansal, A.; Wu, L.; McMillan, C. Action word prediction for neural source code summarization. In Proceedings of the 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Honolulu, HI, USA, 9–12 March 2021; pp. 330–341. [Google Scholar]
  136. Siow, J.K.; Gao, C.; Fan, L.; Chen, S.; Liu, Y. CORE: Automating review recommendation for code changes. In Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), London, ON, Canada, 18–21 February 2020; pp. 284–295. [Google Scholar]
  137. Hoang, T.; Kang, H.J.; Lo, D.; Lawall, J. CC2Vec: Distributed representations of code changes. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, Seoul, Republic of Korea, 27 June–19 July 2020; pp. 518–529. [Google Scholar]
  138. Pathik, B.; Sharma, M. Source code change analysis with deep learning based programming model. Autom. Softw. Eng. 2022, 29, 1–25. [Google Scholar] [CrossRef]
  139. Tufano, M.; Pantiuchina, J.; Watson, C.; Bavota, G.; Poshyvanyk, D. On learning meaningful code changes via neural machine translation. In Proceedings of the 41st International Conference on Software Engineering, Montreal, QC, Canada, 25–31 May 2019; pp. 25–36. [Google Scholar] [CrossRef]
  140. Zhang, L.; Feng, Z.; Ren, W.; Luo, H. Siamese-Based BiLSTM Network for Scratch Source Code Similarity Measuring. In Proceedings of the 2020 International Wireless Communications and Mobile Computing (IWCMC), Limassol, Cyprus, 15–19 June 2020; pp. 1800–1805. [Google Scholar]
  141. Zhang, F.; Li, G.; Liu, C.; Song, Q. Flowchart-based cross-language source code similarity detection. Sci. Program. 2020, 2020, 1–15. [Google Scholar] [CrossRef]
  142. Fried, D.; Aghajanyan, A.; Lin, J.; Wang, S.; Wallace, E.; Shi, F.; Zhong, R.; Yih, S.; Zettlemoyer, L.; Lewis, M. InCoder: A Generative Model for Code Infilling and Synthesis. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  143. Ji, S.; Choi, S.; Ko, S.; Kim, D.; Im, H. RepCoder: An automated program repair framework for probability-based program synthesis. In Proceedings of the SAC ’22: The 37th ACM/SIGAPP Symposium on Applied Computing, Virtual Event, 25–29 April 2022; Hong, J., Bures, M., Park, J.W., Cerný, T., Eds.; ACM: New York, NY, USA, 2022; pp. 1554–1561. [Google Scholar] [CrossRef]
  144. Shin, R.; Allamanis, M.; Brockschmidt, M.; Polozov, O. Program synthesis and semantic parsing with learned code idioms. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: Red Hook, NY, USA, 2019; pp. 10825–10835. [Google Scholar]
  145. Chen, X.; Song, D.; Tian, Y. Latent execution for neural program synthesis. In Proceedings of the 35th International Conference on Neural Information Processing Systems, Online, 6–14 December 2021; Curran Associates Inc.: Red Hook, NY, USA, 2021. [Google Scholar]
  146. Nijkamp, E.; Pang, B.; Hayashi, H.; Tu, L.; Wang, H.; Zhou, Y.; Savarese, S.; Xiong, C. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  147. Shin, R.; Kant, N.; Gupta, K.; Bender, C.; Trabucco, B.; Singh, R.; Song, D. Synthetic Datasets for Neural Program Synthesis. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  148. Chen, X.; Liu, C.; Song, D. Towards synthesizing complex programs from input-output examples. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  149. Wang, X.; Wang, Y.; Wan, Y.; Wang, J.; Zhou, P.; Li, L.; Wu, H.; Liu, J. CODE-MVP: Learning to represent source code from multiple views with contrastive pre-training. In Findings of the Association for Computational Linguistics: NAACL 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022. [Google Scholar]
  150. Chakraborty, S.; Ahmed, T.; Ding, Y.; Devanbu, P.T.; Ray, B. NatGen: Generative pre-training by “naturalizing” source code. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual, 14–16 November 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 18–30. [Google Scholar]
  151. Alon, U.; Zilberstein, M.; Levy, O.; Yahav, E. A general path-based representation for predicting program properties. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, Philadelphia, PA, USA, 18–22 June 2018; Foster, J.S., Grossman, D., Eds.; ACM: New York, NY, USA, 2018; pp. 404–419. [Google Scholar] [CrossRef]
  152. Ben-Nun, T.; Jakobovits, A.S.; Hoefler, T. Neural code comprehension: A learnable representation of code semantics. In NIPS’18: Proceedings of the 32nd International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2018; pp. 3589–3601. [Google Scholar]
  153. Brockschmidt, M.; Allamanis, M.; Gaunt, A.L.; Polozov, O. Generative Code Modeling with Graphs. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  154. Karampatsis, R.M.; Sutton, C. Maybe deep neural networks are the best choice for modeling source code. arXiv 2019, arXiv:1903.05734. [Google Scholar] [CrossRef]
  155. Zhang, J.; Wang, X.; Zhang, H.; Sun, H.; Wang, K.; Liu, X. A novel neural source code representation based on abstract syntax tree. In ICSE ’19: Proceedings of the 41st International Conference on Software Engineering; IEEE Press: New York, NY, USA, 2019; pp. 783–794. [Google Scholar] [CrossRef]
  156. Sui, Y.; Cheng, X.; Zhang, G.; Wang, H. Flow2Vec: Value-flow-based precise code embedding. Proc. ACM Program. Lang. 2020, 4, 233. [Google Scholar] [CrossRef]
  157. Zhang, J.; Hong, H.; Zhang, Y.; Wan, Y.; Liu, Y.; Sui, Y. Disentangled Code Representation Learning for Multiple Programming Languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021; Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 4454–4466. [Google Scholar] [CrossRef]
  158. Niu, C.; Li, C.; Ng, V.; Ge, J.; Huang, L.; Luo, B. SPT-code: Sequence-to-sequence pre-training for learning source code representations. In Proceedings of the 44th International Conference on Software Engineering, Pittsburgh, PA, USA, 25–27 May 2022; pp. 2006–2018. [Google Scholar]
  159. Alreshedy, K.; Dharmaretnam, D.; German, D.M.; Srinivasan, V.; Gulliver, T.A. SCC: Automatic classification of code snippets. In Proceedings of the 2018 IEEE 18th International Working Conference on Source Code Analysis and Manipulation (SCAM), Madrid, Spain, 23–24 September 2018. [Google Scholar]
  160. Reyes, J.; Ramírez, D.; Paciello, J. Automatic classification of source code archives by programming language: A deep learning approach. In Proceedings of the 2016 International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, 15–17 December 2016; pp. 514–519. [Google Scholar]
  161. Gilda, S. Source code classification using Neural Networks. In Proceedings of the 2017 14th International Joint Conference on Computer Science and Software Engineering (JCSSE), Nakhon Si Thammarat, Thailand, 12–14 July 2017; pp. 1–6. [Google Scholar]
  162. Ohashi, H.; Watanobe, Y. Convolutional neural network for classification of source codes. In Proceedings of the 2019 IEEE 13th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip (MCSoC), Singapore, 1–4 October 2019; pp. 194–200. [Google Scholar]
  163. Ifham, M.; Kumara, B.S.; Ekanayaka, E.B. Neural Network-based Approach for Source Code Classification to Enhance Software Maintainability and Reusability. In Proceedings of the 2021 From Innovation To Impact (FITI), Colombo, Sri Lanka, 8 December 2021; Volume 1, pp. 1–6. [Google Scholar]
  164. Barr, J.R.; Shaw, P.; Abu-Khzam, F.N.; Yu, S.; Yin, H.; Thatcher, T. Combinatorial code classification & vulnerability rating. In Proceedings of the 2020 Second International Conference on Transdisciplinary AI (TransAI), Irvine, CA, USA, 21–23 September 2020; pp. 80–83. [Google Scholar]
  165. Guseva, P.; Drozdova, A.; Denisenko, N.; Sapozhnikova, D.; Pyaternev, I.; Scherbakova, A.; Ustuzhanin, A. Semantic Code Classification for Automated Machine Learning. arXiv 2022, arXiv:2201.11252. [Google Scholar] [CrossRef]
  166. Russell, R.; Kim, L.; Hamilton, L.; Lazovich, T.; Harer, J.; Ozdemir, O.; Ellingwood, P.; McConley, M. Automated vulnerability detection in source code using deep representation learning. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018; pp. 757–762. [Google Scholar]
  167. An, W.; Chen, L.; Wang, J.; Du, G.; Shi, G.; Meng, D. AVDHRAM: Automated vulnerability detection based on hierarchical representation and attention mechanism. In Proceedings of the 2020 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), Exeter, UK, 17–19 December 2020; pp. 337–344. [Google Scholar]
  168. Mao, Y.; Li, Y.; Sun, J.; Chen, Y. Explainable Software vulnerability detection based on Attention-based Bidirectional Recurrent Neural Networks. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Virtual, 10–13 December 2020; pp. 4651–4656. [Google Scholar]
  169. Wu, T.; Chen, L.; Du, G.; Zhu, C.; Shi, G. Self-attention based automated vulnerability detection with effective data representation. In Proceedings of the 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), New York, NY, USA, 30 September–3 October 2021; pp. 892–899. [Google Scholar]
  170. Zheng, W.; Semasaba, A.O.A.; Wu, X.; Agyemang, S.A.; Liu, T.; Ge, Y. Representation vs. Model: What Matters Most for Source Code Vulnerability Detection. In Proceedings of the 28th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2021, Honolulu, HI, USA, 9–12 March 2021; IEEE: New York, NY, USA, 2021; pp. 647–653. [Google Scholar] [CrossRef]
  171. Wu, T.; Chen, L.; Du, G.; Zhu, C.; Cui, N.; Shi, G. Inductive Vulnerability Detection via Gated Graph Neural Network. In Proceedings of the 2022 IEEE 25th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Hangzhou, China, 4–6 May 2022; pp. 519–524. [Google Scholar]
  172. Wang, H.; Ye, G.; Tang, Z.; Tan, S.H.; Huang, S.; Fang, D.; Feng, Y.; Bian, L.; Wang, Z. Combining Graph-Based Learning With Automated Data Collection for Code Vulnerability Detection. IEEE Trans. Inf. Forensics Secur. 2021, 16, 1943–1958. [Google Scholar] [CrossRef]
  173. Du, X.; Wen, M.; Zhu, J.; Xie, Z.; Ji, B.; Liu, H.; Shi, X.; Jin, H. Generalization-Enhanced Code Vulnerability Detection via Multi-Task Instruction Fine-Tuning. In Findings of the Association for Computational Linguistics: ACL 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 10507–10521. [Google Scholar] [CrossRef]
  174. Wei, J.; Durrett, G.; Dillig, I. TypeT5: Seq2seq Type Inference using Static Analysis. In Proceedings of the The Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  175. Ramu, R.; Upadhyaya, G.; Nguyen, H.A.; Rajan, H. Hybrid traversal: Efficient source code analysis at scale. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings, Gothenburg, Sweden, 27 May–3 June 2018; pp. 412–413. [Google Scholar]
  176. Ramadan, T.; Islam, T.Z.; Phelps, C.; Pinnow, N.; Thiagarajan, J.J. Comparative Code Structure Analysis using Deep Learning for Performance Prediction. In Proceedings of the 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Stony Brook, NY, USA, 28–30 March 2021; pp. 151–161. [Google Scholar] [CrossRef]
  177. Mezhuev, P.; Gerasimov, A.; Privalov, P.; Butkevich, V. A dynamic algorithm for source code static analysis. In Proceedings of the 2021 Ivannikov Memorial Workshop (IVMEM), Nizhny Novgorod, Russian Federation, 24–25 September 2021; pp. 57–60. [Google Scholar]
  178. Qayum, A.; Khan, S.U.R.; Inayat-Ur-Rehman; Akhunzada, A. FineCodeAnalyzer: Multi-Perspective Source Code Analysis Support for Software Developer Through Fine-Granular Level Interactive Code Visualization. IEEE Access 2022, 10, 20496–20513. [Google Scholar] [CrossRef]
  179. Sargsyan, S.; Vardanyan, V.; Aslanyan, H.; Harutunyan, M.; Mehrabyan, M.; Sargsyan, K.; Hovahannisyan, H.; Movsisyan, H.; Hakobyan, J.; Kurmangaleev, S. GENES ISP: Code analysis platform. In Proceedings of the 2020 Ivannikov Ispras Open Conference (ISPRAS), Moscow, Russia, 10–11 December 2020; pp. 35–39. [Google Scholar]
  180. Kurtukova, A.; Romanov, A.; Shelupanov, A.; Fedotova, A. Complex Cases of Source Code Authorship Identification Using a Hybrid Deep Neural Network. Future Internet 2022, 14, 287. [Google Scholar] [CrossRef]
  181. Abuhamad, M.; Rhim, J.s.; AbuHmed, T.; Ullah, S.; Kang, S.; Nyang, D. Code authorship identification using convolutional neural networks. Future Gener. Comput. Syst. 2019, 95, 104–115. [Google Scholar] [CrossRef]
  182. Kurtukova, A.; Romanov, A.; Shelupanov, A. Source Code Authorship Identification Using Deep Neural Networks. Symmetry 2020, 12, 2044. [Google Scholar] [CrossRef]
  183. Omi, A.M.; Hossain, M.; Islam, M.N.; Mittra, T. Multiple Authors Identification from Source Code Using Deep Learning Model. In Proceedings of the 2021 International Conference on Electronics, Communications and Information Technology (ICECIT), Khulna, Bangladesh, 14–16 September 2021; pp. 1–4. [Google Scholar]
  184. Bogdanova, A.; Romanov, V. Explainable source code authorship attribution algorithm. J. Phys. Conf. Ser. 2021, 2134, 012011. [Google Scholar] [CrossRef]
  185. Lutellier, T.; Pang, L.; Pham, V.H.; Wei, M.; Tan, L. ENCORE: Ensemble learning using convolution neural machine translation for automatic program repair. arXiv 2019, arXiv:1906.08691. [Google Scholar]
  186. Liu, G.; Lu, Y.; Shi, K.; Chang, J.; Wei, X. Convolutional neural networks-based locating relevant buggy code files for bug reports affected by data imbalance. IEEE Access 2019, 7, 131304–131316. [Google Scholar] [CrossRef]
  187. Jiang, N.; Lutellier, T.; Tan, L. CURE: Code-Aware Neural Machine Translation for Automatic Program Repair. In ICSE ’21: Proceedings of the 43rd International Conference on Software Engineering; IEEE Press: New York, NY, USA, 2021; pp. 1161–1173. [Google Scholar] [CrossRef]
  188. Yuan, W.; Zhang, Q.; He, T.; Fang, C.; Hung, N.Q.V.; Hao, X.; Yin, H. CIRCLE: Continual repair across programming languages. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, Daejeon, Republic of Korea, 18–22 July 2022; pp. 678–690. [Google Scholar] [CrossRef]
  189. Charalambous, Y.; Tihanyi, N.; Jain, R.; Sun, Y.; Ferrag, M.A.; Cordeiro, L.C. A new era in software security: Towards self-healing software via large language models and formal verification. arXiv 2023, arXiv:2305.14752. [Google Scholar] [CrossRef]
  190. Ji, S.; Lee, S.; Lee, C.; Han, Y.; Im, H. Impact of Large Language Models of Code on Fault Localization. In Proceedings of the 2025 IEEE Conference on Software Testing, Verification and Validation (ICST), Naples, Italy, 31 March–4 April 2025; pp. 302–313. [Google Scholar] [CrossRef]
  191. Dişli, H.; Tosun, A. Code Clone Detection with Convolutional Neural Networks. Bilişim Teknol. Derg. 2020, 13, 1–12. [Google Scholar] [CrossRef]
  192. Zhang, A.; Liu, K.; Fang, L.; Liu, Q.; Yun, X.; Ji, S. Learn to align: A code alignment network for code clone detection. In Proceedings of the 2021 28th Asia-Pacific Software Engineering Conference (APSEC), Taipei, Taiwan, 6–9 December 2021; pp. 1–11. [Google Scholar]
  193. Zeng, J.; Ben, K.; Li, X.; Zhang, X. Fast code clone detection based on weighted recursive autoencoders. IEEE Access 2019, 7, 125062–125078. [Google Scholar] [CrossRef]
  194. Hua, W.; Sui, Y.; Wan, Y.; Liu, G.; Xu, G. FCCA: Hybrid Code Representation for Functional Clone Detection Using Attention Networks. IEEE Trans. Reliab. 2021, 70, 304–318. [Google Scholar] [CrossRef]
  195. Yahya, M.A.; Kim, D.K. CLCD-I: Cross-Language Clone Detection by Using Deep Learning with InferCode. Computers 2023, 12, 12. [Google Scholar] [CrossRef]
  196. Oda, Y.; Fudaba, H.; Neubig, G.; Hata, H.; Sakti, S.; Toda, T.; Nakamura, S. Learning to Generate Pseudo-Code from Source Code Using Statistical Machine Translation. In Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering, ASE 2015, Lincoln, NE, USA, 9–13 November 2015; Cohen, M.B., Grunske, L., Whalen, M., Eds.; IEEE Computer Society: New York, NY, USA, 2015; pp. 574–584. [Google Scholar] [CrossRef]
  197. Zhang, X.; Zhou, Y.; Yang, G.; Chen, T. Syntax-Aware Retrieval Augmented Code Generation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 1291–1302. [Google Scholar] [CrossRef]
  198. Yin, P.; Deng, B.; Chen, E.; Vasilescu, B.; Neubig, G. Learning to mine aligned code and natural language pairs from stack overflow. In Proceedings of the 15th International Conference on Mining Software Repositories (MSR), Gothenburg, Sweden, 28–29 May 2018; pp. 476–486. [Google Scholar]
  199. Weyssow, M.; Zhou, X.; Kim, K.; Lo, D.; Sahraoui, H. Exploring parameter-efficient fine-tuning techniques for code generation with large language models. ACM Trans. Softw. Eng. Methodol. 2025, 34, 1–25. [Google Scholar] [CrossRef]
  200. Zhou, Y.; Liu, S.; Siow, J.K.; Du, X.; Liu, Y. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8–14 December 2019, Vancouver, BC, Canada; Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2019; pp. 10197–10207. [Google Scholar]
  201. Wang, Z.; Zhou, S.; Fried, D.; Neubig, G. Execution-Based Evaluation for Open-Domain Code Generation. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 1271–1290. [Google Scholar] [CrossRef]
  202. Wang, L.; Zhang, A.; Wu, K.; Sun, K.; Li, Z.; Wu, H.; Zhang, M.; Wang, H. DuSQL: A large-scale and pragmatic Chinese text-to-SQL dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6923–6935. [Google Scholar]
  203. Li, R.; Fu, J.; Zhang, B.W.; Huang, T.; Sun, Z.; Lyu, C.; Liu, G.; Jin, Z.; Li, G. TACO: Topics in algorithmic code generation dataset. arXiv 2023, arXiv:2312.14852. [Google Scholar] [CrossRef]
  204. Jain, N.; Han, K.; Gu, A.; Li, W.D.; Yan, F.; Zhang, T.; Wang, S.; Solar-Lezama, A.; Sen, K.; Stoica, I. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. In Proceedings of the International Conference on Learning Representations (ICLR), Singapore, 24–28 April 2025. [Google Scholar]
  205. Daghighfarsoodeh, A.; Wang, C.Y.; Taherkhani, H.; Sepidband, M.; Abdollahi, M.; Hemmati, H.; Pham, H.V. Deep-Bench: Deep Learning Benchmark Dataset for Code Generation. arXiv 2025, arXiv:2502.18726. [Google Scholar] [CrossRef]
  206. Oda, Y.; Neubig, G.; Sakti, S.; Toda, T.; Nakamura, S. Syntax-based simultaneous translation through prediction of unseen syntactic constituents. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, 26–31 July 2015; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; pp. 198–207. [Google Scholar]
  207. Kocetkov, D.; Li, R.; Ben Allal, L.; Li, J.; Mou, C.; Muñoz Ferrandis, C.; Jernite, Y.; Mitchell, M.; Hughes, S.; Wolf, T.; et al. The Stack: 3 TB of permissively licensed source code. Trans. Mach. Learn. Res. 2023, accepted. Available online: https://arxiv.org/abs/2211.15533 (accessed on 9 October 2025).
  208. Lozhkov, A.; Li, R.; Allal, L.B.; Cassano, F.; Lamy-Poirier, J.; Tazi, N.; Tang, A.; Pykhtar, D.; Liu, J.; Wei, Y.; et al. StarCoder 2 and The Stack v2: The Next Generation. arXiv 2024, arXiv:2402.19173. [Google Scholar] [CrossRef]
  209. Rahman, M.; Khatoonabadi, S.; Shihab, E. A Large-scale Class-level Benchmark Dataset for Code Generation with LLMs. arXiv 2025, arXiv:2504.15564. [Google Scholar]
  210. Li, Y.; Choi, D.; Chung, J.; Kushman, N.; Schrittwieser, J.; Leblond, R.; Eccles, T.; Keeling, J.; Gimeno, F.; Lago, A.D.; et al. Competition-level code generation with AlphaCode. Science 2022, 378, 1092–1097. [Google Scholar] [CrossRef]
  211. Hugging Face. Hugging Face Datasets. 2025. Available online: https://huggingface.co/datasets (accessed on 9 October 2025).
  212. Bouzenia, I.; Devanbu, P.; Pradel, M. RepairAgent: An Autonomous, LLM-Based Agent for Program Repair. In Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE), Ottawa, ON, Canada, 27 April–3 May 2025; pp. 2188–2200. [Google Scholar] [CrossRef]
  213. Yin, X.; Ni, C.; Wang, S.; Li, Z.; Zeng, L.; Yang, X. ThinkRepair: Self-Directed Automated Program Repair. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2024, Vienna, Austria, 16–20 September 2024; Christakis, M., Pradel, M., Eds.; ACM: New York, NY, USA, 2024; pp. 1274–1286. [Google Scholar] [CrossRef]
  214. Li, F.; Jiang, J.; Sun, J.; Zhang, H. Evaluating the Generalizability of LLMs in Automated Program Repair. In Proceedings of the 47th IEEE/ACM International Conference on Software Engineering: New Ideas and Emerging Results, ICSE 2025-NIER, Ottawa, ON, Canada, 27 April–3 May 2025; IEEE: New York, NY, USA, 2025; pp. 91–95. [Google Scholar] [CrossRef]
  215. Nguyen, A.T.; Nguyen, T.T.; Nguyen, T.N. Divide-and-conquer approach for multi-phase statistical migration for source code. In ASE’15: Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering; IEEE Press: New York, NY, USA, 2015; pp. 585–596. [Google Scholar] [CrossRef]
  216. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar] [CrossRef]
  217. Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Stroudsburg, PA, USA, 2004; pp. 74–81. [Google Scholar]
  218. Eghbali, A.; Pradel, M. CrystalBLEU: Precisely and Efficiently Measuring the Similarity of Code. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ASE 2022, Rochester, MI, USA, 10–14 October 2022; ACM: New York, NY, USA, 2022; pp. 1–12. [Google Scholar] [CrossRef]
  219. Tran, N.M.; Tran, H.; Nguyen, S.; Nguyen, H.; Nguyen, T.N. Does BLEU score work for code migration? In Proceedings of the 27th International Conference on Program Comprehension, ICPC 2019, Montreal, QC, Canada, 25–31 May 2019; Guéhéneuc, Y., Khomh, F., Sarro, F., Eds.; IEEE/ACM: New York, NY, USA, 2019; pp. 165–176. [Google Scholar] [CrossRef]
  220. Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005, Ann Arbor, MI, USA, 29 June 2005; Goldstein, J., Lavie, A., Lin, C., Voss, C.R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2005; pp. 65–72. [Google Scholar]
  221. Haque, S.; Eberhart, Z.; Bansal, A.; McMillan, C. Semantic similarity metrics for evaluating source code summarization. In Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension, Virtual Event, 16–17 May 2022; pp. 36–47. [Google Scholar]
  222. Liu, Y.; Xu, C.; Zhou, Y.; Li, Z.; Xu, Q. DeepRTL: Bridging Verilog Understanding and Generation with a Unified Representation Model. In Proceedings of the Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, 24–28 April 2025. [Google Scholar]
  223. Zhou, S.; Alon, U.; Agarwal, S.; Neubig, G. CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 13921–13937. [Google Scholar] [CrossRef]
  224. Husain, H.; Wu, H.H.; Gazit, T.; Allamanis, M.; Brockschmidt, M. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. arXiv 2020, arXiv:1909.09436. [Google Scholar]
  225. Makharev, V.; Ivanov, V. Code Summarization Beyond Function Level. In Proceedings of the IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code), Ottawa, ON, Canada, 3 May 2025; pp. 153–160. [Google Scholar] [CrossRef]
  226. Agashe, R.; Iyer, S.; Zettlemoyer, L. JuICe: A Large Scale Distantly Supervised Dataset for Open Domain Context-based Code Generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 5436–5446. [Google Scholar] [CrossRef]
  227. Abid, S.; Cai, X.; Jiang, L. Measuring model alignment for code clone detection using causal interpretation. Empir. Softw. Eng. 2025, 30, 46. [Google Scholar] [CrossRef]
  228. Ding, Y.; Fu, Y.; Ibrahim, O.; Sitawarin, C.; Chen, X.; Alomair, B.; Wagner, D.; Ray, B.; Chen, Y. Vulnerability Detection with Code Language Models: How Far are We? In Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE), Ottawa, ON, Canada, 27 April–3 May 2025; pp. 1729–1741. [Google Scholar] [CrossRef]
  229. Kheria, I.; Gada, D.; Karani, R. A Semisupervised Learning Approach for Code Smell Detection. SN Comput. Sci. 2025, 6, 143. [Google Scholar] [CrossRef]
  230. He, X.; Asiya; Han, D.; Zhou, S.; Fu, X.; Li, H. An Improved Software Source Code Vulnerability Detection Method: Combination of Multi-Feature Screening and Integrated Sampling Model. Sensors 2025, 25, 1816. [Google Scholar] [CrossRef] [PubMed]
  231. Kumar, H.; Saxena, V. Software Defect Prediction Using Hybrid Machine Learning Techniques: A Comparative Study. J. Softw. Eng. Appl. 2024, 17, 155–171. [Google Scholar] [CrossRef]
  232. Rodriguez, M.; Popa, R.A.; Flynn, F.; Liang, L.; Dafoe, A.; Wang, A. A Framework for Evaluating Emerging Cyberattack Capabilities of AI. arXiv 2025, arXiv:2503.11917. [Google Scholar]
  233. Atiiq, S.A.; Gehrmann, C.; Dahlén, K.; Khalil, K. From Generalist to Specialist: Exploring CWE-Specific Vulnerability Detection. arXiv 2024, arXiv:2408.02329. [Google Scholar] [CrossRef]
  234. Wen, J.; Yuan, D.; Ma, L.; Chen, H. Code Ownership in Open-Source AI Software Security. In Proceedings of the 2nd International Workshop on Responsible AI Engineering, Lisbon, Portugal, 16 April 2024; pp. 28–35. [Google Scholar] [CrossRef]
  235. Yang, F.; Wang, Y. Analyzing the Robustness of Complex Networks with Attack Success Rate. Entropy 2023, 25, 1508. [Google Scholar] [CrossRef]
  236. Wang, Y.; Wang, Y.; Wang, S.; Guo, D.; Chen, J.; Grundy, J.; Liu, X.; Ma, Y.; Mao, M.; Zhang, H.; et al. RepoTransBench: A Real-World Benchmark for Repository-Level Code Translation. arXiv 2024, arXiv:2412.17744. [Google Scholar]
  237. Xie, Y.; Naik, A.; Fried, D.; Rosé, C.P. Data Augmentation for Code Translation with Comparable Corpora and Multiple References. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 13725–13739. [Google Scholar] [CrossRef]
  238. Anjum Haque, M.M.; Ahmad, W.U.; Lourentzou, I.; Brown, C. FixEval: Execution-based Evaluation of Program Fixes for Programming Problems. In Proceedings of the IEEE/ACM International Workshop on Automated Program Repair (APR), Melbourne, Australia, 16 May 2023; pp. 11–18. [Google Scholar] [CrossRef]
  239. Yan, W.; Tian, Y.; Li, Y.; Chen, Q.; Wang, W. CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation. In Findings of the Association for Computational Linguistics: EMNLP 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 5067–5089. [Google Scholar] [CrossRef]
  240. Ou, G.; Liu, M.; Chen, Y.; Du, X.; Wang, S.; Zhang, Z.; Peng, X.; Zheng, Z. Enhancing LLM-based Code Translation in Repository Context via Triple Knowledge-Augmented. arXiv 2025, arXiv:2503.18305. [Google Scholar] [CrossRef]
  241. Dai, D.; Liu, M.; Li, A.; Cao, J.; Wang, Y.; Wang, C.; Peng, X.; Zheng, Z. FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks. arXiv 2025, arXiv:2504.06939. [Google Scholar]
  242. Li, J.; Li, G.; Zhang, X.; Zhao, Y.; Dong, Y.; Jin, Z.; Li, B.; Huang, F.; Li, Y. EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations. In Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, 10–15 December 2024; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J.M., Zhang, C., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2024. [Google Scholar]
  243. Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H.P.d.O.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating large language models trained on code. arXiv 2021, arXiv:2107.03374. [Google Scholar] [CrossRef]
  244. Sridhara, G.; Pollock, L.L.; Vijay-Shanker, K. Automatically detecting and describing high level actions within methods. In Proceedings of the 33rd International Conference on Software Engineering, ICSE 2011, Honolulu, HI, USA, 21–28 May 2011; Taylor, R.N., Gall, H.C., Medvidovic, N., Eds.; ACM: New York, NY, USA, 2011; pp. 101–110. [Google Scholar] [CrossRef]
  245. Liu, H.; Gegov, A.; Cocea, M. Rule-based systems: A granular computing perspective. Granul. Comput. 2016, 1, 259–274. [Google Scholar] [CrossRef]
  246. Sridhara, G.; Hill, E.; Muppaneni, D.; Pollock, L.L.; Vijay-Shanker, K. Towards automatically generating summary comments for Java methods. In Proceedings of the ASE 2010, 25th IEEE/ACM International Conference on Automated Software Engineering, Antwerp, Belgium, 20–24 September 2010; Pecheur, C., Andrews, J., Nitto, E.D., Eds.; ACM: New York, NY, USA, 2010; pp. 43–52. [Google Scholar] [CrossRef]
  247. Luo, Z.; Xu, C.; Zhao, P.; Sun, Q.; Geng, X.; Hu, W.; Tao, C.; Ma, J.; Lin, Q.; Jiang, D. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  248. Du, X.; Liu, M.; Wang, K.; Wang, H.; Liu, J.; Chen, Y.; Feng, J.; Sha, C.; Peng, X.; Lou, Y. Evaluating Large Language Models in Class-Level Code Generation. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal, 14–20 April 2024; ACM: New York, NY, USA, 2024; pp. 1–13. [Google Scholar] [CrossRef]
  249. Li, J.; Li, G.; Zhao, Y.; Li, Y.; Liu, H.; Zhu, H.; Wang, L.; Liu, K.; Fang, Z.; Wang, L.; et al. DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories. In Proceedings of the Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 3603–3614. [Google Scholar] [CrossRef]
  250. Dong, Y.; Jiang, X.; Jin, Z.; Li, G. Self-collaboration Code Generation via ChatGPT. ACM Trans. Softw. Eng. Methodol. 2024, 33, 1–38. [Google Scholar] [CrossRef]
  251. Khoury, R.; Avila, A.R.; Brunelle, J.; Camara, B.M. How Secure is Code Generated by ChatGPT? In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, HI, USA, 1–4 October 2023. [Google Scholar]
  252. GitHub. GitHub Copilot is Generally Available to All Developers. 2025. Available online: https://github.blog/news-insights/product-news/github-copilot-is-generally-available-to-all-developers/ (accessed on 9 October 2025).
  253. Zhu, M.; Jain, A.; Suresh, K.; Ravindran, R.; Tipirneni, S.; Reddy, C.K. XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence. arXiv 2022, arXiv:2206.08474. [Google Scholar]
  254. Liu, C.; Wan, X. CodeQA: A question answering dataset for source code comprehension. In Findings of the Association for Computational Linguistics: EMNLP 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021. [Google Scholar]
  255. Hasan, M.; Muttaqueen, T.; Ishtiaq, A.A.; Mehrab, K.S.; Haque, M.; Anjum, M.; Hasan, T.; Ahmad, W.U.; Iqbal, A.; Shahriyar, R. CoDesc: A Large Code-Description Parallel Dataset. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021. [Google Scholar]
  256. Karanjai, R.; Xu, L.; Shi, W. SolMover: Smart Contract Code Translation Based on Concepts. In Proceedings of the 1st ACM International Conference on AI-Powered Software, Porto de Galinhas, Brazil, 15–16 July 2024; pp. 112–121. [Google Scholar] [CrossRef]
  257. Dong, Y.; Ding, J.; Jiang, X.; Li, G.; Li, Z.; Jin, Z. CodeScore: Evaluating Code Generation by Learning Code Execution. ACM Trans. Softw. Eng. Methodol. 2025, 34, 1–22. [Google Scholar] [CrossRef]
  258. Weisz, J.D.; Muller, M.; Houde, S.; Richards, J.; Ross, S.I.; Martinez, F.; Agarwal, M.; Talamadupula, K. Perfection Not Required? Human-AI Partnerships in Code Translation. In Proceedings of the 26th International Conference on Intelligent User Interfaces, College Station, TX, USA, 13–17 April 2021; pp. 402–412. [Google Scholar] [CrossRef]
  259. Weisz, J.D.; Muller, M.; Ross, S.I.; Martinez, F.; Houde, S.; Agarwal, M.; Talamadupula, K.; Richards, J.T. Better Together? An Evaluation of AI-Supported Code Translation. In Proceedings of the 27th International Conference on Intelligent User Interfaces, Helsinki, Finland, 22–25 March 2022; pp. 369–391. [Google Scholar] [CrossRef]
  260. Liu, J.; Zhu, Y.; Xiao, K.; Fu, Q.; Han, X.; Wei, Y.; Ye, D. RLTF: Reinforcement Learning from Unit Test Feedback. Trans. Mach. Learn. Res. 2023, accepted. Available online: https://openreview.net/forum?id=hjYmsV6nXZ (accessed on 9 October 2025).
Figure 1. Advancements of neural networks.
Figure 2. Paper selection process.
Figure 3. A yearly highlight summary of publications on neural methods for code considered in this survey.
Figure 4. Keyword co-occurrence network of reviewed publications.
Figure 5. Types of source code classification.
Figure 6. Utilization stages of neural methods in programming and SE tasks.
Table 1. Summary of previous survey works’ methodologies, strengths, weaknesses, and robustness.
Survey Paper | Methodology | Advantages | Disadvantages | Robustness
Allamanis et al. (2018) [16] | Narrative survey classifying ML models for ‘Big Code’ and software naturalness using linguistic analogies. | Offers a clear taxonomy of ML models; shows foundational framing about conventional and idiomatic naturalness of code. | Lacks standardized benchmarks; some models (e.g., Transformers) were emerging and not fully covered. | Foundational and widely cited, but predates large-scale pre-trained models.
Xiaomeng et al. (2018) [23] | Narrative review on ML in secure code review and vulnerability detection. | Good historical context; contrasts DL with earlier feature-based methods. | Outdated; lacks attention to recent graph neural network (GNN) and Transformer models. | Limited coverage and moderately outdated, but contextually informative.
Akimova et al. (2021) [24] | Review of DL-based defect prediction papers; categorized models, datasets, and metrics, and investigated trends. | Comprehensive coverage of DL techniques, including data-labeling and generalization issues. | Narrow focus on certain PLs; limited coverage of Transformer models. | Moderately useful for defect prediction, limited for closely related SE tasks.
Yang et al. (2022) [14] | Systematic literature review of DL-in-SE papers; organized by DL model architectures, tasks, datasets, and metrics. | Broad task coverage; excellent model categorization; metrics and dataset taxonomy included. | Reproducibility issues; underexplored cross-project transfer learning; lacks deep industrial insights; no coverage of LLMs. | Comprehensive and structured, though it predates the emergence of neural-based LLMs.
Zhang et al. (2022) [12] | Structured survey of automatic source code summarization; classifies the task into modeling, generation, and evaluation phases. | Balanced treatment of classic information retrieval and DL methods; covers datasets, metrics, and emerging models. | Limited to code-comment pairs; no critical analysis of weaknesses in metrics such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Bilingual Evaluation Understudy (BLEU). | Good for code summarization; well-scoped and methodical, though limited cross-language coverage.
Samoaa et al. (2022) [17] | Systematic mapping study on code representations across SE tasks. | Maps token, tree, and graph modalities; clear representation analysis. | Lacks model-specific analysis; omits modern pre-trained models. | Moderately strong on representation types, less on effectiveness.
Xu and Zhu (2022) [51] | Review of pre-trained language models in code intelligence. | Covers pre-training and fine-tuning objectives, model architectures, preprocessing, and code structure extraction pipelines in pre-trained language models. | Minimal coverage of non-Transformer and structure-based code intelligence models. | Strong overview of pre-trained language models and Transformer-based models; weaker on fine-grained discussion of programming tasks.
Le et al. (2022) [28] | Investigated data-driven vulnerability assessment by examining ML, DL, and NLP techniques. | Practical view on the Common Vulnerability Scoring System’s alignment with the reviewed themes; real-world focus. | Lacks in-depth insight into vulnerability-mitigation challenges, e.g., patch modeling. | Comprehensive taxonomy of data-driven vulnerability assessment across five themes.
Al-Hossami and Shaikh (2022) [27] | Broad review of AI in source code and conversational systems. | Unique angle on dialogue-based AI; emphasis on educational applications. | Limited depth on models and benchmarks; little technical coverage of AI for programming. | Moderately exploratory, with broad coverage of AI applications in education but less technical depth with respect to PLs.
Amalfitano et al. (2023) [18] | Tertiary study synthesizing secondary surveys on AI for software testing. | Provides a holistic, high-level view of AI contributions to software testing; maps AI subdomains to testing tasks. | Lacks depth on datasets and models; omits emerging techniques, and several aspects are not yet widely reviewed. | Moderately comprehensive in coverage, but breadth over depth reduces specificity.
Fontes and Gay (2023) [15] | Systematic mapping of studies applying ML to automated test generation; categorized by ML type, task, and metrics. | Captures detailed discussion of test and oracle generation; explores diverse ML methods. | Limited discussion of evaluation quality; limited coverage of neural networks and none of LLMs; replication gaps. | Strong in breadth but limited by inconsistent evaluation standards.
Xie et al. (2023) [25] | Review of DL-based code search systems, focusing on encoding queries and source code and measuring their embedding similarity. | Highlights the concepts of deep code search, benchmarking, model limitations, and practical deployment issues. | Retrieval-centric; lacks broader coverage of SE use cases aligned with programmers’ workflows. | Robust in evaluating code search methods, with coverage of popular datasets and benchmarks, but does not address LLMs.
Zhang et al. (2023) [22] | Review of automatic program repair (APR) structured by pipeline, dataset, and evaluation. | Clear taxonomy of APR approaches; focus on real bug benchmarks. | Java-centric, and the evaluation depends heavily on test-suite plausibility. | Highly detailed and accurate for APR pipelines.
Zakeri-Nasrabadi et al. (2023) [26] | Systematic literature review of code clone studies with method and dataset taxonomy. | Broad code similarity and clone detection taxonomy; tool and data gaps clearly identified. | Limited coverage of pre-trained language models and no coverage of LLMs. | Excellent for classical clone detection methods.
Uddin et al. (2025) [13] | Analysis of DL for vulnerability detection; introduces a life cycle from data construction to model deployment. | Life-cycle modeling gives a structured perspective; highlights real-world gaps and practical considerations. | Narrower scope, and the analysis is biased toward C/C++ datasets. | Robust within the vulnerability detection domain, but less generalizable.
Table 2. High-level taxonomy of previous survey papers in the field.
Area of Focus | Examples
Task-specific surveys | Code Summarization [12], Comment Generation [53], Code Generation [54], Code Translation [55], Code Clone Detection [56], Code Vulnerability Detection [13], Program Repair [22], etc.
Surveys focused on subsets of neural methods | Reviews devoted solely to DL [14,57], LLMs [58], or Code LLMs [47].
Surveys focused on the phases of the Software Development Life Cycle, with limited emphasis on the coding phase | Software Development Life Cycle phases [19,20] such as Requirement Engineering [52], System Design, and Project Management.
Surveys focused on traditional AI/ML methods with minimal focus on neural methods | Surveys concentrating on traditional AI/ML methods [18,23] such as Random Forest, Decision Trees, XGBoost, Naive Bayes, Logistic Regression, and SVM.
Table 3. Distribution of collected papers by source.
S.No | Publication Source | Number of Publications
1 | ACL, NeurIPS, ICLR | 62
2 | ACM | 53
3 | IEEE | 49
4 | IEEE/ACM (Jointly) | 35
5 | arXiv | 32
6 | Elsevier, MDPI, Springer | 29
7 | Others | 15
Total | 275
Note. IEEE/ACM (Jointly) in the above table indicates papers published in conferences or journals that are co-sponsored or jointly organized by both IEEE and ACM.
Table 4. Comparative perspective on neural models for natural and programming languages.
Aspect | NL Models (e.g., BERT, T5) | PL Models (e.g., CodeBERT and CodeT5)
Representative tasks | Question answering, sentiment analysis, abstractive summarization, cross-lingual transfer [65,70,80]. | Code search, code summarization, defect prediction, code translation, code completion [8,69,71,75].
Training data and structural bias | Large-scale corpora such as C4 and multilingual web crawls; sequence-focused tokenization with limited structural priors [70,72,80]. | Curated repositories (CodeSearchNet, CodeXGLUE) with AST/DFG-aware tokenization and structure-enhanced encoders [69,71,75].
Common metrics | Accuracy, F1, BLEU, ROUGE, cross-lingual transfer scores [65,70,80]. | CodeBLEU, BLEU, Mean Reciprocal Rank (MRR), Precision@k, and exact match for code translation [8,71,75].
Empirical comparisons | Fine-tuned BERT variants serve as baselines for syntax-agnostic NLP tasks; performance drops observed on code understanding benchmarks. | CodeBERT improves over BERT on code search; CodeT5 outperforms T5 on summarization and generation tasks tailored to code [71,75].
Observed limitations | Difficulty modeling rigid syntax and long-range dependencies in source code; limited grounding in program semantics. | Sensitive to language-specific idioms and repository noise; struggles with cross-language generalization and semantic correctness without execution feedback [69,78].
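To make the retrieval metrics listed in Table 4 concrete, the short Python sketch below computes MRR and Precision@k from ranked candidate lists, following the common code search convention of one relevant snippet per query; the helper names and toy data are illustrative assumptions rather than part of any benchmark's official tooling.

```python
from typing import Sequence

def mean_reciprocal_rank(rankings: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """MRR: average of 1/rank of the first relevant item per query (0 if it never appears)."""
    total = 0.0
    for candidates, answer in zip(rankings, gold):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == answer:
                total += 1.0 / rank
                break
    return total / len(gold)

def precision_at_k(rankings: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of queries whose single relevant item appears in the top k results."""
    hits = sum(1 for candidates, answer in zip(rankings, gold) if answer in candidates[:k])
    return hits / len(gold)

if __name__ == "__main__":
    # Two toy code-search queries; each list is a retriever's ranking of snippet IDs.
    rankings = [["s3", "s1", "s7"], ["s9", "s2", "s4"]]
    gold = ["s1", "s4"]                              # the correct snippet per query
    print(mean_reciprocal_rank(rankings, gold))      # (1/2 + 1/3) / 2 ≈ 0.4167
    print(precision_at_k(rankings, gold, k=2))       # only the first query hits in the top 2 -> 0.5
```

Under this single-relevant-snippet convention, Precision@k coincides with the success rate of retrieving the correct snippet within the top k results.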
Table 6. Comparison of major neural method families applied to programming and SE tasks.
Group Name | Advantages | Disadvantages | Common Application Areas
Sequence-based (RNN, LSTM, GRU) | Captures token-level dependencies; effective for small datasets; easy to fine-tune. | Limited context retention; sequential processing slows inference. | Comment generation, code completion, bug-fixing.
Transformer-based (CodeBERT, CodeT5, CodeT5+, PLBART, GPT-family) | Captures long-range context; supports parallel training; high generalization across PLs. | High computational cost; prone to hallucination; sensitive to noisy data. | Code translation, generation, summarization, synthesis.
Graph-based (GGNN, RGCN, AST-GNN) | Encodes structural and semantic information explicitly; interpretable representations. | Task-specific tuning required; limited scalability on large codebases. | Code analysis, clone detection, code search and retrieval.
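As a concrete companion to the graph-based family in Table 6, the following minimal sketch (assuming Python as the analyzed language) uses the standard ast module to convert a snippet into node labels and parent-child edges, the kind of structure that AST-oriented GNN encoders typically consume; the helper name ast_to_edges and the toy snippet are illustrative and not taken from any cited model.

```python
import ast

def ast_to_edges(source: str):
    """Parse Python source and return (nodes, edges): node labels are AST class names,
    and each edge is a (parent_index, child_index) pair over those labels."""
    tree = ast.parse(source)
    nodes, edges, index = [], [], {}
    for node in ast.walk(tree):              # visit every AST node once (traversal order is not important here)
        index[id(node)] = len(nodes)
        nodes.append(type(node).__name__)
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            edges.append((index[id(node)], index[id(child)]))
    return nodes, edges

if __name__ == "__main__":
    code = "def add(a, b):\n    return a + b\n"
    nodes, edges = ast_to_edges(code)
    print(nodes)   # e.g., ['Module', 'FunctionDef', 'arguments', ...]
    print(edges)   # parent-child index pairs usable as a graph adjacency list
```

Real systems typically enrich this skeleton with additional edge types (for example, data-flow or next-token edges) before passing the graph to a neural encoder.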
Table 7. An overview of datasets curated for programming and software engineering tasks, utilized by neural models in code intelligence research.
Dataset (Year) | Task(s) | Trained/Evaluated Models | Metrics
Defects4J (2014) | Bug detection and repair (Java) | GenProg, Nopol, jKali, ASTOR, RepairAgent [212], ThinkRepair [213] | Repair success rate, Pass@k
BigCloneBench (2014) | Code clone detection | GNN, Code2Vec, CodeBERT, GraphCodeBERT | Precision, Recall, F1
Django (2015) | NL-to-code generation, pseudo-code generation | BART, CodeT5, kNN-TRANX | BLEU, CodeBLEU, EM
QuixBugs (2017) | Program repair (buggy algorithms) | GenProg, PolyCoder, Codex | Ranking test-passing patches, BLEU, Pass@k
PCSD (2017) | Code summarization, comment generation | Seq2Seq, CodeBERT, DECOM | BLEU, ROUGE, METEOR
CoNaLa (2018) | Code generation | CodeLLaMA, CodeGen, CodeGen2, CodeT5 | BLEU, CodeBLEU, EM
Concode (2018) | Code generation | Encoder-decoder, Bi-LSTM | BLEU, EM
CodeSearchNet (2019) | Code search and retrieval | Bag-of-words, CNN, RNN, CodeBERT, GraphCodeBERT | MRR, BLEU, ROUGE, Normalized Discounted Cumulative Gain at rank k (NDCG@k)
Devign (2019) | Vulnerability detection (C/C++) | GGNN, CNN, LSTM | Accuracy, F1
JuICe (2019) | Contextual code generation, API call | Transformer, LSTM | EM, BLEU, Accuracy, Precision, Recall
Bears (2019) | APR and bug fixing | MUFIN, etc. | Test suite correctness
TransCoder (2020/2022/2023) | Code translation (Java, C++, Python, Go, Rust) | TransCoder, TransCoder-ST, TransCoder-IR, rule-based baselines, GPT-3.5, LLaMA, CodeGen | BLEU, EM, Computational Accuracy (CA), Pass Rate
CodeXGLUE (2021) | Defect detection, code translation, clone detection, and more | CodeBERT, GraphCodeBERT, CodeGPT, Encoder–Decoder | BLEU, ROUGE, Accuracy, F1, MRR
MBPP (2021) | Python code generation | GPT-3, WizardCoder, etc. | Pass@k, BLEU
HumanEval (2021) | Python program synthesis | Codex, GPT-3, GPT-Neo, etc. | Pass@k
APPS (2021) | Code generation | GPT-3, GPT-Neo, GPT-J, Codex | Pass@k
CodeNet (2021) | Code classification, code similarity, code translation | Multilayer Perceptron, Graph Convolutional Networks (GCNs), GNN, PLBART | Accuracy, BLEU, Runtime metrics
XLCoST (2022) | Code translation, code summarization, code search, and more | CodeBERT, PLBART, CodeT5, RoBERTa, and more | BLEU, CodeBLEU, MRR
CodeContests (2022) | Code generation | AlphaCode, GPT-4o mini, Qwen2.5-Coder-7B-Instruct, and more | Test-case success rate, Pass@k
SecurityEval (2022) | Vulnerability detection (Python) | GPT-4o, InCoder, GitHub Copilot, AutoSafeCoder | Vulnerability occurrence rate, Pass@k
DiverseVul (2023) | Vulnerability detection (C/C++) | CodeBERT, CodeGPT, PolyCoder, CodeT5, NatGen, and more | Accuracy, Precision, Recall, F1, False Positive Rate
CodeTransOcean (2023) | Multilingual code translation | GPT-4, CodeT5+, Claude-3.5 | EM, BLEU, CodeBLEU, Debugging Success Rate (DSR@k), Pass@k
DS-1000 (2023) | Data science code generation | CodeGemma, DeepSeek-Coder, CodeLLaMA, InCoder, WizardCoder, and more | Pass@k, test cases + surface-form constraints
ClassEval (2023) | Class-level code generation | GPT-4, GPT-3.5, SantaCoder, WizardCoder, Instruct-CodeGen, CodeGeeX, and more | Pass@k
MCoNaLa (2023) | Multi-NL code generation and summarization on Python | gpt-3.5-turbo, mT5, CodeT5, mBART, PLBART | EM, BLEU, CodeBLEU
PrimeVul (2024) | Vulnerability detection (C/C++) | GPT-3.5, GPT-4, CodeT5, UniXCoder, StarCoder2, and more | F1, Precision, Recall
EvoCodeBench (2024) | Code generation (Python) | GPT-4, GPT-3.5, StarCoder2, CodeLLaMA-7B, DeepSeekCoder-7B, StarCoder2-7B, and more | Pass@k, Recall@k, Debugging Success Improvement (DSI)
Statement-level Code Summ. (2024) | Statement-level summarization | GPT-4, GPT-3.5, CodeLLaMA, StarChat | BLEU, SBCS (SentenceBERT + Cosine Similarity), SBED (SentenceBERT + Euclidean Distance), Human Eval
Python ClassGen (2025) | Class-level Python code generation | GPT-4 | ROUGE-L, BLEU, Tree Similarity of Edit Distance (TSED)
DEFECTS4J-TRANS [Transformed Defects4J] (2025) [214] | APR | Magicoder-S-DS, WaveCoder-Ultra, CodeQwen1.5, DeepSeek-Coder-Instruct, and more | Plausibility and correctness of patches (plausible patch, correct patch)
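Many benchmarks in Table 7 report Pass@k, which Chen et al. [243] estimate without bias by drawing n ≥ k samples per problem, counting the c samples that pass all unit tests, and averaging 1 − C(n−c, k)/C(n, k) over problems. The sketch below is a minimal Python re-implementation of that estimator; the function names and toy counts are illustrative.

```python
from math import comb
from statistics import mean
from typing import Sequence

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one problem: n samples generated, c of them passing all tests."""
    if n - c < k:          # fewer than k failing samples -> every k-subset contains a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(num_samples: Sequence[int], num_correct: Sequence[int], k: int) -> float:
    """Average the per-problem estimates, as reported for HumanEval-style benchmarks."""
    return mean(pass_at_k(n, c, k) for n, c in zip(num_samples, num_correct))

if __name__ == "__main__":
    # Three toy problems, 20 samples each; 0, 3, and 20 samples pass their unit tests.
    print(benchmark_pass_at_k([20, 20, 20], [0, 3, 20], k=1))   # ≈ (0 + 0.15 + 1.0) / 3 = 0.3833
```

The combinatorial form avoids the bias of the naive 1 − (1 − c/n)^k estimate discussed in [243].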
Table 8. Neural methods vs. their counterparts (rule-based and statistical methods).
Method | Strengths | Limitations
Rule-based Methods | Effective for creating precise rules for corpora with complex code patterns. | Hard to maintain and scale. Requires extensive manual effort.
Statistical Methods | Offer more diverse algorithmic implementations than rule-based methods. | Human intervention is needed for better accuracy. SMT methods do not learn from previous experiences.
Neural Methods | Learn patterns from past data, delivering fast and accurate results. | Performance drops with limited training data, and significant computational resources are required.
Note: Eighteen studies were examined for this comparison.