Systematic Review

Foundation Models in Software Engineering: A Taxonomy, Systematic Review, and In-Depth Analysis of Testing Support

1 Department of Computer Science and Engineering, American University of Sharjah, Sharjah 26666, United Arab Emirates
2 Department of Information Technology, Yarmouk University, Irbid 21110, Jordan
3 Computer Engineering Department, Al Yamamah University, Riyadh 11512, Saudi Arabia
* Author to whom correspondence should be addressed.
Information 2026, 17(1), 73; https://doi.org/10.3390/info17010073
Submission received: 5 December 2025 / Revised: 29 December 2025 / Accepted: 4 January 2026 / Published: 12 January 2026
(This article belongs to the Special Issue Surveys in Information Systems and Applications)

Abstract

Foundation models are increasingly influencing software engineering research and practice, yet their adoption across the software development life cycle remains uneven and insufficiently characterized. This paper presents a systematic review of 224 recent studies investigating the application of foundation models to software engineering tasks. We introduce a two-dimensional taxonomy that systematically links software engineering life cycle phases with the foundation model capabilities employed, offering a unified view of current research practices. Our analysis reveals that existing work is heavily concentrated on implementation and testing activities, while earlier phases such as requirements engineering and architectural design, and process-oriented tasks, receive comparatively limited attention. Focusing on testing and quality assurance, we synthesize evidence across eight task categories, highlighting both demonstrated benefits and recurring challenges. This review is limited to peer-reviewed studies published between 2023 and 2025 and does not introduce new empirical models, focusing instead on synthesizing existing evidence. Overall, this review clarifies the current landscape of foundation model usage in software engineering and outlines actionable directions for future research and responsible adoption.

1. Introduction

Foundation models, large pretrained models trained on vast and diverse datasets, are transforming software engineering (SE) tasks [1,2,3]. Examples such as Codex [4], GPT-4, StarCoder, and CodeT5+ show versatile capabilities across the software development lifecycle (SDLC) [5,6,7,8,9,10,11], including code generation, summarization, translation, and debugging. Their flexibility across programming languages and domains has led to rapid adoption in tools such as GitHub Copilot and other AI-powered assistants, reshaping how developers work and improving development efficiency. Yet challenges persist: models can produce plausible but incorrect (“hallucinated”) code [6], and licensing for outputs and training data remains unresolved [8,12]. Beyond post-hoc testing, white-box/open-box correctness assessors predict code quality directly from LLM internal states, enabling early filtering of faulty generations [13]. Recent qualitative evidence also maps the socio-technical opportunities and barriers of generative AI (GenAI) adoption in software engineering teams, including concerns around trust, code ownership, and workflow integration, and provides practical guidance for responsible adoption in real development settings [14]. Complementing these observations, a recent empirical study compares ChatGPT-3.5 and LLaMA-2 on Stack Overflow questions and also documents a sustained post-ChatGPT decline in posting, answering, and commenting activity—initially uneven across domains but becoming broadly consistent within a year [15]. For test-driven development (TDD), LLMs can translate method descriptions into executable tests: a recent study reported that a fine-tuned GPT-3.5 model with optimized prompting achieved 78.5% syntactic correctness, 67.09% requirement alignment, 61.7% coverage, and 18.9% mutation score, substantially outperforming baselines [16].
Here, syntactic correctness measures whether the generated code compiles without errors, requirement alignment assesses semantic consistency with the reference specification, coverage denotes the proportion of relevant requirements or test cases exercised by the generated code, and mutation score quantifies how many injected faults are detected by the generated tests, serving as an indicator of test adequacy. Ablation studies showed that fine-tuning led to a 223% relative improvement in syntactic correctness and that prompt design contributed a further 124% gain, confirming that both factors are critical [17]. Complementary public-sentiment evidence from an analysis of about 90k Twitter/X posts on code generation tools highlights productivity gains alongside concerns over IP, transparency, and bias, emphasizing trust and accountability as preconditions for adoption [18].
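To make the mutation-score metric concrete, the following minimal Python sketch computes it for a toy function. The function under test, the mutants, and the test suite are invented for illustration and are not artifacts from the cited study.

```python
# Hypothetical sketch: mutation score of a generated test suite.
# A mutant is the original function with one small injected fault;
# the score is the fraction of mutants "killed" (detected) by the tests.

def original(a, b):
    return a + b

# Each mutant injects one fault into the original function.
mutants = [
    lambda a, b: a - b,       # operator replaced: + becomes -
    lambda a, b: a + b + 1,   # off-by-one constant
    lambda a, b: a + b,       # equivalent mutant: behaves like the original
]

# A generated "test suite": each test returns True iff the implementation passes.
tests = [
    lambda f: f(2, 3) == 5,
    lambda f: f(0, 0) == 0,
]

def mutation_score(mutants, tests):
    # A mutant is killed when at least one test fails against it.
    killed = sum(1 for m in mutants if not all(t(m) for t in tests))
    return killed / len(mutants)

print(f"mutation score = {mutation_score(mutants, tests):.2f}")  # 2 of 3 mutants killed
```

In this toy example the equivalent mutant survives, which is why mutation scores in practice (such as the 18.9% reported above) rarely approach 100% even for strong test suites.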
The impact of foundation models is visible across all SE phases [19,20,21,22]. In requirements engineering, LLMs have been used not only for extraction and formalization, but also for generating and validating Software Requirements Specifications (SRSs), demonstrating quality comparable to entry-level engineers and assisting in end-to-end requirement-to-code pipelines [21,23,24,25,26,27,28,29,30,31]. During design, prompt-based techniques help developers make clearer and more consistent architectural choices supported by reasoning [3,32,33,34]. In the implementation stage, research shows that these models can generate and improve code. Models such as GPT-4 produce syntactically correct refactorings, though their success in preserving meaning differs between programming languages [35,36,37,38,39]. Variational Prefix Tuning (VPT) combines a conditional VAE with prefix-tuning to produce diverse yet accurate code summaries, improving standard summarization metrics over strong baselines [40]. In maintenance, foundation models have been applied to bug detection and automated program repair, with performance influenced by prompt design, defect type, and codebase context [36,41,42,43,44,45,46,47,48,49,50,51,52]. They are also used to classify technical debt and its specific types from issue trackers across diverse projects [53]. In software testing, large-scale evaluations reveal that LLM-generated test cases can achieve competitive coverage, but often require post-generation filtering or human refinement to ensure correctness and completeness [54,55,56,57,58,59,60,61,62]. Human-in-the-loop evaluations of LLM-based code translation (SQL dialects → PySpark) report high precision in most scenarios and sub-2-min responses, underscoring the role of prompt design and iterative refinement [63]. Despite these advances, persistent concerns remain regarding correctness, explainability, reproducibility, and alignment with project-specific or domain-specific constraints.
Prior surveys have examined the role of large language models (LLMs) and foundation models in software engineering [19,20,21,64,65,66,67]. Among these, Hou et al. [19] reviewed 395 studies published between 2017 and 2024, analyzing model architectures, dataset sources, preprocessing steps, optimization strategies, and evaluation methods. Their taxonomy mapped these models to six software engineering activities: requirements, design, development, quality assurance, maintenance, and management. However, the framework remains one-dimensional, organizing studies only by development phase or model type, without linking model capabilities to specific lifecycle stages or explaining how those capabilities transfer across them.
Wang et al. [20] focused specifically on software testing, reviewing 102 studies related to test generation, program repair, and quality assurance. While their review offers a detailed view of this phase, it does not link testing practices to other stages of the software lifecycle or to the broader capabilities of foundation models. Our study builds on and extends these efforts by proposing a two-dimensional taxonomy that connects software engineering lifecycle phases with foundation model capabilities. This framework supports reasoning in both directions, showing how certain capabilities influence multiple phases and how each phase affects their performance. It moves beyond descriptive mapping to offer a structured approach that highlights unexplored connections, encourages hypothesis-driven research, and provides a consistent analytic lens for future FM-in-SE studies.
This review aims to answer the following research questions (RQs):
  • RQ1: How are foundation models being used in the different phases of the software engineering process?
  • RQ2: What specific capabilities do these models provide, and how are these capabilities applied within various SE phases?
  • RQ3: What are the main strengths, limitations, and unresolved challenges when using foundation models for software testing?
  • RQ4: Where do current research and tools fall short, and what future directions could help advance foundation model–driven software engineering?
By addressing these research questions, this study offers a clear and structured view of how foundation models are being integrated across the software engineering (SE) lifecycle, along with a focused look at their use in software testing. Previous studies have mapped large language model (LLM) applications, which are a subset of foundation models, to SE activities, but our work goes further by introducing a comprehensive taxonomy that considers how foundation models can be applied throughout every phase of SE. This taxonomy provides a framework for organizing the literature, identifying patterns between model capabilities and lifecycle stages, and uncovering areas that remain underexplored. The in-depth analysis of software testing brings together current practices and tools, while also highlighting gaps and new opportunities often missed in broader reviews. The resulting insights guide both researchers and practitioners in making better use of foundation models for SE tasks, especially in testing, where quality assurance plays a central role. Below is a summary of the main contributions of this work.
  • A phase–capability taxonomy of foundation model use in software engineering. We introduce a two-dimensional taxonomy that connects where foundation models are applied across the software engineering life cycle (for example, requirements, design, implementation, testing, maintenance, and project management) with what capability they provide (such as code generation, summarization, or defect repair). Earlier surveys typically focused on only one of these aspects. Our taxonomy combines both dimensions, allowing us to identify well-studied areas and those that remain underexplored.
  • A systematic review of 224 recent studies with transparent selection and classification. We reviewed more than 500 papers and included 224 that demonstrate concrete uses of foundation models in software engineering. The review follows a reproducible protocol and maps each study to its corresponding life-cycle phase and model capability, offering a structured and up-to-date picture of current research activity.
  • A detailed evidence map of how foundation models support software testing and quality assurance. We organize and analyze prior work into eight testing and QA task families: unit test generation, oracle creation, fault localization, regression testing, UI testing, bug triage, vulnerability detection, and human-in-the-loop QA. For each task, we summarize typical workflows, empirical strengths, and known limitations such as non-determinism, oracle cost, or benchmark leakage.
  • A practitioner-oriented adoption agenda. Based on recurring strengths, limitations, and methodological patterns observed in the reviewed literature, we outline practical recommendations for integrating foundation models into software engineering workflows. These include retrieval-augmented prompting, execution- or verification-in-the-loop strategies, task-specific adaptation, and safeguards against data leakage and bias [68].
Together, these contributions provide both a structured view of how foundation models are used across the software engineering life cycle and practical insights to guide future research and industry adoption.

2. Novelty and Significance

This study goes beyond describing existing work. It contributes a new way to understand and evaluate how foundation models are being used in software engineering.
(1) Phase–Capability Taxonomy. Previous surveys often grouped studies by either the stage of the software life cycle or by the type of model capability. In this paper, we link both aspects to form a two-dimensional taxonomy that shows which capabilities are being applied at each stage. This approach reveals which areas, such as architecture design or project management, have received little attention and where research is already mature, such as code generation and unit test creation. To our knowledge, no prior work provides such a structured and reproducible evidence map.
(2) Task-Level Synthesis of Foundation Model Support for Testing and QA. Testing is usually treated as one broad category in prior work. We decompose it into eight specific testing and quality assurance tasks and summarize how foundation models contribute to each. This helps identify where they perform well, where they show promise but remain unstable, and where there is little evidence of benefit. This task-level synthesis transforms the general idea of “LLMs for testing” into a concrete and practical landscape.
(3) A Research and Adoption Agenda Grounded in Literature and Practice. This study combines insights from 224 publications to propose practical strategies for using foundation models in software engineering. The agenda emphasizes improving reliability and trust through retrieval-based methods, feedback from execution results, focused model adaptation, and evaluation that accounts for data leakage. These recommendations link research gaps with real engineering needs in a clear and actionable way.
This work offers an integrated view of how foundation models are currently being applied, where further research is most needed, and how professionals can adopt them responsibly. It serves not only as a review of existing studies but also as a roadmap that helps both researchers and practitioners understand and apply foundation models more effectively within software engineering.
The main contribution of this work is a structured method for classifying and analyzing how foundation models are used across the SE lifecycle. The method introduces a two-dimensional taxonomy that connects each phase of software engineering (such as requirements and design) with the specific capability of the foundation model applied in that phase (for example, code generation and summarization). This structure offers a unified perspective on how foundation models support various software tasks, addressing a gap found in earlier studies.
Motivational example: Consider the task of test generation. Some studies describe using language models to create unit tests, while others apply them to generate acceptance tests based on requirements. Without a clear taxonomy, these efforts seem disconnected, even though they rely on the same underlying capability, applied at different points in the development process. By placing both within the same capability category but in different phases (implementation versus quality assurance), the proposed taxonomy clarifies their relationship and reveals new possibilities, such as expanding test generation toward integration or security testing.
This example shows how the taxonomy not only organizes previous research but also helps identify gaps and guides more systematic progress in applying foundation models within software engineering.
The remainder of this paper is organized as follows. Section 3 reviews background concepts and related work, including prior surveys. Section 4 outlines the methodology for the literature review and data collection. Section 5 presents the proposed taxonomy. Section 6 delivers an in-depth analysis of the application of FMs in software testing. Section 8 outlines the challenges and future research directions, while Section 9 presents the conclusion of the paper.

3. Background and Related Work

3.1. Foundation Models in Software Engineering

FMs are large pretrained models adapted to a diverse array of downstream tasks [1]. In software engineering (SE), code-focused FMs (e.g., Codex/GPT, StarCoder, CodeT5+) support code generation, translation, summarization, repair, and test generation across multiple languages and ecosystems [6,8,59,69,70,71,72,73]. Beyond correctness-oriented generation, performance-aware approaches (e.g., E-code) combine pretrained models with an expert encoder group and efficiency-first selection to prioritize low-runtime code while maintaining quality [74]. Reported benefits include productivity gains and broader test/bug coverage, tempered by risks such as hallucinated code, sensitivity to prompts and seeds, and licensing/compliance concerns for both training data and generated outputs [6,8,12,75,76,77,78,79,80]. Recent empirical studies find that state-of-the-art AIGC (Artificial Intelligence Generated Content) detectors perform markedly worse on code than on natural language and only improve with domain-specific fine-tuning [81,82,83,84]. At the system level, serving choices significantly affect resource utilization: CUDA-backed configurations generally reduce energy and time relative to CPU, with TORCH + CUDA most efficient and ONNX/OV beneficial when constrained to CPU-only deployments [85,86,87].
Empirical studies show mixed but improving results in tasks like automated unit test generation, refactoring, and program repair [35,41,51,54,88,89,90,91,92,93,94,95]. Recent surveys synthesize these trends and emphasize the need for stronger evaluation practice and leakage-aware datasets [19,20,96,97]. Complementing detector-style analyses, transfer-learning for vulnerability prediction shows that using contextual word-level embeddings from Transformer models can match fine-tuning accuracy while cutting training/inference cost—outperforming sentence-level features [98]. Beyond architectures and datasets, operator-level advances (e.g., quantum-inspired high-order products that enable in-place fine-tuning with negligible overhead) may further improve downstream SE tasks by making task-specific adaptation more efficient [99,100].

3.2. Software Engineering Phases

To understand where foundation models (FMs) add value, we use a simple view of the software lifecycle. It includes the following phases: Requirements, Design/Architecture, Implementation/Coding, Quality Assurance, Maintenance/Evolution, Project/Process Management, and Other (work that spans several phases). Many past surveys focused only on one side—either the SE phase or the model’s ability. Here we connect both: where the model is used and what it does. This makes it easier to see which areas are active and which remain less explored [19,20].
Table 1 lists examples of how FMs are used in each phase of software engineering. This simple layout helps show patterns that were previously hidden in long lists of studies.
As Table 1 shows, most research still centers on coding, testing, and maintenance. Work in early phases like requirements and design, and in management areas, is growing but still limited. This shows that foundation models are mostly used for code-related tasks today, while planning and architectural work remain open areas for exploration.

3.3. Existing Surveys and Gaps

Several surveys have summarized the use of LLMs in SE [20,21,64,66,67,126,137,138,139,140,141,142,143,144], while complementary empirical studies analyze real-world challenges in LLM-based projects—for example, Cai et al. examined nearly 1000 GitHub issues across 15 open-source LLM projects to derive taxonomies of issues, causes, and solutions [145]. Hou et al. [19] reviewed 395 studies (2017–2024), classifying architectures (encoder-only, encoder–decoder, decoder-only), dataset sources and preprocessing, optimization and evaluation techniques, and mapping applications to six SE activities (requirements, design, development, quality assurance, maintenance, and management). While comprehensive, testing is largely subsumed under a broad “software quality assurance” umbrella; the review does not provide a capability–by–phase taxonomy nor a focused analysis of the testing phase.
In the context of software testing, Wang et al. [20] reviewed existing methods and tools. They emphasized structured test generation using task decomposition and templates. Their work highlights the value of keeping execution and verification in the loop: automatically compiling or running the generated code and tests, checking them with verifiers or static analyzers, and feeding the results back to the model to improve subsequent attempts. They also warn about data leakage and call for clear reporting of prompts, seeds, and model versions to ensure reproducibility. Complementary to leakage-aware evaluation, COCO targets robustness specifically by turning code features into added prompt constraints and checking for semantic consistency between original and concretized instructions, reporting markedly higher inconsistency detection than paraphrasing and translation-pivoting baselines and enabling robustness gains via fine-tuning [146].
Sasaki et al. [64] catalogued methodological support for prompting by reviewing prompt engineering patterns in SE. Their synthesis identifies reusable prompt structures and interaction patterns, but does not map these techniques to SE lifecycle phases or link them to capability-level outcomes. Complementing prompt-pattern catalogs, Ma et al. introduce Requirement-Oriented Prompt Engineering (ROPE), a requirement-centered training paradigm that significantly improves novice prompt quality and downstream outputs and shows that requirement quality strongly predicts LLM output quality [147].
Building on these surveys, we address four open gaps:
  • Phase–capability linkage. Prior work lacks a taxonomy that binds where in the lifecycle a contribution lands to what capability it exercises. We introduce such a two-dimensional taxonomy in Section 5 and use it to analyze 224 included studies.
  • Depth on testing. Existing reviews either aggregate testing under QA [19] or discuss techniques without a consolidated evidence map [20]. Section 6 organizes testing into eight task families, summarizes methods and datasets, and indicates where evidence is strong versus thin.
  • Leakage-aware, reproducible evaluation. Prior surveys highlight risks of data leakage and inconsistent reporting [19,20]. We foreground these issues in our analysis by synthesizing evidence on dataset comparability, prompt/seed sensitivity, and reporting practices, and by emphasizing the need for transparent reporting standards.
  • Actionable challenges → opportunities. Beyond listing limitations, we map cross-cutting challenges (variance, oracle cost, grounding, deployability, trust) to concrete opportunities (structure+retrieval, execution/verification in the loop, task-specific adaptation, leakage-aware evaluation, collaboration patterns) with exemplars (Section 8).
This positioning complements Hou et al. [19] by adding a capability–phase scaffold and extends Wang et al. [20] by providing a consolidated testing taxonomy and an evidence-oriented discussion tied to evaluation pitfalls and reporting practices highlighted in the literature.

4. Methodology

This systematic review examines peer-reviewed research published between 2023 and 2025 that investigates the use of foundation models in software engineering. Studies were collected from major digital libraries commonly used in software engineering research, including IEEE Xplore, the ACM Digital Library, ScienceDirect, and SpringerLink. The search targeted both journal articles and conference papers. An initial pool of records was identified using structured keyword queries related to foundation models and software engineering tasks. Studies were then refined through inclusion and exclusion criteria, screening titles and abstracts, removing duplicates, and performing full-text assessment following PRISMA guidelines, resulting in a final set of 224 studies.
This section explains the methodology used to answer the research questions introduced in Section 1. The approach includes two connected parts: (1) a systematic literature review (SLR) to identify and categorize how foundation models are used across the software engineering (SE) lifecycle, and (2) a review of tools and datasets that apply foundation models to different SE tasks.
The objective of the SLR was to identify, classify, and analyze studies that apply FMs across different SE phases. We followed PRISMA guidelines to ensure a transparent and reproducible process [148]. This systematic review was conducted in accordance with the PRISMA 2020 guidelines. The PRISMA checklist is provided in the Supplementary Materials, and the study selection process is documented using a PRISMA flow diagram (Figure 1).
The search was conducted using IEEE Xplore, ACM Digital Library, SpringerLink, and ScienceDirect, covering publications from January 2023 to July 2025. Representative search strings combined FM-related terms with SE lifecycle terms, for example:
(“foundation model” OR “LLM” OR “large language model” OR “code generation” OR “generative AI”) AND (“software engineering” OR “software development” OR “requirements” OR “design” OR “implementation” OR “software testing” OR “maintenance”)
To ensure comprehensive retrieval, the search strings were refined iteratively through pilot searches and snowballing. Candidate keywords and task descriptors were extracted from an initial set of influential papers on foundation models in software engineering. This refinement process helped include terminology such as code synthesis, automated program repair, test generation, and requirements analysis, ensuring broader coverage of FM-related SE tasks while maintaining reproducibility of the search process.
To guarantee that the Boolean query captured the most relevant terminology, we performed an initial scoping and snowballing phase before finalizing the search string. Specifically, we reviewed a small set of early and frequently cited papers on LLM or FM applications in software engineering (e.g., code generation and bug detection) and examined both their references and citing papers to extract recurring task descriptors and model-related keywords. This iterative review surfaced key synonyms such as “large language model,” “foundation model,” “code synthesis,” “automated program repair,” and “test generation,” which were then consolidated into the final query. During formal database searches, we also applied light forward snowballing—checking the citation network of included studies—to capture newly published but conceptually related papers. This combined approach provided a good balance between breadth and detail while ensuring the results could be reproduced.

We intentionally focused on studies published between 2023 and 2025 because foundation models, particularly LLMs, have only recently begun to be applied at scale in software engineering contexts. Earlier studies (pre-2023) largely explored traditional deep learning or smaller transformer models rather than the modern class of instruction-tuned foundation models that define the current generation of tools such as ChatGPT, Copilot, and Gemini. Restricting the review to this recent window allows a coherent comparison across works that share similar model capabilities, deployment practices, and evaluation settings. We acknowledge that this decision introduces a potential recency bias and may exclude some pioneering but pre-foundation work; however, those formative studies are referenced in our background section to provide historical context. Future extensions of this review could widen the time frame once the literature on foundation models stabilizes.
Inclusion and exclusion criteria. Inclusion:
  • The paper applies a foundation model/large-scale pretrained model (LLM/FM; encoder-only, encoder–decoder, or decoder-only) to a concrete SE task.
  • The contribution maps to at least one SE lifecycle phase in our taxonomy (requirements, design/architecture, implementation, testing/QA, maintenance/evolution, or project/quality management).
  • The paper reports sufficient methodological detail (task, data, procedure) and an empirical evaluation (e.g., on realistic benchmarks such as HumanEval, Defects4J, SF110, industrial logs, or comparable datasets) [149].
Exclusion:
  • Work that only improves or analyzes LLMs/FMs themselves (training tricks, NAS, alignment, ethics) without an SE task or artifact.
  • Out-of-scope domains with no SE activity (e.g., medical EEG/robot cognition, generic NLP classification, legal analysis without SE artifacts).
  • Position/vision papers without empirical validation; non-English publications; and duplicates.
Study selection and classification. The selection proceeded in three passes:
  • Title/abstract screening against the criteria above.
  • Full-text review for papers marked as potentially relevant.
  • Deduplication across venues and years.
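As an illustration of the title-level deduplication step, the following Python sketch normalizes titles before comparing them across database exports; the record titles below are invented examples, not entries from the actual search results.

```python
# Hypothetical sketch: title-level deduplication across database exports.
# Titles are normalized (case, punctuation, whitespace) before comparison.

import re

def normalize(title):
    # Lowercase, replace punctuation with spaces, collapse whitespace.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", title.lower())).strip()

ieee = ["LLMs for Unit Test Generation", "Prompting for Program Repair"]
acm = ["LLMs for unit-test generation", "Retrieval-Augmented Code Review"]

seen, merged = set(), []
for title in ieee + acm:
    key = normalize(title)
    if key not in seen:
        seen.add(key)
        merged.append(title)

print(len(merged))  # 3 unique titles; the cross-source duplicate is removed
```

Normalizing before comparison is what allows duplicates with different casing or hyphenation to be caught, as in the title-level overlaps reported later in this section.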
We did not include arXiv in our search to avoid double counting preprints that later appear in peer-reviewed venues and to ensure that all included studies represent stable, citable, and peer-reviewed versions. Preprints also lack consistent versioning and metadata, which may compromise reproducibility of the review.
Two reviewers independently screened all titles/abstracts and full texts. Disagreements were resolved through discussion. Inter-rater reliability was assessed using Cohen’s kappa:
  • Title/abstract screening: κ = 0.81 (almost perfect agreement).
  • Full-text screening: κ = 0.87 (almost perfect agreement).
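For reference, Cohen’s kappa can be computed as in the following Python sketch; the screening decisions below are made-up examples, not the review’s actual screening data.

```python
# Illustrative sketch: Cohen's kappa for two reviewers' include/exclude decisions.

from collections import Counter

def cohens_kappa(r1, r2):
    assert len(r1) == len(r2)
    n = len(r1)
    # Observed agreement: fraction of items where both reviewers agree.
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    # Expected chance agreement, from each reviewer's label marginals.
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (p_o - p_e) / (1 - p_e)

r1 = ["include", "exclude", "include", "include", "exclude", "exclude"]
r2 = ["include", "exclude", "exclude", "include", "exclude", "exclude"]
print(f"kappa = {cohens_kappa(r1, r2):.2f}")
```

In this toy example kappa is about 0.67; values above 0.80, such as those reported above, are conventionally interpreted as almost perfect agreement in the Landis and Koch scale.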
To illustrate the classification logic for borderline cases, consider studies that apply foundation models to generate architectural summaries or design-level abstractions from source code. Although such approaches operate on implementation artifacts, we classified them under the Design phase rather than Implementation, because their primary output supports architectural understanding and design reasoning rather than code construction or modification. In general, classification decisions were driven by the intended lifecycle role of the study’s main output, rather than by the input modality or the technical mechanism used by the model.

4.1. Quality Appraisal

We evaluated the methodological quality of the included studies using a lightweight checklist adapted from Kitchenham et al. [150]. The checklist assessed the following criteria: (1) clarity of the research objective, (2) adequacy of dataset description, (3) methodological transparency, (4) rigor of empirical evaluation, and (5) reproducibility of the study. Each study was rated as Yes, Partially, or No for each item. Quality assessment was not used as an exclusion criterion; rather, it informed the interpretation of the evidence and helped identify common reporting limitations in the reviewed literature.
The IEEE Xplore search returned 185 records. After title/abstract screening, 102 (55.1%) were retained for inclusion and 83 were excluded. The most frequent exclusion reasons were not an SE task (e.g., domain-specific NLP, robotics) and improves the LLM only (no SE artifact). One title-level duplication with ACM was identified and removed from the merged dataset.
Each included study is then assigned to a cell in a two-dimensional taxonomy: SE phases (rows) × FM capabilities (columns: code generation, summarization, translation, repair, test generation, bug/defect detection, requirements, architecture/design, other). For studies spanning multiple phases, we record a primary phase based on the dominant objective and tag secondary phases; when reporting counts, multi-phase studies contribute to all tagged cells, while the narrative synthesis refers to the primary tag.
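The counting rule described above, where multi-phase studies contribute to every tagged cell while narrative synthesis follows the primary tag, can be sketched as follows; the study records and labels are illustrative:

```python
from collections import Counter

# Illustrative records: each study has one primary phase, optional
# secondary phases, and one capability label from the codebook.
studies = [
    {"primary": "Testing/QA", "secondary": [], "capability": "Test generation"},
    {"primary": "Implementation", "secondary": ["Testing/QA"], "capability": "Code generation"},
    {"primary": "Maintenance", "secondary": [], "capability": "Repair"},
]

cell_counts = Counter()     # multi-phase studies count in every tagged cell
primary_counts = Counter()  # narrative synthesis uses the primary tag only
for s in studies:
    for phase in [s["primary"], *s["secondary"]]:
        cell_counts[(phase, s["capability"])] += 1
    primary_counts[s["primary"]] += 1
```

Note that cell totals can exceed the number of studies (here 4 cell entries for 3 studies), which is why table counts and narrative counts are reported separately.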
The ACM Digital Library search returned 112 records. After title and abstract screening, 97 papers (86.6%) met the inclusion criteria and 15 were excluded. The main exclusion reasons were the same as in the IEEE Xplore tranche: not an SE task or focused only on model improvement. No DOI-based duplicates were found between the two sources. However, during the merge process, two title-level duplicates were detected—one overlapping with the IEEE set and one internal to the ACM list—and both were removed. After deduplication, the combined IEEE and ACM dataset contained 197 unique papers. Each included study was classified using the same two-dimensional taxonomy of SE phases (rows) and FM capabilities (columns) to maintain consistency across tranches.
The ScienceDirect search, restricted to the Journal of Systems and Software and Information and Software Technology, returned 223 research articles. Title and abstract screening followed the same inclusion and exclusion criteria used for IEEE and ACM sources, focusing on studies using LLMs for software engineering tasks throughout the development lifecycle. This process retained 65 (29.1%) papers, with most exclusions due to not an SE task (e.g., domain-specific NLP in biomedicine or law) or LLM improvement only without direct SE application. Deduplication against the IEEE and ACM included sets yielded no overlaps, leaving all 65 papers as unique additions. Each included study was then classified using the same two-dimensional taxonomy (SE phases × FM capabilities) to ensure comparability across sources.
The SpringerLink search (filtered to Research articles) returned 15 records. Applying the same title/abstract screening criteria used for IEEE, ACM, and ScienceDirect retained 14 (93.3%) and excluded 1 (6.7%), because it did not address an SE task or artifact. Deduplication against the existing corpus yielded no overlaps, so all 14 were unique additions. Each included study was classified using the same two-dimensional taxonomy (SE phases × FM capabilities). With SpringerLink incorporated, the working total rises to 276 candidate studies.
We chose IEEE Xplore, the ACM Digital Library, ScienceDirect, and SpringerLink because together they cover most peer-reviewed research in software engineering, providing broad and representative coverage. Table 2 summarizes the search results. The cumulative retained column reflects how many unique papers remain after merging and deduplication at each stage.
Following the title/abstract stage, we conducted full-text screening of the remaining 276 candidate studies to assess their eligibility against the inclusion and exclusion criteria (Section 4). Each paper was reviewed in detail. This process ensured that all included studies explicitly applied foundation models to concrete SE tasks and reported sufficient methodological detail to support analysis.
Of the 276 papers reviewed at the full-text level, 224 met all criteria and were retained for data extraction and classification. Papers were excluded primarily because they focused on model development without an SE task, lacked empirical evaluation, or addressed out-of-scope application domains. The final corpus for analysis therefore comprises 224 included studies, which form the evidence base for our taxonomy and subsequent analysis. As shown in Figure 1, our selection pipeline proceeds from identification to inclusion.
Following the PRISMA 2020 guidelines [148], we tracked each exclusion decision and its rationale using a shared screening spreadsheet. Figure 1 visualizes the four stages: identification, screening, eligibility, and inclusion. To increase transparency, Table 3 provides a breakdown of exclusion reasons at both title–abstract and full-text stages. Most removals during screening were due to studies that were not focused on software engineering (e.g., biomedical or legal NLP), while full-text exclusions were typically because of missing empirical evaluation or focusing on model training rather than SE tasks.

4.2. Representative Tools and Datasets

As the taxonomy and task-level analysis show, FMs are concentrated in code-adjacent testing activities. To better understand the ecosystem supporting these advances, we now survey representative FM-based tools and datasets that form the empirical backbone of current research.
Table 4 highlights widely studied assistants, including Codex [6], StarCoder [8], GPT-4 [7], CodeT5+ [10], RepairAgent [151], Code Llama [9], and AlphaCode [152], which collectively span phases from implementation to automated repair. Table 5 lists commonly used benchmarks, such as HumanEval [6], BigCodeBench [8], Defects4J [151], and TFix [10] for program repair. More recently, ConDefects [96] has been proposed to address leakage concerns, while MBPP [4], CodeXGLUE [153], and CodeContests [152] extend coverage to general programming, multi-task evaluation, and competitive programming. Together, these tools and datasets reinforce our taxonomy findings: empirical evaluations concentrate on implementation and code-adjacent testing tasks, with limited support for system-level or User Interface (UI) testing. This uneven coverage highlights where foundation model research remains concentrated—primarily in implementation and testing—and where earlier lifecycle phases such as requirements, design, and project management are still underexplored, indicating clear opportunities for future work.
A formal risk-of-bias assessment was not conducted because the included studies are heterogeneous empirical software engineering works for which standardized risk-of-bias tools are not applicable. For the same reason, no effect-size measures were defined: the review synthesizes heterogeneous qualitative and quantitative outcomes without producing pooled estimates. Consequently, no statistical meta-analysis, heterogeneity analysis, or sensitivity analysis was performed, as each presupposes a pooled quantitative synthesis.
Taken together, the surveyed tools and datasets provide the infrastructure on which most evaluations are conducted, reinforcing our taxonomy findings (RQ2) and highlighting gaps (RQ4), particularly in system-level testing, UI acceptance, and human-in-the-loop contexts.

5. A Two-Dimensional Taxonomy of FM Use in SE

We categorize prior work along two orthogonal axes:
Axis 1—SE Phases.
We adopt seven coarse-grained phases that capture where a study primarily contributes: Requirements, Design/Architecture, Implementation/Coding, Testing/QA, Maintenance/Evolution, Project/Process Mgmt, and Other (studies spanning multiple phases or not mapping cleanly to a single phase).
Axis 2—FM Capabilities. We classify what the foundation model (FM) is used for: Code generation, Summarization, Translation, Repair, Test generation, Bug/defect detection, Requirements, Architecture/design, and Other. We keep these capability labels aligned with our screening codebook so the taxonomy is reproducible.
In this review, we use the term capabilities to refer to what the foundation model actually does, such as generating code, summarizing artifacts, or detecting defects. This is different from phases, which describe where in the software engineering lifecycle the task takes place (for example, requirements, design, implementation, or testing). Some lifecycle terms are sometimes used in the literature as task labels, which may cause confusion. In our taxonomy, phases and capabilities are kept separate: phases indicate the stage of the lifecycle, and capabilities describe the model’s function.
Why two axes? Earlier classifications usually focused on only one dimension. Some were phase-centric, grouping studies by software engineering (SE) lifecycle stages such as testing or maintenance. Others were capability-centric, organizing research by what the model does, for instance, code generation or summarization. Our taxonomy combines both perspectives by linking where a model is applied (the SE phase) with what capability it demonstrates. This two-axis view highlights areas of high research activity and uncovers underexplored intersections. For example, there is strong attention to bug and defect detection within testing and QA, whereas few studies address summarization or translation within design and architecture. The approach also remains compatible with prior surveys while providing finer granularity and clearer comparisons across research domains [1,19,20].

5.1. Taxonomy Table and Summary

Table 6 reports counts of included studies by Primary SE Phase (rows) and FM Capability (columns). Counts are based on our final included set (n = 224). The table is auto-generated from our screening sheet to ensure traceability. As shown in Figure 2, the heatmap illustrates the distribution of included studies across SE phases and FM capabilities, with a clear concentration in implementation and testing; together with Table 6, it quantitatively grounds the proposed phase–capability taxonomy. The concentration of studies in the testing and implementation phases, contrasted with sparse coverage of design and architectural activities, highlights structural research imbalances rather than incidental gaps. These patterns motivate the analytical discussion of underexplored areas in the subsequent sections.
We hypothesize that the limited coverage of design and architectural activities stems from several structural factors rather than lack of relevance. These phases require long-range abstraction, cross-artifact reasoning, and stable representations of design intent, which remain challenging for current foundation models. In addition, the absence of standardized benchmarks and quantitative evaluation criteria at the design level constrains empirical study. These factors help explain the observed imbalance and point to concrete directions for future research.
High-level distribution. By phase, the largest clusters are Testing/QA (64 studies) and Implementation/Coding (58), followed by Maintenance/Evolution (44), Requirements (19), Project/Process Mgmt (14), Other (13), and Design/Architecture (12). By capability, Code generation (57 studies) leads, with a sizable Other category (43), followed by Bug/defect detection (39), Summarization (20), Test generation (19), Repair (19), Requirements (17), Translation (6), and Architecture/design (4).
Prominent intersections. The most frequent phase × capability pairings are: (i) Implementation/Coding × Code generation (43), (ii) Testing/QA × Bug/defect detection (28), (iii) Testing/ QA × Test generation (19), (iv) Maintenance/Evolution × Repair (17), and (v) Requirements × Requirements (14). These confirm an emphasis on developer assistance and quality assurance tasks where FMs can produce, transform, or assess code and artifacts at scale.
These patterns align with the ecosystem snapshot in Section 4.2, where a small set of widely used tools and benchmarks (Table 4 and Table 5) concentrates evidence in implementation and code-adjacent testing tasks.

5.2. Insights and Gaps

Across the 224 reviewed studies, most research continues to focus on code-related phases such as Implementation and Testing/QA. This trend reflects that language models work best on tasks involving clear structure or explicit code, such as generation, debugging, and repair, where abundant data and evaluation tools are available. In contrast, earlier phases like Requirements and Design have received much less attention, partly because current model designs struggle with abstract reasoning, linking information across artifacts, and supporting creative design work. Overall, this imbalance shows both the relative maturity of automation in coding and testing and the ongoing opportunity to expand model use toward higher-level reasoning in software engineering.
  • RQ3—Strengths, limitations, and open challenges in testing.
With 64 testing papers, the strongest areas are Bug/defect detection (28) and Test generation (19), followed by smaller pockets in Code generation (6), Summarization (2), Repair (2), and Translation (1). Key strengths include rapid generation of test inputs and oracles, and leveraging FMs for defect localization or classification. Open challenges include: (1) Reliability & reproducibility (nondeterminism, prompt/seed sensitivity), (2) Coverage & adequacy (behavioral coverage beyond simple metrics), (3) Ground-truth & evaluation (scarcity of unbiased, realistic benchmarks), (4) Security & robustness (vulnerability injection/omission, adversarial prompts), and (5) Cost & governance (token, privacy, and IP concerns in CI/CD).
  • RQ4—Gaps and future directions.
We observe several underexplored intersections (cells with ≤1–2 papers) that are promising:
  • Design/Architecture × Summarization/Translation: using FMs to summarize design rationales, migrate architecture documentation, or translate between modeling notations; preserving consistency across heterogeneous models in low-code/model-driven settings [155]; and generating development artifacts such as traces from models [155,156].
  • Requirements × Summarization/Translation: accelerating requirements triage (e.g., using LLMs to locate and classify requirements in agile backlog items [157]), de-duplication, and multilingual elicitation with traceability preservation [157,158].
  • Project and Process Management: applying foundation models to project planning, risk assessment, and incident analysis—going beyond the earlier ad hoc or miscellaneous “Other” applications noted in prior studies. For example, FAIL (Failure Analysis via Intelligent Learning) uses LLMs to automatically collect, cluster, and analyze software failure and incident reports. These analyses produce postmortem-style corpora—collections of structured summaries written after incidents occur—which can then be compared across organizations to identify recurring causes and improvement opportunities. Recent studies also use LLMs to analyze CI/CD pipelines, summarize process issues, and identify governance or planning problems [159,160].
  • Testing/QA × Summarization: assisting test oracle explanation and failure report condensation for developer handoff. For example, LLMPrior clusters and prioritizes crowdsourced textual test reports to reduce reviewer reading load and speed triage [161].
  • Maintenance/Evolution beyond Repair: structured refactoring [38,162,163], architecture conformance checking, and migration tasks that combine code and design knowledge; additionally, change summarization for evolution via commit-message generation [164,165].
  • Domain-specific expansion: Applying and tailoring FM techniques to specialized domains beyond general software, such as hardware design (e.g., VHDL [166], Verilog [167]), geospatial programming [168], and control code generation from images [169]. This expansion necessitates the creation of specialized benchmarks and prompts to handle domain-specific syntax, semantics, and constraints [170].
From a methodological perspective, there is also a clear need for: (1) stronger, task-focused benchmarks that reduce data leakage and better reflect real development settings; (2) longer-term and human-in-the-loop studies to evaluate developer outcomes such as quality, efficiency, and learning impact; (3) transparent reporting of experimental details, including prompts, random seeds, and model versions; and (4) studies that involve multiple types of software artifacts (for example, code, tests, requirements, and telemetry) to understand how foundation models reason across different sources of information.
The taxonomy shows that most research still focuses on code-centered tasks such as generation, testing, and repair, while early design and process-oriented support remain limited. We recommend developing new datasets and evaluation methods that connect these phases, while ensuring reproducibility and following solid research governance practices.

6. In-Depth Analysis: Foundation Models in Software Testing

This section provides an in-depth analysis of how foundation models are applied to software testing and quality assurance. It organizes prior work using a task-level taxonomy and synthesizes empirical findings across major testing activities. Methodological foundations and evaluation practices are discussed separately in Section 7, which also synthesizes the cross-cutting challenges and opportunities that emerge across testing tasks.
As discussed in Section 5, Testing/QA is the most represented phase in our dataset. Accordingly, this section focuses on organizing FM-driven testing tasks (RQ1–RQ2) and synthesizing empirical patterns observed across the literature. Testing-specific challenges and opportunities are discussed only to the extent that they arise directly from these tasks, while broader lifecycle-wide issues are addressed later (RQ3–RQ4).
Importantly, we emphasize how testing outputs such as generated tests, localized faults, and vulnerability reports propagate to downstream phases including maintenance, security, and process decision-making.
To reduce fragmentation across the literature, we group existing work by the primary testing task being supported. This task-level organization helps clarify where foundation models are most effective, where limitations persist, and how different approaches relate to one another. We categorize the use of foundation models in software testing into eight main groups, ranging from code-focused tasks to system-level activities. (Translation tasks in testing include converting natural-language bug reports into executable tests, transferring tests across frameworks or programming languages, and migrating checkers or assertions between test harnesses; for example, report-to-test or framework-to-framework migration.)
To give readers a clear overview before discussing each group in detail, Table 7 summarizes the main testing tasks supported by foundation models, outlining their main functions, representative systems, strengths, and recurring challenges. The table offers a quick comparison across different areas of work, showing both the variety of testing goals and the recurring patterns in how these models are used. It highlights not only current shortcomings but also the tangible progress made within each group.
Several trends emerge from Table 7. Unit test generation and fault localization show strong results, with foundation models reaching higher coverage and accuracy than traditional methods. Property and oracle generation show how these models can extend human-written specifications, though they are still affected by verification delays and the cost of building oracles. Differential and UI or acceptance testing illustrate that using execution feedback and retrieval-based methods can make these models practically useful at the system level, even if latency and test instability remain concerns. Lastly, security-related and human-factor studies stress the importance of clarity and seamless workflow integration, showing that successful use depends as much on trust and usability as on technical precision.
Across these task categories, several recurring design principles emerge that explain performance differences across testing contexts. Code-centric tasks such as unit test generation and fault localization benefit most from structured prompting and fine-tuning, where local syntax and semantics dominate. In contrast, system- and UI-level testing require retrieval and execution feedback to ground model reasoning in dynamic state. Oracle-free tasks, such as differential testing, are more tolerant to generation noise but incur higher execution cost, while security testing favors hybrid architectures that constrain model outputs for precision. These patterns help explain why no single FM-based technique generalizes uniformly across all testing tasks.

6.1. Unit Test Generation

Foundation models generate compilable tests with nontrivial coverage, but outcomes hinge on prompting and model choice [54,88,89]. Adding structure—for example, method slicing through the HITS (Hierarchical Test Slicing) framework or structured seed cases using STRUT (Structured Unit Test Templates)—consistently improves executability and coverage [75,76]. Beyond method-level studies, AgoneTest automates end-to-end generation and assessment of class-level JUnit suites and reports ∼75% compilation and ∼34% passing rates on real Java projects [174]. Kang et al. introduce Libro, a report → test generator that produces bug-reproducing JUnit tests from natural-language bug reports, reproducing about one-third of Defects4J cases and ranking candidates to reduce developer inspection effort [200]. These generated tests directly support maintenance and evolution by enabling regression testing and validating subsequent code changes.

6.2. Property & Oracle Generation

Foundation models have been explored as a means of reducing the manual effort required to construct test oracles and formal properties, particularly in settings where explicit specifications are incomplete or unavailable. In this context, LLMs are primarily used to draft candidate properties, invariants, or metamorphic relations that can subsequently be refined through automated verification or human review. This reframes oracle construction as a semi-automated process, where model-generated artifacts serve as starting points rather than final ground truth.
LLMs draft formal properties and oracles that become actionable when paired with verifier- or tool-in-the-loop refinement, reducing but not eliminating syntax and semantic issues [176,177]. In safety-critical simulation settings, a human–AI hybrid metamorphic testing workflow uses GPT to instantiate simulator-specific metamorphic relations and an automated harness; on CARLA it exposed four previously unknown defects while directly addressing the oracle problem [61]. Despite these advances, oracle generation remains constrained by verification latency and domain specificity, reinforcing the need for hybrid pipelines.
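To make the metamorphic idea concrete, a metamorphic relation checks a property that relates outputs across transformed inputs instead of requiring an exact expected value. The sketch below uses the textbook sine-shift relation on a numeric function; it is an assumption-laden toy, unrelated to the CARLA workflow cited above:

```python
import math
import random

def mr_sine_shift(f, x, tol=1e-9):
    """Metamorphic relation: f(x + 2*pi) should equal f(x) for any
    correct sine implementation -- no exact oracle is needed."""
    return abs(f(x + 2 * math.pi) - f(x)) <= tol

# Sample random inputs and collect any relation violations.
random.seed(0)
violations = [x for x in (random.uniform(-10, 10) for _ in range(100))
              if not mr_sine_shift(math.sin, x)]
```

In an LLM-assisted pipeline, the model would draft candidate relations like `mr_sine_shift`, which are then vetted by an automated harness or a human reviewer before use.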

6.3. Fault Localization (FL)

Fault localization is one of the most empirically mature applications of foundation models in software testing, largely because it can be framed as a code-centric reasoning task that does not depend on test execution. Recent approaches reformulate localization as a ranking or explanation problem over program statements, allowing LLMs to operate directly on source code even when test suites are incomplete or unavailable. Fault localization benefits from FM-based methods, either test-free adapters on top of code LLMs or fine-tuned LLMCs, both showing clear Top-k gains [178,179,201]. While these methods consistently outperform learning-based baselines, they also reveal limitations related to context length, code segmentation, and dataset leakage, motivating continued interest in hybrid designs that integrate structural program information. Localization results naturally inform the maintenance phase by guiding debugging, repair, and refactoring activities.
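The Top-k metric reported throughout this literature has a simple definition: a bug counts as localized if its faulty statement appears among the model's k highest-ranked candidates. A minimal sketch with hypothetical rankings and ground-truth locations:

```python
def top_k_accuracy(rankings, faulty, k):
    """Fraction of bugs whose faulty statement appears in the top-k
    entries of the model's ranked suspiciousness list."""
    hits = sum(1 for bug_id, ranked in rankings.items()
               if faulty[bug_id] in ranked[:k])
    return hits / len(rankings)

# Hypothetical ranked statement lists per bug and true fault locations.
rankings = {"bug1": ["s3", "s1", "s7"],
            "bug2": ["s2", "s9", "s4"],
            "bug3": ["s5", "s8", "s6"]}
faulty = {"bug1": "s1", "bug2": "s4", "bug3": "s2"}
top3 = top_k_accuracy(rankings, faulty, 3)  # 2/3: bug3's fault is unranked
```

Reported "Top-k gains" compare this fraction against learning-based baselines at the same k.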

6.4. Differential/Regression Testing

Differential and regression testing aim to identify behavioral inconsistencies without relying on explicit correctness oracles, making them well suited for foundation model–guided exploration. In this setting, LLMs are used to steer test generation toward inputs that maximize divergence, often through iterative refinement based on runtime feedback rather than one-shot generation.
Execution-driven prompting with feedback loops makes differential testing practical; MoKav is a representative system [182]. Empirical studies show that execution-in-the-loop strategies outperform classical generators in exposing semantic differences, though their effectiveness depends on reliable execution harnesses and remains sensitive to nondeterminism and scalability constraints.
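The core loop can be sketched as a search for difference-exposing inputs between a reference and a candidate implementation. In systems such as MoKav, divergent inputs are fed back to the LLM to steer the next generation round; this toy random sampler omits that feedback step, and both implementations are illustrative:

```python
import random

def ref_abs(x):
    # Reference implementation of absolute value.
    return x if x >= 0 else -x

def candidate_abs(x):
    # Buggy candidate: returns 1 instead of 0 when x == 0.
    return x if x > 0 else (-x if x < 0 else 1)

def differential_search(ref, cand, rounds=5, batch=50):
    """Sample inputs and keep any that expose a behavioral difference;
    a real execution-in-the-loop system would use these failures to
    guide the next batch of generated inputs."""
    rng = random.Random(42)
    divergent = []
    for _ in range(rounds):
        for _ in range(batch):
            x = rng.randint(-5, 5)
            if ref(x) != cand(x):
                divergent.append(x)
    return divergent

diffs = differential_search(ref_abs, candidate_abs)
```

Because the relation checked is merely "the two outputs agree", no correctness oracle is needed, which is precisely what makes this setting attractive for FM-guided exploration.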

6.5. System/UI Acceptance Testing

System- and UI-level acceptance testing requires reasoning over dynamic interfaces, application state, and external services, making it substantially more complex than unit-level testing. Foundation models address this complexity by translating high-level task descriptions into structured interaction plans, often supported by retrieval mechanisms that ground model outputs in concrete UI artifacts.
Retrieval-augmented LLM planners automate UI and acceptance testing at scale, as in WeChat and LLM4Fin [183,184]. While these systems demonstrate practical viability, challenges related to latency, robustness under UI evolution, and execution cost persist, leading current approaches to favor hybrid architectures over fully autonomous testing pipelines.

6.6. Static Analysis Triage & Semantic Assistance

Static analysis triage focuses on interpreting, prioritizing, and contextualizing analysis results rather than merely detecting defects. In this setting, foundation models are used to bridge semantic gaps arising from incomplete program context, indirect calls, or informal specifications such as comments and documentation.
FMs complement static analysis by filling semantic gaps and assisting triage, e.g., for bug detection and inconsistency rectification [115,185,186]. Complementary to these, RustC4 combines LLM-based constraint extraction with AST checks to detect code–comment inconsistencies across 12 Rust projects (176 real cases; 23 confirmed fixes) [202]. Although promising, these approaches remain language- and task-specific, explaining their comparatively lighter empirical coverage.

6.7. Security Testing & Vulnerability Analysis

FMs aid vulnerability detection through compact learned detectors and LLM+analysis hybrids, including real-time micro-architectural attack detection from HPC signals [203]. For example, SecureFalcon and Hyperion combine model reasoning with program analysis for smart contracts [191,192]. At the same time, code-pretrained detectors such as VulCoBERT (CodeBERT + Bi-LSTM) remain competitive, reporting 66.21% accuracy on Devign and ranking third on the CodeXGLUE defect-detection leaderboard [204,205]. SVA-ICL advances vulnerability severity assessment via code–text in-context learning, outperforming prior SVA baselines on a 12k CVSS-v3 C/C++ corpus [206]. Beyond detection, LLM-assisted repair pipelines such as ContractTinker integrate chain-of-thought prompting with static analyses (dependency graphs, slicing) to synthesize and validate patches for real-world contract vulnerabilities [194]. At the change level, LLM4VFD links code-change intent with linked issues/pull requests (PRs) and historical fixes to detect vulnerability-fix commits, outperforming PLM baselines and releasing a post-2023 dataset to reduce leakage [207]. Emerging threat analyses reveal that “vulnerability propagation attacks”—which seed insecure patterns via few-shot prompts—can persist across generations and sessions, motivating session-aware evaluation and defenses [208]. Finally, LLM-assisted vulnerable-function identification from Common Vulnerabilities and Exposures (CVE) descriptions (e.g., VFFinder) improves reachability analysis and reduces Software Composition Analysis (SCA) false positives (Top-1 27.27%, MRR 0.39; Snyk/DC FPR ↓ to 4.6%/3.7% with Top-50 candidates) [209]. Detected vulnerabilities and severity assessments feed into security governance, patch prioritization, and compliance-related lifecycle decisions.

6.8. Human-in-the-Loop Testing Practice

Human-in-the-loop testing positions foundation models as assistive tools that generate candidate tests, explanations, or prioritizations for developer review. This paradigm shifts the focus from full automation to decision support, emphasizing usability, trust, and interaction quality alongside technical accuracy.
Human-in-the-loop studies show FM assistance improves coverage and defect discovery, but also raises false positives requiring triage [196,210]. Empirical evidence from Copilot further shows that later suggestions are not less likely to be correct, and reviewing multiple (≈4–5) improves the odds of selecting the right one [211]. These findings highlight the importance of workflow integration and trust calibration in FM-driven testing tools. In practice, humans assume distinct roles as reviewers of generated artifacts, validators of correctness, and decision-makers who select among alternatives, with foundation models supporting each role differently. Human feedback collected during testing influences downstream decision-making by shaping tool trust, workflow integration, and adoption strategies.
Overall, the literature shows that foundation models are most effective for code-centric testing tasks, particularly unit test generation and fault localization. Their impact diminishes as tasks require stronger semantic grounding, scalable execution, or high-confidence oracles. These observations motivate a closer examination of the methods and evaluation practices used to support FM-driven testing, which we address next.
Although reported metrics vary across studies, several quantitative patterns recur. Unit test generation studies commonly report compilation rates between 60% and 80% and passing rates around 30–50%, while fault localization consistently shows Top-k improvements over learning-based baselines. System- and UI-level testing emphasizes task completion rate and cost, often trading higher latency for improved coverage. These trends illustrate typical performance ranges and trade-offs rather than absolute benchmarks.
Across testing tasks, foundation models generate artifacts that propagate beyond the testing phase. Test suites, localized faults, vulnerability reports, and human feedback loops directly inform maintenance, security, and process-level decisions, reinforcing the role of testing as a central connector between development, evolution, and governance phases. This cross-phase flow aligns with the proposed phase–capability taxonomy, where testing acts as a key integration point for FM-driven capabilities.

7. Methods, Benchmarks, and Evidence

This section examines how FM-driven testing approaches are implemented and evaluated in practice. Rather than organizing by task, we group methods by recurring design patterns and evidence types, highlighting how grounding, feedback, hybrid analysis, and human involvement are used to improve reliability and deployability.
Retrieval-Augmented Generation (RAG).
Retrieval-augmented generation grounds model outputs in task-specific context. For example, WeChat leverages internal traces/locators for grounded UI planning [183]; LLM4Fin retrieves domain rules to compose executable acceptance scenarios [184]. Beyond UI/business domains, RAG has also grounded HDL debugging against design specifications, enabling iterative identification and correction of Verilog functional bugs [167]. An industrial experience report called RAGVA (Retrieval-Augmented Generation Virtual Assistant) describes the engineering of a RAG-based production assistant and outlines evaluation, testing, and Responsible-AI challenges [212].
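The grounding step shared by these systems reduces to retrieve-then-prompt: fetch the most relevant artifacts, then constrain generation to them. The sketch below uses a toy lexical retriever and an illustrative prompt template; neither resembles the proprietary WeChat or LLM4Fin pipelines:

```python
def retrieve(query, corpus, k=2):
    """Toy lexical retriever: rank snippets by word overlap with the query.
    Production systems use embedding-based similarity instead."""
    qwords = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda s: len(qwords & set(s.lower().split())),
                    reverse=True)
    return scored[:k]

def build_grounded_prompt(task, corpus):
    context = retrieve(task, corpus)
    return ("Use only the context below when generating the test plan.\n"
            + "\n".join(f"- {c}" for c in context)
            + f"\nTask: {task}")

# Illustrative UI documentation snippets serving as the retrieval corpus.
corpus = [
    "Login screen has username field id=user and password field id=pass",
    "Checkout flow requires an authenticated session",
    "Settings page exposes a dark-mode toggle",
]
prompt = build_grounded_prompt("generate a UI test for the login screen", corpus)
```

The retrieved context anchors the model to concrete locators and domain rules, which is what turns free-form generation into executable, environment-specific test plans.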
Structured Prompting/Generation. Recent studies indicate that the effectiveness of foundation models in software testing is strongly influenced by the structure of the prompting strategy rather than relying on unconstrained code generation. Structured prompting approaches introduce intermediate representations, such as schema-driven or seed-based test descriptions, that guide the model toward generating valid, high-quality test artifacts. For example, the STRUT framework proposes generating structured test cases consisting of inputs, expected outputs, and stubbed behaviors before converting them into executable code using rule-based transformations [75]. By decoupling test intent from language-specific syntax, structured prompting mitigates common issues observed in direct LLM-based test generation, including low compilation success rates, inconsistent assertions, and inadequate coverage. Empirical results demonstrate that such approaches substantially improve execution pass rates and both line and branch coverage, particularly for languages with complex semantics. Overall, structured prompting serves as a critical methodological mechanism for improving the reliability, controllability, and practical applicability of foundation model–assisted test generation [76].
Execution Feedback Loops. Differential testing with execution-in-the-loop (MoKav) outperforms one-shot prompting and classical generators on difference-exposing tests [182]. Similarly, in vulnerability repair, VRpilot couples chain-of-thought reasoning with compiler/sanitizer/test feedback to iteratively refine patches and increase correctness [213]. Parallelization studies also highlight performance–correctness trade-offs: Yadav and Mondal benchmarked 23 pre-trained LLMs on PolyBench kernels, finding that while 26.7% of generated code variants achieved higher speedups than the Intel compiler, only 13.5% were functionally correct due to concurrency issues [214].
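The feedback-loop pattern shared by these systems can be reduced to a small skeleton: generate a candidate, execute it, and feed the failure message into the next round. The `fake_model` stub below stands in for a foundation model call and is entirely hypothetical; real systems such as MoKav or VRpilot use richer feedback than a single exception string.

```python
# Minimal skeleton of an execution-in-the-loop refinement cycle. `generate`
# stands in for an FM call; the stub model and the env values are illustrative.

def run_candidate(code: str, env: dict):
    """Execute a candidate; return the failure, or None on success."""
    try:
        exec(code, env)
        return None
    except Exception as e:
        return e

def refine_loop(generate, env: dict, max_rounds: int = 3):
    """Regenerate until the candidate executes cleanly, or rounds run out."""
    feedback = ""
    for _ in range(max_rounds):
        candidate = generate(feedback)
        error = run_candidate(candidate, dict(env))
        if error is None:
            return candidate, True
        feedback = f"{type(error).__name__}: {error}"  # fed back to the model
    return candidate, False

# Stub "model": the first attempt divides by zero; the second reacts to the
# execution feedback and guards the divisor.
def fake_model(feedback: str) -> str:
    if "ZeroDivisionError" in feedback:
        return "assert total // max(n, 2) == 5"
    return "assert total // n == 5"

code, ok = refine_loop(fake_model, {"total": 10, "n": 0})
```

The engineering questions the surveyed papers grapple with — stopping criteria, feedback granularity, sandboxing — all live inside this loop.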
Hybrid Analytical Models. Combining LLMs with program analysis increases practical precision: DApp inconsistency detection fuses LLM text understanding with data- and control-flow analyses [192]; compact, efficient models are tuned for deployable vulnerability detection (SecureFalcon) [191]. Static planners over control-flow graphs (CFGs) reduce semantic blind spots in incomplete code [185].
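A toy version of this hybrid pattern keeps an FM's flagged findings only when a cheap static check agrees, trading recall for the precision that deployment demands. The `calls_eval` check and the flagged snippets are stand-ins, not the SecureFalcon or DApp pipelines.

```python
# Illustrative LLM + analysis hybrid: retain model-flagged snippets only when
# a static AST check corroborates them. Both signals are simulated stubs.

import ast

def calls_eval(source: str) -> bool:
    """Static check: does the snippet call the builtin `eval`?"""
    tree = ast.parse(source)
    return any(
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "eval"
        for node in ast.walk(tree)
    )

def confirm_findings(llm_flags: list[str]) -> list[str]:
    """Keep only the LLM-flagged snippets the static analyzer corroborates."""
    return [src for src in llm_flags if calls_eval(src)]

flags = ["x = eval(user_input)", "x = int(user_input)"]  # model flagged both
confirmed = confirm_findings(flags)                      # analyzer keeps one
```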
Human-in-the-Loop Evidence. Beyond fully automated techniques, empirical evidence increasingly supports human-in-the-loop paradigms in which foundation models act as interactive assistants during testing activities rather than autonomous generators. Controlled experiments comparing LLM-supported and manual unit testing show that testers using LLM assistance generate more test cases, achieve higher coverage, and detect a greater number of defects within fixed time constraints [196]. In these settings, LLMs are used interactively to suggest test scenarios, explore edge cases, and assist with assertion construction, while human testers retain responsibility for interpretation and validation. Although increased productivity may lead to a higher number of false positives, studies report no inherent degradation in test correctness attributable to LLM usage. Instead, results suggest a manageable trade-off between test quantity and precision that can be effectively addressed through human oversight. These findings provide strong empirical support for human-in-the-loop testing as a practical and effective mode of integrating foundation models into real-world software quality assurance workflows.
Benchmarks & Leakage Awareness. Defects4J and BugsInPy remain standard benchmarks for test generation and fault localization (FL); leakage concerns motivate time-sliced or complementary corpora (e.g., ConDefects) and careful reporting [19,20,96]. Common metrics include executability/compilability, coverage (line/branch), mutation score, and distinct bugs found; for FL, Top-k accuracy and rank; for UI and security tasks, cost/latency and precision/recall.
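Several of these metrics reduce to simple ratios over harness counts; the sketch below computes mutation score and line coverage from raw counts, with field names that are illustrative rather than tied to any particular benchmark harness.

```python
# Two of the common reporting metrics as simple ratios. Input names and the
# example counts are illustrative, not a specific harness's output format.

def mutation_score(killed: int, total_mutants: int) -> float:
    """Fraction of seeded mutants detected (killed) by the test suite."""
    return killed / total_mutants if total_mutants else 0.0

def line_coverage(executed: set, executable: set) -> float:
    """Fraction of executable lines exercised at least once."""
    return len(executed & executable) / len(executable) if executable else 0.0

ms = mutation_score(killed=42, total_mutants=60)    # 42 of 60 mutants killed
cov = line_coverage({1, 2, 3, 5}, {1, 2, 3, 4, 5})  # 4 of 5 lines executed
```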

Cross-Cutting Challenges and Opportunities

The challenges discussed in this subsection arise specifically from FM-driven testing practices and their supporting methodologies. Rather than restating the full list, the seven cross-cutting challenges (C1–C7) and corresponding opportunities (O1–O7) are summarized in Table 8, which consolidates them with key references and actionable insights. Briefly, the challenges include prompt and seed variability, costly oracle construction, dataset leakage and comparability issues, scalability for system and UI testing, semantic gaps in static analysis, deployment trade-offs for security tasks, and limited developer trust or workflow fit. The mapped opportunities highlight structured prompting and retrieval, execution-in-the-loop verification, task-specific adaptation, leakage-aware evaluation, human–AI collaboration frameworks, and domain-specific safety and governance for trustworthy integration.
We now briefly synthesize the testing-focused findings with respect to the research questions introduced earlier, mapping the task-level evidence (RQ1–RQ2) to broader lifecycle implications and research directions (RQ3–RQ4).
  • RQ1: Where are foundation models most applied across software testing?
They are primarily used in bug and defect detection, unit test generation, and test maintenance/refactoring, reflecting the strong alignment between LLM capabilities and code-centric testing tasks. These areas benefit from models’ ability to synthesize and reason about code, but coverage of higher-level oracles and non-functional testing remains limited.
  • RQ2: What capabilities of foundation models are most leveraged in testing?
The most common uses involve code generation, summarization, and repair. These are applied in test creation, fault localization, and defect analysis, but issues with reliability and verification still need to be addressed.
  • RQ3: What are the key strengths and limitations of foundation models in software testing?
Key benefits include faster test creation, quicker debugging, and higher developer productivity. At the same time, issues such as limited reproducibility, uneven coverage, biased benchmarks, and the effort needed to integrate with CI/CD systems continue to pose difficulties.
  • RQ4: What research gaps and opportunities remain?
Future work should target underexplored intersections such as design/architecture summarization, cross-phase reasoning, and multi-modal test artifacts. Additional directions include enhancing evaluation rigor, promoting human–AI collaboration, and addressing trust and compliance challenges for real-world deployment.

8. Challenges and Future Directions

This section synthesizes cross-cutting challenges observed across phases (RQ3) and distills a research agenda (RQ4). The proposed opportunities arise directly from limitations repeatedly identified in the reviewed primary studies, translating empirical shortcomings into actionable research directions. We close with threats to validity for our review and for FM-in-testing studies.
As summarized in Table 8, we map the seven cross-cutting challenges (C1–C7) to actionable opportunities (O1–O7) with representative citations. Below, we elaborate on each opportunity and its intended impact.

8.1. Cross-Cutting Challenges

  • C1—Prompt/seed/model variance and harness sensitivity.
Results depend strongly on prompt design, random seeds, and evaluation harnesses; small changes can flip conclusions, especially for unit-test generation [54,88,89]. Reporting of prompts, seeds, sampling policies, and timeouts is still inconsistent.
  • C2—Oracle construction and verification costs.
LLM-generated specifications/properties require syntax and semantic repair; verifier- or tool-in-the-loop refinement improves acceptance but adds latency and engineering overhead [176,177]. Differential testing reduces some oracle burden but introduces runtime infrastructure costs [182].
  • C3—Data leakage and dataset comparability.
Many standard benchmarks predate modern LLMs, risking overlap with pretraining corpora. Leakage obscures progress and hinders fair comparisons across FL/repair/test-gen [19,20,96], and calls into question the validity of traditional NLP-based evaluation metrics [216].
  • C4—Grounding and scalability for system/UI testing.
Large apps require grounding in screens/DOMs, business rules, and workflow graphs. Retrieval-augmented planning helps, but reliability, latency, and cost remain central concerns in production [183,184].
  • C5—Semantic gaps in static contexts.
Static analyses suffer from imprecision (e.g., indirect calls, partial snippets). Planner-style static reasoning and semantic summaries help but are not yet turnkey [185,222].
  • C6—Deployability in security pipelines.
Security settings demand high precision at low latency (often CPU-only). Compact task-adapted models and analysis hybrids are promising but need broader evaluation and lifecycle tooling, and hardware-side accelerators with managed caches offer additional latency relief [191,192,223].
  • C7—Sociotechnical integration and trust.
Human studies show productivity and coverage gains with LLM assistance, but also more false positives and triage load; teams need guidance on when to trust, edit, or discard AI outputs and how to integrate them into IDE/CI [196,224,225].

8.2. Future Research Opportunities

  • O1—Structure over raw prompting.
Systematically encode structure before generation (e.g., method slicing, structured seed cases, schema- or type-aware templates). Expected benefits: higher executability/coverage, fewer degenerate tests, and more stable outcomes [75,76].
  • O2—Retrieval-grounded planning at scale.
Standardize RAG interfaces for testing: UI artifacts (screens/DOM), domain rules, historical traces. Explore caching, indexing, and cost controllers to make large-scale acceptance testing practical [183,184].
  • O3—Execution/verification-in-the-loop.
Tight loops that use execution feedback (differential testing), verifiers, or CFG planners can raise correctness while bounding hallucinations. This approach also extends to hybrid analytical systems that integrate LLMs with other techniques (e.g., Bayesian networks for root cause analysis [226]) for complex, system-level tasks. Needed: reusable controllers, stopping criteria, and safety guards [177,182,185]. Recent work also shows that combining an LLM with a small classifier enables one-shot root cause analysis in cloud-native systems [227].
  • O4—Task-specific adaptation with accountability.
Lightweight adapters/LoRA or focused fine-tunes for FL and related tasks show clear Top-k gains; future work should pair them with explanation artifacts and calibration so developers can assess reliability [178,179].
  • O5—Leakage-aware benchmarking and reporting.
Adopt complementary, time-sliced corpora (e.g., for fault localization and program repair), publish prompts and seeds, and report multi-signal metrics (coverage, mutation, cost/latency, human assessment) to enable fair comparisons. Community artifacts and replication packages should make reproduction routine [19,20,96,228]. Complementary corpus-building pipelines—e.g., GlossAPI—offer practical blueprints for curating, annotating, and serving datasets to support reproducible FM studies [229].
  • O6—Human–AI collaboration patterns.
Design “explain–edit–enforce” workflows: (i) present rationales and diffs aligned to coding standards; (ii) support selective adoption with quick fixes [221], including strategies for evaluating multiple AI-generated options [211]; (iii) integrate risk gates in CI for security-sensitive outputs [191,196].
  • O7—Domain-specialised safety and governance.
For security and finance, codify guardrails (data minimisation, red-team tests, approval workflows) and monitor drift. Compact models tailored to domain artifacts are promising for on-prem and CPU-bound environments [184,191]. Where governance requires on-prem serving, wafer-scale LLM stacks such as WSC-LLM offer a high-throughput option that explicitly balances memory and interconnect resources to meet latency/cost constraints [230].
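The time-slicing recommended in O5 can be as simple as filtering benchmark items by a model's training cutoff, so that every evaluated bug postdates the pretraining data. The record layout and cutoff date below are hypothetical.

```python
# Leakage-aware corpus selection: keep only benchmark items first published
# after the model's training cutoff. Record fields and dates are illustrative.

from datetime import date

def time_slice(records: list, cutoff: date) -> list:
    """Keep only benchmark items first published after the training cutoff."""
    return [r for r in records if r["published"] > cutoff]

corpus = [
    {"id": "bug-001", "published": date(2021, 6, 1)},
    {"id": "bug-002", "published": date(2024, 3, 15)},
    {"id": "bug-003", "published": date(2025, 1, 2)},
]
fresh = time_slice(corpus, cutoff=date(2023, 9, 1))  # leakage-safe subset
```

Publishing the cutoff alongside the subset makes the filtering auditable, which is the substance of the reporting recommendations in O5.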
As a systematic review, this study does not aim to empirically validate individual foundation model techniques. Instead, its contribution is validated through the consistent patterns and conclusions observed across 224 independent peer-reviewed studies, which collectively reveal robust trends, recurring challenges, and gaps in current practice. The detailed threats to validity discussed in the following section further clarify the boundaries and assumptions under which these synthesized findings should be interpreted.

8.3. Threats to Validity

Before discussing threats related to the reviewed studies, we also reflect on possible threats to the validity of our own review methodology. Although the review protocol was designed to be systematic and reproducible, it is subject to certain limitations. The search strings, inclusion criteria, and database coverage may not have captured all relevant studies, particularly very recent or non-indexed work. To reduce this risk, we searched five major databases (IEEE Xplore, ACM DL, ScienceDirect, SpringerLink, and arXiv), used both manual and automated searches, and performed backward and forward snowballing. Two authors independently screened a random subset of papers, compared inclusion decisions, and resolved disagreements by discussion. The mapping of each paper to its life-cycle phase and capability dimension followed pre-defined coding rules to minimize subjectivity.
We were unable to formally assess reporting bias because included studies differed widely in methodology and reporting. Potential publication bias is acknowledged as a limitation. We did not apply a certainty-of-evidence framework (e.g., GRADE) because such frameworks are designed for clinical studies and are not applicable to heterogeneous SE research.
Finally, while the analysis window (2023–2025) provides a recent snapshot, foundation model research evolves rapidly; future reviews should revisit these mappings as the field matures. Moreover, substantial pre-2023 work exists on machine learning and early large language models in software engineering. Preliminary scoping identified dozens of such studies, which were excluded to focus on post-ChatGPT foundation models. This limits historical coverage but does not affect the phase–capability patterns driven by recent model advances.
The following paragraphs address threats and mitigations specific to the FM-in-testing studies included in our review.
  • Construct validity:
What is measured may not reflect the intended construct (e.g., line/branch coverage vs. bug-finding power). Prompt/seed sensitivity further threatens construct validity [54,88]. Implications for RQs: Misaligned measures can distort our mapping of where FMs help (RQ1) and which capabilities appear most effective (RQ2), and they may over/understate strengths in testing tasks (RQ3), thereby skewing priorities in future directions (RQ4). Mitigation: report prompts/seeds/sampling policies; include mutation score and real-bug detection where feasible to better support RQ1–RQ3 evidence quality and RQ4 recommendations.
  • Internal validity:
Data leakage, flaky tests, and uncontrolled confounders (time budget, temperature, model version) can bias results [20,96]. Implications for RQs: Such biases can inflate or deflate observed effectiveness, affecting our phase/capability synthesis (RQ1–RQ2) and the reliability of testing takeaways (RQ3), which in turn influences our research agenda (RQ4). Mitigation: fix and report time budgets, temperature, and model versions; prefer leakage-audited corpora; and rerun flaky tests to confirm failures before attributing them to the technique.
  • External validity:
Results may not generalise across languages (C/Java/Python), frameworks, or industrial scale (UI flows, microservices). Security studies may not transfer across ecosystems [183,184,191]. Implications for RQs: Limited generalisability constrains how broadly we can answer where and how FMs are used (RQ1–RQ2) and how confidently testing insights translate to practice (RQ3); it also motivates calls for broader, realistic evaluations (RQ4). Mitigation: multi-language, multi-domain evaluations; report cost/latency and infra assumptions; include at least one industrial or large-scale subject when possible to improve the external relevance of RQ1–RQ3 and inform RQ4.
  • Conclusion validity:
Effect sizes can be small and sensitive to outliers; multiple comparisons inflate error. Implications for RQs: Weak statistical grounding threatens the stability of capability comparisons (RQ2) and the strength-of-evidence claims in testing (RQ3), which may misprioritise future work (RQ4). Mitigation: use nonparametric tests, correct for multiplicity, report confidence intervals, and release raw results for re-analysis to bolster the inferential basis for RQ2–RQ3 and calibrate RQ4.
  • Reproducibility:
Access limits (closed models, API churn) and missing artifacts (prompts, seeds, harnesses) impede replication. Implications for RQs: Fragile reproducibility undermines longitudinal comparability for phase/capability trends (RQ1–RQ2) and weakens confidence in testing results (RQ3), hindering cumulative progress toward the agenda (RQ4). Mitigation: release prompts/harnesses; pin model/version; include open-model baselines alongside proprietary ones to preserve the durability of RQ1–RQ3 findings and enable RQ4 execution.
Summary. Making construct, internal, external, conclusion, and reproducibility safeguards explicit improves the fidelity of our lifecycle mapping (RQ1), the credibility of capability distributions (RQ2), the reliability of testing-specific insights (RQ3), and the practicality of the proposed research agenda (RQ4).
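The reproducibility mitigations above (pin the model and version, publish prompts and seeds) can be captured in a small experiment manifest emitted alongside results. The keys below are an illustrative, non-standard schema, not an established reporting format.

```python
# Sketch of a replication manifest: serialize the settings a rerun would need.
# Key names and the example values are hypothetical.

import json

def make_manifest(model: str, version: str, seed: int,
                  temperature: float, prompt_file: str) -> str:
    """Serialize the settings a replication would need, as JSON."""
    return json.dumps({
        "model": model,
        "version": version,          # pin the exact checkpoint/API revision
        "seed": seed,
        "temperature": temperature,
        "prompt_file": prompt_file,  # prompts shipped as an artifact
    }, indent=2, sort_keys=True)

manifest = make_manifest("open-model-7b", "2025-06-rev1", seed=1234,
                         temperature=0.2, prompt_file="prompts/testgen_v3.txt")
```

Checking such a manifest into the replication package makes model churn visible instead of silently invalidating comparisons.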
  • Methodological Novelty.
In contrast to earlier reviews that mostly provide descriptive lists of applications, this work introduces a structured framework for examining how foundation models are used within software engineering. By linking software lifecycle stages with the model functions applied in each, the taxonomy allows analysis in both directions: how certain functions extend across phases, and how the characteristics of each phase influence their effectiveness. This perspective transforms the taxonomy from a static classification into a practical tool for discovering new connections and shaping evidence-based research questions.
The testing synthesis demonstrates the use of this framework through task family mapping, a transparent and repeatable method that identifies where current evidence is concentrated, where benchmarks overlap, and how methods evolve. Together, these components form a broader structure for assessing the maturity and adaptability of research involving foundation models across the software lifecycle. The result is a methodological contribution that goes beyond previous summary-based surveys and encourages a more systematic approach to future work.
  • Impact and Research Implications.
The taxonomy serves not only as a summary of current work but also as a guide for shaping future research on foundation models in software engineering. By showing where model functions intersect with different stages of the software lifecycle, it helps reveal areas that have received limited attention and can inform priorities for funding, benchmark development, and evaluation design. For instance, the lack of studies connecting requirements analysis or architectural reasoning with model use points to methodological directions that remain largely unexplored. Researchers can use the taxonomy to design hypothesis-based studies that examine how model functions evolve across lifecycle stages, while practitioners can rely on it to identify model types that best fit specific development contexts. In this way, the taxonomy shifts foundation model research in software engineering from isolated task-focused experiments toward cumulative, phase-aware investigation, establishing a more coherent path for future studies. To turn these analytical insights into practice, we extend the discussion with a framework that offers concrete steps for application.

8.4. Actionable Framework for FM Adoption in Software Engineering

The taxonomy presented in this study explains how the functions of foundation models relate to the main phases of software engineering (SE). To turn these insights into practical use, both practitioners and researchers need a clear path for applying them within real projects. Building on the bidirectional taxonomy and the testing synthesis, we introduce a Reference Adoption Workflow that translates these ideas into a step-by-step process for introducing and assessing foundation models in SE environments. This workflow builds on earlier surveys such as Hou et al. [19], which organized studies by phase or model type, and Wang et al. [20], which focused mainly on testing. Our approach outlines how organizations can plan, adapt, and evaluate the use of foundation models across the full software lifecycle in a consistent way.
(1)
Phase–Function Scoping.
Start by identifying the main SE phase of interest (for example, requirements analysis, design, or testing) and relate it to the most relevant model functions described in our taxonomy (for instance, reasoning, synthesis, or summarization). This helps ensure that model use begins with a clear match between the needs of each phase and what the model can provide, addressing the fragmentation noted in earlier descriptive studies [19].
(2)
Data and Representation Alignment.
Before any adaptation or prompt design, check whether the data linked to the chosen SE phase (for example, requirement documents, source code, or bug reports) are suitable for the selected function. When differences arise, create intermediate forms—such as structured prompts, reformatted text, or synthetic examples—to bridge the gap between the data and the intended task, as discussed in Section 5.
(3)
Integration and Iterative Evaluation.
Introduce the model in a human-in-the-loop setting and evaluate it repeatedly along three dimensions: (i) task accuracy, (ii) clarity and user control, and (iii) cost and efficiency. This process follows our findings from the testing synthesis, which showed that evaluation quality depends not only on accuracy measures but also on transparency and reproducibility. The workflow, therefore, encourages systematic and evidence-based evaluation.
(4)
Cross-Phase Feedback.
Use the two-way reasoning structure of the taxonomy to identify how outcomes from one phase (for example, test results) can inform later stages (such as maintenance or documentation). This exchange between phases turns separate model uses into a continuous learning cycle rather than isolated applications.
(5)
Institutionalization and Benchmarking.
When effective practices are established, formalize them as reusable templates, automated pipelines, or shared evaluation procedures. These outputs can be released as benchmark tasks or public datasets to promote replication and strengthen the evidence base, following the benchmarking and maturity goals discussed in prior studies [19,20] and our testing synthesis. Over time, this supports consistent tracking of progress and transferability within research on foundation models for SE.
This reference workflow transforms the taxonomy from a static classification into a repeatable adoption process that helps teams plan, apply, and review the use of foundation models throughout the software lifecycle. It provides both a conceptual guide and a practical structure for future studies, helping the field move from descriptive mapping toward tested and evidence-based integration.
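As a sketch of how step (1), Phase–Function Scoping, might be operationalized, a team could encode the taxonomy as a lookup from SE phase to candidate model functions. The mapping below is an illustrative reading of the taxonomy's framing, not the paper's full phase–capability matrix.

```python
# Hypothetical encoding of phase-function scoping: look up candidate FM
# functions for a given SE phase. Mapping contents are illustrative only.

PHASE_FUNCTIONS = {
    "requirements": ["summarization", "classification", "reasoning"],
    "design":       ["reasoning", "summarization"],
    "testing":      ["generation", "repair", "fault_localization"],
    "maintenance":  ["repair", "summarization", "triage"],
}

def scope(phase: str) -> list:
    """Return candidate FM functions for a phase, or an empty list if unmapped."""
    return PHASE_FUNCTIONS.get(phase.lower(), [])

candidates = scope("Testing")
```

Even this trivial encoding forces the match between phase needs and model functions to be explicit before any prompting or adaptation work begins, which is the point of step (1).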

9. Conclusions

This review mapped how large pretrained models are being used across the SE lifecycle and offered a closer look at their role in software testing. We introduced a two-dimensional taxonomy that links the lifecycle phase in which work is performed to the specific capability exercised. Applying this lens to 224 papers shows a clear concentration in implementation and testing activities, with smaller but growing use in requirements, design, and project practices. Our testing-focused analysis organized the literature into eight task families, summarized common methods and datasets, and highlighted where evidence is strongest and where it remains thin.
Taken together, the findings answer our research questions as follows: Current applications cover all phases of the lifecycle but are most mature in code-centric work (generation, testing, repair) and quality assurance. The capabilities that appear most often are code and test generation, defect detection, fault localization, and summarization/triage. Approaches that combine model outputs with execution traces, formal properties, or static analysis tend to be more reliable than one-shot prompting. The main limitations arise from sensitivity to inputs, uneven correctness, benchmark leakage, and the effort required to ground tests in large systems.
Looking forward, several promising directions can advance both practice and research. Adding structure before generation (e.g., slicing and seed cases), retrieving local project artifacts to ground decisions, keeping execution or verification in the loop, adapting models narrowly to each task, and adopting transparent, leakage-aware evaluation emerge as key priorities. For practitioners, these directions translate into actionable workflows: structure prompts rather than relying on ad hoc queries; integrate retrieval from project artifacts such as tests, logs, or requirements; and embed execution or verification directly in IDEs and CI pipelines. This ensures that foundation models serve as reliable assistants that enhance coverage and productivity while respecting cost, governance, and trust constraints. For researchers, the same directions motivate the development of leakage-aware benchmarks, evaluation protocols that go beyond one-shot correctness, and studies of human–AI collaboration in realistic settings. Lightweight task-specific adaptation and longitudinal investigations of developer outcomes are especially important to build cumulative and transferable knowledge. Collectively, these directions encourage a transition from isolated FM experimentation toward reproducible, lifecycle-aware engineering of AI-assisted software systems.
Our coverage reflects the period and sources we searched, and the taxonomy can be updated as new work appears. In summary, the evidence suggests that foundation models are already reshaping SE, particularly in testing, but their full potential depends on structured generation, grounded evaluation, and human-centered integration. By separating practitioner-oriented guidance from research challenges, this review provides a roadmap that is both immediately actionable in industry and generative of new scholarly inquiry.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/info17010073/s1. PRISMA 2020 Checklist. Reference [148] is cited in the supplementary materials.

Author Contributions

Conceptualization, S.B.; data curation, S.B., M.D., H.A. and M.A.; formal analysis, S.B., M.D., H.A. and M.A.; investigation, S.B., M.D., H.A. and M.A.; writing—original draft preparation, S.B., M.D., H.A. and M.A.; writing—review and editing, S.B., M.D., H.A. and M.A.; visualization, S.B., M.D., H.A. and M.A.; supervision, S.B.; project administration, S.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was not externally funded.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are derived from publicly available research articles included in the review. The extracted and coded data used for analysis are available from the corresponding author upon request.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive feedback, which helped improve the clarity and rigor of this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the opportunities and risks of foundation models. arXiv 2021, arXiv:2108.07258. [Google Scholar] [CrossRef]
  2. Sauvola, J.; Tarkoma, S.; Klemettinen, M.; Riekki, J.; Doermann, D. Future of software development with generative AI. Autom. Softw. Eng. 2024, 31, 26. [Google Scholar] [CrossRef]
  3. Liu, Y.; Lo, S.K.; Lu, Q.; Zhu, L.; Zhao, D.; Xu, X.; Harrer, S.; Whittle, J. Agent design pattern catalogue: A collection of architectural patterns for foundation model based agents. J. Syst. Softw. 2025, 220, 112278. [Google Scholar] [CrossRef]
  4. Austin, J.; Odena, A.; Nye, M.; Bosma, M.; Michalewski, H.; Dohan, D.; Jiang, E.; Cai, C.; Terry, M.; Le, Q.; et al. Program synthesis with large language models. arXiv 2021, arXiv:2108.07732. [Google Scholar] [CrossRef]
  5. Yan, D.; Gao, Z.; Liu, Z. A closer look at different difficulty levels code generation abilities of chatgpt. In Proceedings of the 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 11–15 September 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1887–1898. [Google Scholar]
  6. Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H.P.D.O.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating large language models trained on code. arXiv 2021, arXiv:2107.03374. [Google Scholar] [CrossRef]
  7. Liang, J.T.; Badea, C.; Bird, C.; DeLine, R.; Ford, D.; Forsgren, N.; Zimmermann, T. Can gpt-4 replicate empirical software engineering research? Proc. ACM Softw. Eng. 2024, 1, 1330–1353. [Google Scholar] [CrossRef]
  8. Li, R.; Allal, L.B.; Zi, Y.; Muennighoff, N.; Kocetkov, D.; Mou, C.; Marone, M.; Akiki, C.; Li, J.; Chim, J.; et al. Starcoder: May the source be with you! arXiv 2023, arXiv:2305.06161. [Google Scholar] [CrossRef]
  9. Roziere, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Sauvestre, R.; Remez, T.; et al. Code llama: Open foundation models for code. arXiv 2023, arXiv:2308.12950. [Google Scholar]
  10. Wang, W.; Wang, Y.; Joty, S.; Hoi, S.C. Rap-gen: Retrieval-augmented patch generation with codet5 for automatic program repair. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, San Francisco, CA, USA, 5–7 December 2023; pp. 146–158. [Google Scholar]
  11. Li, J.; Tao, C.; Li, J.; Li, G.; Jin, Z.; Zhang, H.; Fang, Z.; Liu, F. Large language model-aware in-context learning for code generation. ACM Trans. Softw. Eng. Methodol. 2023, 34, 190. [Google Scholar] [CrossRef]
  12. Xu, W.; Gao, K.; He, H.; Zhou, M. Licoeval: Evaluating llms on license compliance in code generation. arXiv 2024, arXiv:2408.02487. [Google Scholar]
  13. Bui, T.D.; Vu, T.T.; Nguyen, T.T.; Nguyen, S.; Vo, H.D. Correctness assessment of code generated by Large Language Models using internal representations. J. Syst. Softw. 2025, 230, 112570. [Google Scholar] [CrossRef]
  14. Banh, L.; Holldack, F.; Strobel, G. Copiloting the future: How generative AI transforms Software Engineering. Inf. Softw. Technol. 2025, 183, 107751. [Google Scholar] [CrossRef]
  15. Da Silva, L.; Samhi, J.; Khomh, F. LLMs and Stack Overflow discussions: Reliability, impact, and challenges. J. Syst. Softw. 2025, 230, 112541. [Google Scholar] [CrossRef]
  16. Yang, G.; Zhou, Y.; Chen, X.; Zhang, X.; Zhuo, T.Y.; Chen, T. Chain-of-thought in neural code generation: From and for lightweight language models. IEEE Trans. Softw. Eng. 2024, 50, 2437–2457. [Google Scholar] [CrossRef]
  17. Alagarsamy, S.; Tantithamthavorn, C.; Takerngsaksiri, W.; Arora, C.; Aleti, A. Enhancing large language models for text-to-testcase generation. J. Syst. Softw. 2025, 230, 112531. [Google Scholar] [CrossRef]
  18. Basha, M.; Rodríguez-Pérez, G. Trust, transparency, and adoption in generative AI for software engineering: Insights from Twitter discourse. Inf. Softw. Technol. 2025, 186, 107804. [Google Scholar] [CrossRef]
  19. Hou, X.; Zhao, Y.; Liu, Y.; Yang, Z.; Wang, K.; Li, L.; Luo, X.; Lo, D.; Grundy, J.; Wang, H. Large language models for software engineering: A systematic literature review. ACM Trans. Softw. Eng. Methodol. 2024, 33, 220. [Google Scholar] [CrossRef]
  20. Wang, J.; Huang, Y.; Chen, C.; Liu, Z.; Wang, S.; Wang, Q. Software testing with large language models: Survey, landscape, and vision. IEEE Trans. Softw. Eng. 2024, 50, 911–936. [Google Scholar] [CrossRef]
  21. Hemmat, A.; Sharbaf, M.; Kolahdouz-Rahimi, S.; Lano, K.; Tehrani, S.Y. Research directions for using LLM in software requirement engineering: A systematic review. Front. Comput. Sci. 2025, 7, 1519437. [Google Scholar] [CrossRef]
  22. Rasnayaka, S.; Wang, G.; Shariffdeen, R.; Iyer, G.N. An empirical study on usage and perceptions of llms in a software engineering project. In Proceedings of the 1st International Workshop on Large Language Models for Code, Lisbon, Portugal, 20 April 2024; pp. 111–118. [Google Scholar]
  23. Wei, B. Requirements are all you need: From requirements to code with llms. In Proceedings of the 2024 IEEE 32nd International Requirements Engineering Conference (RE), Reykjavik, Iceland, 24–28 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 416–422. [Google Scholar]
24. Lubos, S.; Felfernig, A.; Tran, T.N.T.; Garber, D.; El Mansi, M.; Erdeniz, S.P.; Le, V.M. Leveraging LLMs for the quality assurance of software requirements. In Proceedings of the 2024 IEEE 32nd International Requirements Engineering Conference (RE), Reykjavik, Iceland, 24–28 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 389–397. [Google Scholar]
25. Krishna, M.; Gaur, B.; Verma, A.; Jalote, P. Using LLMs in software requirements specifications: An empirical evaluation. In Proceedings of the 2024 IEEE 32nd International Requirements Engineering Conference (RE), Reykjavik, Iceland, 24–28 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 475–483. [Google Scholar]
  26. Feng, N.; Marsso, L.; Yaman, S.G.; Standen, I.; Baatartogtokh, Y.; Ayad, R.; De Mello, V.O.; Townsend, B.; Bartels, H.; Cavalcanti, A.; et al. Normative requirements operationalization with large language models. In Proceedings of the 2024 IEEE 32nd International Requirements Engineering Conference (RE), Reykjavik, Iceland, 24–28 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 129–141. [Google Scholar]
27. Mu, F.; Shi, L.; Wang, S.; Yu, Z.; Zhang, B.; Wang, C.; Liu, S.; Wang, Q. ClarifyGPT: A framework for enhancing LLM-based code generation via requirements clarification. Proc. ACM Softw. Eng. 2024, 1, 2332–2354. [Google Scholar] [CrossRef]
  28. Ferrari, A.; Spoletini, P. Formal requirements engineering and large language models: A two-way roadmap. Inf. Softw. Technol. 2025, 181, 107697. [Google Scholar] [CrossRef]
  29. Dong, Y.; Kong, L.; Zhang, L.; Wang, S.; Liu, X.; Liu, S.; Chen, M. A search-and-fill strategy to code generation for complex software requirements. Inf. Softw. Technol. 2025, 177, 107584. [Google Scholar] [CrossRef]
  30. Hassani, S.; Sabetzadeh, M.; Amyot, D. An empirical study on LLM-based classification of requirements-related provisions in food-safety regulations. Empir. Softw. Eng. 2025, 30, 72. [Google Scholar] [CrossRef]
  31. Odu, O.; Belle, A.B.; Wang, S.; Kpodjedo, S.; Lethbridge, T.C.; Hemmati, H. Automatic instantiation of assurance cases from patterns using large language models. J. Syst. Softw. 2025, 222, 112353. [Google Scholar] [CrossRef]
  32. Maranhão, J.J.; Guerra, E.M. A prompt pattern sequence approach to apply generative AI in assisting software architecture decision-making. In Proceedings of the 29th European Conference on Pattern Languages of Programs, People, and Practices, Irsee, Germany, 3–7 July 2024; pp. 1–12. [Google Scholar]
  33. Zhao, J.; Yang, Z.; Zhang, L.; Lian, X.; Yang, D.; Tan, X. DRMiner: Extracting latent design rationale from Jira issue logs. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 468–480. [Google Scholar]
  34. Ahlgren, T.L.; Sunde, H.F.; Kemell, K.K.; Nguyen-Duc, A. Assisting early-stage software startups with LLMs: Effective prompt engineering and system instruction design. Inf. Softw. Technol. 2025, 187, 107832. [Google Scholar] [CrossRef]
  35. Cordeiro, J.; Noei, S.; Zou, Y. An empirical study on the code refactoring capability of large language models. arXiv 2024, arXiv:2411.02320. [Google Scholar] [CrossRef]
  36. Ishizue, R.; Sakamoto, K.; Washizaki, H.; Fukazawa, Y. Improved program repair methods using refactoring with GPT models. In Proceedings of the 55th ACM Technical Symposium on Computer Science Education, Portland, OR, USA, 20–23 March 2024; Volume 1, pp. 569–575. [Google Scholar]
37. Pomian, D.; Bellur, A.; Dilhara, M.; Kurbatova, Z.; Bogomolov, E.; Sokolov, A.; Bryksin, T.; Dig, D. EM-Assist: Safe automated extract method refactoring with LLMs. In Proceedings of the Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, Porto de Galinhas, Brazil, 15–19 July 2024; pp. 582–586. [Google Scholar]
38. Wu, D.; Mu, F.; Shi, L.; Guo, Z.; Liu, K.; Zhuang, W.; Zhong, Y.; Zhang, L. iSMELL: Assembling LLMs with expert toolsets for code smell detection and refactoring. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1345–1357. [Google Scholar]
  39. Xu, K.; Zhang, G.L.; Yin, X.; Zhuo, C.; Schlichtmann, U.; Li, B. HLSRewriter: Efficient Refactoring and Optimization of C/C++ Code with LLMs for High-Level Synthesis. ACM Trans. Des. Autom. Electron. Syst. 2025. [Google Scholar] [CrossRef]
  40. Zhao, J.; Song, Y.; Cohen, E. Variational Prefix Tuning for diverse and accurate code summarization using pre-trained language models. J. Syst. Softw. 2025, 229, 112493. [Google Scholar] [CrossRef]
  41. Zubair, F.; Al-Hitmi, M.; Catal, C. The use of large language models for program repair. Comput. Stand. Interfaces 2025, 93, 103951. [Google Scholar] [CrossRef]
42. Jin, M.; Shahriar, S.; Tufano, M.; Shi, X.; Lu, S.; Sundaresan, N.; Svyatkovskiy, A. InferFix: End-to-end program repair with LLMs. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, San Francisco, CA, USA, 5–7 December 2023; pp. 1646–1656. [Google Scholar]
  43. Luo, W.; Keung, J.; Yang, B.; Ye, H.; Le Goues, C.; Bissyande, T.F.; Tian, H.; Le, X.B.D. When Fine-Tuning LLMs Meets Data Privacy: An Empirical Study of Federated Learning in LLM-Based Program Repair. ACM Trans. Softw. Eng. Methodol. 2024. [Google Scholar] [CrossRef]
44. Li, H.; Hao, Y.; Zhai, Y.; Qian, Z. Enhancing static analysis for practical bug detection: An LLM-integrated approach. Proc. ACM Program. Lang. 2024, 8, 474–499. [Google Scholar] [CrossRef]
  45. Guan, H.; Bai, G.; Liu, Y. CrossProbe: LLM-Empowered Cross-Project Bug Detection for Deep Learning Frameworks. Proc. ACM Softw. Eng. 2025, 2, 2430–2452. [Google Scholar] [CrossRef]
  46. Huang, K.; Meng, X.; Zhang, J.; Liu, Y.; Wang, W.; Li, S.; Zhang, Y. An empirical study on fine-tuning large language models of code for automated program repair. In Proceedings of the 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 11–15 September 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1162–1174. [Google Scholar]
  47. Huang, K.; Zhang, J.; Meng, X.; Liu, Y. Template-guided program repair in the era of large language models. In Proceedings of the 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), Ottawa, ON, Canada, 27 April–3 May 2025; IEEE Computer Society: Washington, DC, USA, 2025; pp. 367–379. [Google Scholar]
  48. Li, G.; Zhi, C.; Chen, J.; Han, J.; Deng, S. Exploring parameter-efficient fine-tuning of large language model on automated program repair. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 719–731. [Google Scholar]
  49. Kong, J.; Xie, X.; Liu, S. Demystifying Memorization in LLM-Based Program Repair via a General Hypothesis Testing Framework. Proc. ACM Softw. Eng. 2025, 2, 2712–2734. [Google Scholar] [CrossRef]
50. Lajkó, M.; Csuvik, V.; Gyimothy, T.; Vidács, L. Automated program repair with the GPT family, including GPT-2, GPT-3 and Codex. In Proceedings of the 5th ACM/IEEE International Workshop on Automated Program Repair, Lisbon, Portugal, 20 April 2024; pp. 34–41. [Google Scholar]
51. Xiao, J.; Xu, Z.; Chen, S.; Lei, G.; Fan, G.; Cao, Y.; Deng, S.; Feng, Z. ConFix: Combining node-level fix templates and masked language model for automatic program repair. J. Syst. Softw. 2024, 216, 112116. [Google Scholar] [CrossRef]
  52. Zhang, Y.; Jin, Z.; Xing, Y.; Li, G.; Liu, F.; Zhu, J.; Dou, W.; Wei, J. PATCH: Empowering Large Language Model with Programmer-Intent Guidance and Collaborative-Behavior Simulation for Automatic Bug Fixing. ACM Trans. Softw. Eng. Methodol. 2025, 35, 3. [Google Scholar] [CrossRef]
  53. Shivashankar, K.; Orucevic, M.; Kruke, M.M.; Martini, A. BEACon-TD: Classifying Technical Debt and its types across diverse software projects issues using transformers. J. Syst. Softw. 2025, 226, 112435. [Google Scholar] [CrossRef]
  54. Ouédraogo, W.C.; Kaboré, K.; Li, Y.; Tian, H.; Koyuncu, A.; Klein, J.; Lo, D.; Bissyandé, T.F. Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation. arXiv 2024, arXiv:2407.00225. [Google Scholar] [CrossRef]
  55. Bose, D.B. From Prompts to Properties: Rethinking LLM Code Generation with Property-Based Testing. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, Trondheim, Norway, 23–28 June 2025; pp. 1660–1665. [Google Scholar]
56. Huang, D.; Zhang, J.M.; Bu, Q.; Xie, X.; Chen, J.; Cui, H. Bias testing and mitigation in LLM-based code generation. ACM Trans. Softw. Eng. Methodol. 2024, 35, 5. [Google Scholar] [CrossRef]
57. Boukhlif, M.; Kharmoum, N.; Hanine, M. LLMs for intelligent software testing: A comparative study. In Proceedings of the 7th International Conference on Networking, Intelligent Systems and Security, Meknes, Morocco, 18–19 April 2024; pp. 1–8. [Google Scholar]
  58. Liao, Y.; Zhang, J.; Keung, J.; Xiao, Y.; Dai, Y. Advancing autonomous driving system testing: Demands, challenges, and future directions. Inf. Softw. Technol. 2025, 187, 107859. [Google Scholar] [CrossRef]
  59. Dakhel, A.M.; Nikanjam, A.; Majdinasab, V.; Khomh, F.; Desmarais, M.C. Effective test generation using pre-trained Large Language Models and mutation testing. Inf. Softw. Technol. 2024, 171, 107468. [Google Scholar] [CrossRef]
  60. Ihalage, A.; Taheri, S.; Muhammad, F.; Al-Raweshidy, H. Convolutional Versus Large Language Models for Software Log Classification in Edge-Deployable Cellular Network Testing. IEEE Access 2025, 13, 134283–134296. [Google Scholar] [CrossRef]
  61. Zhang, Y.; Chen, T.Y.; Pike, M.; Towey, D.; Ying, Z.; Zhou, Z.Q. Enhancing autonomous driving simulations: A hybrid metamorphic testing framework with metamorphic relations generated by GPT. Inf. Softw. Technol. 2025, 187, 107828. [Google Scholar] [CrossRef]
  62. Altin, M.; Mutlu, B.; Kilinc, D.; Cakir, A. Automated Testing for Service-Oriented Architecture: Leveraging Large Language Models for Enhanced Service Composition. IEEE Access 2025, 13, 89627–89640. [Google Scholar] [CrossRef]
  63. De Siano, G.D.; Fasolino, A.R.; Sperlí, G.; Vignali, A. Translating code with Large Language Models and human-in-the-loop feedback. Inf. Softw. Technol. 2025, 186, 107785. [Google Scholar] [CrossRef]
  64. Sasaki, Y.; Washizaki, H.; Li, J.; Sander, D.; Yoshioka, N.; Fukazawa, Y. Systematic literature review of prompt engineering patterns in software engineering. In Proceedings of the 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), Osaka, Japan, 2–4 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 670–675. [Google Scholar]
  65. Felizardo, K.R.; Steinmacher, I.; Lima, M.S.; Deizepe, A.; Conte, T.U.; Barcellos, M.P. Data extraction for systematic mapping study using a large language model-a proof-of-concept study in software engineering. In Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, Barcelona, Spain, 24–25 October 2024; pp. 407–413. [Google Scholar]
  66. Khan, Z.U.; Nasim, B.; Rasheed, Z. Generative AI-based predictive maintenance in aviation: A systematic literature review. Ceas Aeronaut. J. 2025, 16, 537–555. [Google Scholar] [CrossRef]
  67. Garcia, M.B. Teaching and learning computer programming using ChatGPT: A rapid review of literature amid the rise of generative AI technologies. Educ. Inf. Technol. 2025, 30, 16721–16745. [Google Scholar] [CrossRef]
68. Lin, F.; Kim, D.J. SOEN-101: Code generation by emulating software process models using large language model agents. arXiv 2024, arXiv:2403.15852. [Google Scholar]
  69. Li, X.; Yuan, S.; Gu, X.; Chen, Y.; Shen, B. Few-shot code translation via task-adapted prompt learning. J. Syst. Softw. 2024, 212, 112002. [Google Scholar] [CrossRef]
  70. Pornprasit, C.; Tantithamthavorn, C. Fine-tuning and prompt engineering for large language models-based code review automation. Inf. Softw. Technol. 2024, 175, 107523. [Google Scholar] [CrossRef]
  71. Yang, Z.; Keung, J.W.; Sun, Z.; Zhao, Y.; Li, G.; Jin, Z.; Liu, S.; Li, Y. Improving domain-specific neural code generation with few-shot meta-learning. Inf. Softw. Technol. 2024, 166, 107365. [Google Scholar] [CrossRef]
  72. Yun, S.; Lin, S.; Gu, X.; Shen, B. Project-specific code summarization with in-context learning. J. Syst. Softw. 2024, 216, 112149. [Google Scholar] [CrossRef]
  73. Eagal, A.; Stolee, K.T.; Ore, J.P. Analyzing the dependability of Large Language Models for code clone generation. J. Syst. Softw. 2025, 230, 112548. [Google Scholar] [CrossRef]
  74. Pan, Y.; Lyu, C.; Yang, Z.; Li, L.; Liu, Q.; Shao, X. E-code: Mastering efficient code generation through pretrained models and expert encoder group. Inf. Softw. Technol. 2025, 178, 107602. [Google Scholar] [CrossRef]
  75. Liu, J.; Li, C.; Chen, R.; Li, S.; Gu, B.; Yang, M. STRUT: Structured Seed Case Guided Unit Test Generation for C Programs using LLMs. Proc. ACM Softw. Eng. 2025, 2, 2113–2135. [Google Scholar] [CrossRef]
76. Wang, Z.; Liu, K.; Li, G.; Jin, Z. HITS: High-coverage LLM-based unit test generation via method slicing. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1258–1268. [Google Scholar]
  77. Su, C.Y.; Bansal, A.; Huang, Y.; Li, T.J.J.; McMillan, C. Context-aware code summary generation. J. Syst. Softw. 2025, 231, 112580. [Google Scholar] [CrossRef]
  78. Kim, D.K.; Ming, H. Assessing output reliability and similarity of large language models in software development: A comparative case study approach. Inf. Softw. Technol. 2025, 185, 107787. [Google Scholar] [CrossRef]
  79. Zhang, Z.; Wang, C.; Wang, Y.; Shi, E.; Ma, Y.; Zhong, W.; Chen, J.; Mao, M.; Zheng, Z. Llm hallucinations in practical code generation: Phenomena, mechanism, and mitigation. Proc. ACM Softw. Eng. 2025, 2, 481–503. [Google Scholar] [CrossRef]
  80. Khanshan, A.; Van Gorp, P.; Markopoulos, P. Evaluation of Code Generation for Simulating Participant Behavior in Experience Sampling Method by Iterative In-Context Learning of a Large Language Model. Proc. ACM Hum.-Comput. Interact. 2024, 8, 255. [Google Scholar] [CrossRef]
  81. Wang, J.; Liu, S.; Xie, X.; Li, Y. An empirical study to evaluate AIGC detectors on code content. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 844–856. [Google Scholar]
  82. Firouzi, E.; Ghafari, M. Time to separate from StackOverflow and match with ChatGPT for encryption. J. Syst. Softw. 2024, 216, 112135. [Google Scholar] [CrossRef]
  83. Qu, Y.; Huang, S.; Chen, X.; Bai, T.; Yao, Y. An input-denoising-based defense against stealthy backdoor attacks in large language models for code. Inf. Softw. Technol. 2025, 180, 107661. [Google Scholar] [CrossRef]
84. Moumoula, M.B.; Kabore, A.K.; Klein, J.; Bissyande, T.F. Cross-lingual Code Clone Detection: When LLMs Fail Short Against Embedding-based Classifier. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; ACM: New York, NY, USA, 2024; pp. 2474–2475. [Google Scholar] [CrossRef]
  85. Durán, F.; Martinez, M.; Lago, P.; Martínez-Fernández, S. Insights into resource utilization of code small language models serving with runtime engines and execution providers. J. Syst. Softw. 2025, 230, 112574. [Google Scholar] [CrossRef]
  86. Voria, G.; Casillo, F.; Gravino, C.; Catolino, G.; Palomba, F. RECOVER: Toward Requirements Generation from Stakeholders’ Conversations. IEEE Trans. Softw. Eng. 2025, 51, 1912–1933. [Google Scholar] [CrossRef]
  87. Nikolakopoulos, A.; Litke, A.; Psychas, A.; Veroni, E.; Varvarigou, T. Exploring the potential of offline llms in data science: A study on code generation for data analysis. IEEE Access 2025, 13, 64087–64114. [Google Scholar] [CrossRef]
  88. Schäfer, M.; Nadi, S.; Eghbali, A.; Tip, F. An empirical evaluation of using large language models for automated unit test generation. IEEE Trans. Softw. Eng. 2023, 50, 85–105. [Google Scholar] [CrossRef]
89. Tang, Y.; Liu, Z.; Zhou, Z.; Luo, X. ChatGPT vs SBST: A comparative assessment of unit test suite generation. IEEE Trans. Softw. Eng. 2024, 50, 1340–1359. [Google Scholar] [CrossRef]
  90. Rahman, S.; Kuhar, S.; Cirisci, B.; Garg, P.; Wang, S.; Ma, X.; Deoras, A.; Ray, B. UTFix: Change aware unit test repairing using LLM. Proc. ACM Program. Lang. 2025, 9, 143–168. [Google Scholar] [CrossRef]
  91. Ardimento, P.; Capuzzimati, M.; Casalino, G.; Schicchi, D.; Taibi, D. A novel LLM-based classifier for predicting bug-fixing time in Bug Tracking Systems. J. Syst. Softw. 2025, 230, 112569. [Google Scholar] [CrossRef]
  92. Nguyen, T.T.; Vu, T.T.; Vo, H.D.; Nguyen, S. An empirical study on capability of Large Language Models in understanding code semantics. Inf. Softw. Technol. 2025, 185, 107780. [Google Scholar] [CrossRef]
  93. Cotroneo, D.; Foggia, A.; Improta, C.; Liguori, P.; Natella, R. Automating the correctness assessment of AI-generated code for security contexts. J. Syst. Softw. 2024, 216, 112113. [Google Scholar] [CrossRef]
  94. Moumoula, M.B.; Kaboré, A.K.; Klein, J.; Bissyandé, T.F. The Struggles of LLMs in Cross-Lingual Code Clone Detection. Proc. ACM Softw. Eng. 2025, 2, 1023–1045. [Google Scholar] [CrossRef]
  95. Dil, C.; Chen, H.; Damevski, K. Towards higher quality software vulnerability data using LLM-based patch filtering. J. Syst. Softw. 2025, 230, 112581. [Google Scholar] [CrossRef]
96. Wu, Y.; Li, Z.; Zhang, J.M.; Liu, Y. ConDefects: A complementary dataset to address the data leakage concern for LLM-based fault localization and program repair. In Proceedings of the Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, Porto de Galinhas, Brazil, 15–19 July 2024; pp. 642–646. [Google Scholar]
97. Wang, R.; Guo, J.; Gao, C.; Fan, G.; Chong, C.Y.; Xia, X. Can LLMs replace human evaluators? An empirical study of LLM-as-a-judge in software engineering. Proc. ACM Softw. Eng. 2025, 2, 1955–1977. [Google Scholar] [CrossRef]
  98. Kalouptsoglou, I.; Siavvas, M.; Ampatzoglou, A.; Kehagias, D.; Chatzigeorgiou, A. Transfer learning for software vulnerability prediction using Transformer models. J. Syst. Softw. 2025, 227, 112448. [Google Scholar] [CrossRef]
  99. Xiong, H.; Yang, Y.; Wu, H.; Zhong, X.; Tang, Y.; Xia, Z.; Wang, X.; Yan, J. Reinvent the Operation not the Architecture: Quantum-inspired High-order Product for Compatible and Improved LLMs Training. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Toronto, ON, Canada, 3–7 August 2025; Volume 2, pp. 3356–3365. [Google Scholar]
  100. Tu, H.; Zhou, Z.; Jiang, H.; Yusuf, I.N.B.; Li, Y.; Jiang, L. Isolating compiler bugs by generating effective witness programs with large language models. IEEE Trans. Softw. Eng. 2024, 50, 1768–1788. [Google Scholar] [CrossRef]
  101. Ge, C.; Wang, T.; Yang, X.; Treude, C. Cross-Level Requirements Tracing Based on Large Language Models. IEEE Trans. Softw. Eng. 2025, 51, 2044–2066. [Google Scholar] [CrossRef]
  102. Fazelnia, M.; Mirakhorli, M.; Bagheri, H. Translation titans, reasoning challenges: Satisfiability-aided language models for detecting conflicting requirements. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 2294–2298. [Google Scholar]
  103. Hassani, S. Enhancing legal compliance and regulation analysis with large language models. In Proceedings of the 2024 IEEE 32nd International Requirements Engineering Conference (RE), Reykjavik, Iceland, 24–28 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 507–511. [Google Scholar]
104. Wu, J.J.; Fard, F.H. HumanEvalComm: Benchmarking the communication competence of code generation for LLMs and LLM agent. arXiv 2024, arXiv:2406.00215. [Google Scholar] [CrossRef]
  105. Tagliaferro, A.; Corboe, S.; Guindani, B. Leveraging LLMs to Automate Software Architecture Design from Informal Specifications. In Proceedings of the 2025 IEEE 22nd International Conference on Software Architecture Companion (ICSA-C), Odense, Denmark, 31 March–4 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 291–299. [Google Scholar]
  106. Duarte, C.E. Automated Microservice Pattern Instance Detection Using Infrastructure-as-Code Artifacts and Large Language Models. In Proceedings of the 2025 IEEE 22nd International Conference on Software Architecture Companion (ICSA-C), Odense, Denmark, 31 March–4 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 161–166. [Google Scholar]
  107. Ou, Y.; Su, C.; Chen, L.; Li, Y.; Zhou, Y. Binding of C++ and JavaScript through automated glue code generation. J. Syst. Softw. 2025, 230, 112565. [Google Scholar] [CrossRef]
108. Guo, L.; Wang, Y.; Shi, E.; Zhong, W.; Zhang, H.; Chen, J.; Zhang, R.; Ma, Y.; Zheng, Z. When to stop? Towards efficient code generation in LLMs with excess token prevention. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, Vienna, Austria, 16–20 September 2024; pp. 1073–1085. [Google Scholar]
  109. Yu, Z.; Li, C.; Zhang, Y.; Liu, M.; Pinckney, N.; Zhou, W.; Yang, H.; Liang, R.; Ren, H.; Lin, Y.C. LLM4HWDesign Contest: Constructing a Comprehensive Dataset for LLM-Assisted Hardware Code Generation with Community Efforts. In Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, New York, NY, USA, 1 October–1 November 2024; pp. 1–5. [Google Scholar]
110. Birillo, A.; Artser, E.; Potriasaeva, A.; Vlasov, I.; Dzialets, K.; Golubev, Y.; Gerasimov, I.; Keuning, H.; Bryksin, T. One step at a time: Combining LLMs and static analysis to generate next-step hints for programming tasks. In Proceedings of the 24th Koli Calling International Conference on Computing Education Research, Koli, Finland, 12–17 November 2024; pp. 1–12. [Google Scholar]
  111. Almanasra, S.; Suwais, K. Analysis of ChatGPT-generated codes across multiple programming languages. IEEE Access 2025, 13, 23580–23596. [Google Scholar] [CrossRef]
112. Luo, Y.; Yu, R.; Zhang, F.; Liang, L.; Xiong, Y. Bridging gaps in LLM code translation: Reducing errors with call graphs and bridged debuggers. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 2448–2449. [Google Scholar]
  113. Imran, M.M.; Chatterjee, P.; Damevski, K. Shedding light on software engineering-specific metaphors and idioms. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, Portugal, 14–20 April 2024; pp. 1–13. [Google Scholar]
  114. Yu, L.; Huang, Z.; Yuan, H.; Cheng, S.; Yang, L.; Zhang, F.; Shen, C.; Ma, J.; Zhang, J.; Lu, J.; et al. Smart-LLaMA-DPO: Reinforced Large Language Model for Explainable Smart Contract Vulnerability Detection. Proc. ACM Softw. Eng. 2025, 2, 182–205. [Google Scholar] [CrossRef]
  115. Zhu, Y.; Yu, S.; Zong, Z.; Wang, Y.; Zhao, Y.; Chen, Z. Text-image fusion template for large language model assisted crowdsourcing test aggregation. J. Syst. Softw. 2025, 228, 112478. [Google Scholar] [CrossRef]
  116. Bukhary, N.; Ahmad, M.; Rashad, K.; Rai, S.; Shapsough, S.; Kaddoura, Y.; Dghaym, D.; Zualkernan, I. Few-Shot Evaluation of Vision Language Models for Detecting Visual Defects in Autonomous Vehicle Software Requirement Specifications. IEEE Access 2025, 13, 117914–117942. [Google Scholar] [CrossRef]
  117. Xiang, B.; Shao, Y. SUMLLAMA: Efficient Contrastive Representations and Fine-Tuned Adapters for Bug Report Summarization. IEEE Access 2024, 12, 78562–78571. [Google Scholar] [CrossRef]
  118. Sun, T.; Xu, J.; Li, Y.; Yan, Z.; Zhang, G.; Xie, L.; Geng, L.; Wang, Z.; Chen, Y.; Lin, Q.; et al. Bitsai-cr: Automated code review via llm in practice. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, Trondheim, Norway, 23–28 June 2025; pp. 274–285. [Google Scholar]
  119. Li, Y.; Liu, B.; Zhang, T.; Wang, Z.; Lo, D.; Yang, L.; Lyu, J.; Zhang, H. A Knowledge Enhanced Large Language Model for Bug Localization. Proc. ACM Softw. Eng. 2025, 2, 1914–1936. [Google Scholar] [CrossRef]
  120. Boi, B.; Esposito, C.; Lee, S. Smart contract vulnerability detection: The role of large language model (llm). ACM SIGAPP Appl. Comput. Rev. 2024, 24, 19–29. [Google Scholar] [CrossRef]
  121. Kessel, M.; Atkinson, C. Promoting open science in test-driven software experiments. J. Syst. Softw. 2024, 212, 111971. [Google Scholar] [CrossRef]
122. Bin Murtaza, S.; Mccoy, A.; Ren, Z.; Murphy, A.; Banzhaf, W. LLM fault localisation within evolutionary computation based automated program repair. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, Melbourne, Australia, 14–18 July 2024; pp. 1824–1829. [Google Scholar]
123. Ouedraogo, W.C.; Kabore, K.; Tian, H.; Song, Y.; Koyuncu, A.; Klein, J.; Lo, D.; Bissyande, T.F. LLMs and prompting for unit test generation: A large-scale evaluation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 2464–2465. [Google Scholar]
124. Eshghie, M.; Artho, C. Oracle-guided vulnerability diversity and exploit synthesis of smart contracts using LLMs. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 2240–2248. [Google Scholar]
  125. Huang, K.; Zhang, J.; Bao, X.; Wang, X.; Liu, Y. Comprehensive Fine-Tuning Large Language Models of Code for Automated Program Repair. IEEE Trans. Softw. Eng. 2025, 51, 904–928. [Google Scholar] [CrossRef]
  126. Soud, M.; Nuutinen, W.; Liebel, G. Sóley: Automated detection of logic vulnerabilities in Ethereum smart contracts using large language models. J. Syst. Softw. 2025, 226, 112406. [Google Scholar] [CrossRef]
127. Li, X.; Wang, S.; Li, S.; Ma, J.; Yu, J.; Liu, X.; Wang, J.; Ji, B.; Zhang, W. Model editing for LLMs4Code: How far are we? arXiv 2024, arXiv:2411.06638. [Google Scholar] [CrossRef]
128. Kumar, J.; Chimalakonda, S. Code summarization without direct access to code: Towards exploring federated LLMs for software engineering. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, Salerno, Italy, 18–21 June 2024; pp. 100–109. [Google Scholar]
129. Ahmed, T.; Devanbu, P. Few-shot training LLMs for project-specific code-summarization. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, Rochester, MI, USA, 10–14 October 2022; pp. 1–5. [Google Scholar]
  130. Yang, Y.; Zhou, X.; Mao, R.; Xu, J.; Yang, L.; Zhang, Y.; Shen, H.; Zhang, H. DLAP: A Deep Learning Augmented Large Language Model Prompting framework for software vulnerability detection. J. Syst. Softw. 2025, 219, 112234. [Google Scholar] [CrossRef]
131. Cai, Z.; Chen, J.; Chen, W.; Wang, W.; Zhu, X.; Ouyang, A. F-CodeLLM: A federated learning framework for adapting large language models to practical software development. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, Lisbon, Portugal, 14–20 April 2024; pp. 416–417. [Google Scholar]
132. Xia, C.S.; Deng, Y.; Dunn, S.; Zhang, L. Demystifying LLM-based software engineering agents. Proc. ACM Softw. Eng. 2025, 2, 801–824. [Google Scholar] [CrossRef]
133. Alami, A.; Jensen, V.V.; Ernst, N.A. Accountability in code review: The role of intrinsic drivers and the impact of LLMs. ACM Trans. Softw. Eng. Methodol. 2025, 34, 233. [Google Scholar] [CrossRef]
  134. Cinkusz, K.; Chudziak, J.A. Towards LLM-augmented multiagent systems for agile software engineering. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 2476–2477. [Google Scholar]
  135. Husain, M.; Khan, M.S.; Khan, J.A.; Khan, N.D.; Khan, A.; Akbar, M.A. Exploring Developers Discussion Forums for Quantum Software Engineering: A Fine-Grained Classification Approach Using Large Language Model (ChatGPT). In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, Trondheim, Norway, 23–28 June 2025; pp. 1742–1755. [Google Scholar]
  136. Ahmed, T.; Pai, K.S.; Devanbu, P.; Barr, E.T. Automatic semantic augmentation of language model prompts (for code summarization). In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE), Lisbon, Portugal, 14–20 April 2024; IEEE Computer Society: Washington, DC, USA, 2024; p. 1004. [Google Scholar]
  137. Zhang, Y.; Qiu, Z.; Stol, K.J.; Zhu, W.; Zhu, J.; Tian, Y.; Liu, H. Automatic commit message generation: A critical review and directions for future work. IEEE Trans. Softw. Eng. 2024, 50, 816–835. [Google Scholar] [CrossRef]
  138. Tufano, R.; Dabić, O.; Mastropaolo, A.; Ciniselli, M.; Bavota, G. Code review automation: Strengths and weaknesses of the state of the art. IEEE Trans. Softw. Eng. 2024, 50, 338–353. [Google Scholar] [CrossRef]
  139. Estévez-Ayres, I.; Callejo, P.; Hombrados-Herrera, M.Á.; Alario-Hoyos, C.; Delgado Kloos, C. Evaluation of LLM Tools for Feedback Generation in a Course on Concurrent Programming. Int. J. Artif. Intell. Educ. 2024, 35, 774–790. [Google Scholar] [CrossRef]
  140. Choi, S.; Kim, H. The impact of a large language model-based programming learning environment on students’ motivation and programming ability. Educ. Inf. Technol. 2024, 30, 8109–8138. [Google Scholar] [CrossRef]
141. Ahmed, T.; Devanbu, P. Better patching using LLM prompting, via self-consistency. In Proceedings of the 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 11–15 September 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1742–1746. [Google Scholar]
142. Ságodi, Z.; Siket, I.; Ferenc, R. Methodology for code synthesis evaluation of LLMs presented by a case study of ChatGPT and Copilot. IEEE Access 2024, 12, 72303–72316. [Google Scholar] [CrossRef]
  143. Hassani, S.; Sabetzadeh, M.; Amyot, D.; Liao, J. Rethinking legal compliance automation: Opportunities with large language models. In Proceedings of the 2024 IEEE 32nd International Requirements Engineering Conference (RE), Reykjavik, Iceland, 24–28 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 432–440. [Google Scholar]
  144. Colavito, G.; Lanubile, F.; Novielli, N. Benchmarking large language models for automated labeling: The case of issue report classification. Inf. Softw. Technol. 2025, 184, 107758. [Google Scholar] [CrossRef]
  145. Cai, Y.; Liang, P.; Wang, Y.; Li, Z.; Shahin, M. Demystifying issues, causes and solutions in LLM open-source projects. J. Syst. Softw. 2025, 227, 112452. [Google Scholar] [CrossRef]
  146. Yan, M.; Chen, J.; Zhang, J.M.; Cao, X.; Yang, C.; Harman, M. Robustness evaluation of code generation systems via concretizing instructions. Inf. Softw. Technol. 2025, 179, 107645. [Google Scholar] [CrossRef]
147. Ma, Q.; Peng, W.; Yang, C.; Shen, H.; Koedinger, K.; Wu, T. What should we engineer in prompts? Training humans in requirement-driven LLM use. ACM Trans. Comput.-Hum. Interact. 2025, 32, 41. [Google Scholar] [CrossRef]
  148. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef]
  149. Xia, Y.; Xiao, Z.; Jazdi, N.; Weyrich, M. Generation of asset administration shell with large language model agents: Toward semantic interoperability in digital twins in the context of industry 4.0. IEEE Access 2024, 12, 84863–84877. [Google Scholar] [CrossRef]
150. Kitchenham, B.; Charters, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering; EBSE Technical Report, ver. 2.3; EBSE, 2007. Available online: https://www.researchgate.net/publication/302924724_Guidelines_for_performing_Systematic_Literature_Reviews_in_Software_Engineering (accessed on 4 December 2025).
  151. Bouzenia, I.; Devanbu, P.; Pradel, M. Repairagent: An autonomous, llm-based agent for program repair. arXiv 2024, arXiv:2403.17134. [Google Scholar]
  152. Li, Y.; Choi, D.; Chung, J.; Kushman, N.; Schrittwieser, J.; Leblond, R.; Eccles, T.; Keeling, J.; Gimeno, F.; Dal Lago, A.; et al. Competition-level code generation with alphacode. Science 2022, 378, 1092–1097. [Google Scholar] [CrossRef]
  153. Lu, S.; Guo, D.; Ren, S.; Huang, J.; Svyatkovskiy, A.; Blanco, A.; Clement, C.; Drain, D.; Jiang, D.; Tang, D.; et al. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv 2021, arXiv:2102.04664. [Google Scholar] [CrossRef]
  154. Braconaro, E.; Losiouk, E. A Dataset for Evaluating LLMs Vulnerability Repair Performance in Android Applications: Data/Toolset paper. In Proceedings of the Fifteenth ACM Conference on Data and Application Security and Privacy, Pittsburgh, PA, USA, 4–6 June 2024; pp. 353–358. [Google Scholar]
  155. Hagel, N.; Hili, N.; Bartel, A.; Koziolek, A. Towards LLM-Powered Consistency in Model-Based Low-Code Platforms. In Proceedings of the 2025 IEEE 22nd International Conference on Software Architecture Companion (ICSA-C), Odense, Denmark, 31 March–4 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 364–369. [Google Scholar]
  156. Muttillo, V.; Di Sipio, C.; Rubei, R.; Berardinelli, L.; Dehghani, M. Towards synthetic trace generation of modeling operations using in-context learning approach. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 619–630. [Google Scholar]
  157. van Can, A.T.; Dalpiaz, F. Locating requirements in backlog items: Content analysis and experiments with large language models. Inf. Softw. Technol. 2025, 179, 107644. [Google Scholar] [CrossRef]
  158. Hassine, J. An llm-based approach to recover traceability links between security requirements and goal models. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, Salerno, Italy, 18–21 June 2024; pp. 643–651. [Google Scholar]
  159. Anandayuvaraj, D.; Campbell, M.; Tewari, A.; Davis, J.C. Fail: Analyzing software failures from the news using llms. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 506–518. [Google Scholar]
  160. Chomątek, Ł.; Papuga, J.; Nowak, P.; Poniszewska-Marańda, A. Decoding CI/CD Practices in Open-Source Projects with LLM Insights. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, Trondheim, Norway, 23–28 June 2025; pp. 1638–1644. [Google Scholar]
  161. Ling, Y.; Yu, S.; Fang, C.; Pan, G.; Wang, J.; Liu, J. Redefining crowdsourced test report prioritization: An innovative approach with large language model. Inf. Softw. Technol. 2025, 179, 107629. [Google Scholar] [CrossRef]
  162. Almatrafi, A.A.; Eassa, F.A.; Sharaf, S.A. Code clone detection techniques based on large language models. IEEE Access 2025, 13, 46136–46146. [Google Scholar] [CrossRef]
  163. Nashaat, M.; Amin, R.; Eid, A.H.; Abdel-Kader, R.F. An enhanced transformer-based framework for interpretable code clone detection. J. Syst. Softw. 2025, 222, 112347. [Google Scholar] [CrossRef]
  164. Mandli, A.R.; Rajput, S.; Sharma, T. COMET: Generating commit messages using delta graph context representation. J. Syst. Softw. 2025, 222, 112307. [Google Scholar] [CrossRef]
  165. Kumar, A.; Sankar, S.; Das, P.P.; Chakrabarti, P.P. Using Large Language Models for multi-level commit message generation for large diffs. Inf. Softw. Technol. 2025, 187, 107831. [Google Scholar] [CrossRef]
  166. Vijayaraghavan, P.; Nitsure, A.; Mackin, C.; Shi, L.; Ambrogio, S.; Haran, A.; Paruthi, V.; Elzein, A.; Coops, D.; Beymer, D.; et al. Chain-of-descriptions: Improving code llms for vhdl code generation and summarization. In Proceedings of the 2024 ACM/IEEE International Symposium on Machine Learning for CAD, Salt Lake City, UT, USA, 9–11 September 2024; pp. 1–10. [Google Scholar]
  167. Qayyum, K.; Jha, C.K.; Ahmadi-Pour, S.; Hassan, M.; Drechsler, R. LLM-assisted Bug Identification and Correction for Verilog HDL. ACM Trans. Des. Autom. Electron. Syst. 2025, 30, 101. [Google Scholar] [CrossRef]
  168. Gramacki, P.; Martins, B.; Szymański, P. Evaluation of code llms on geospatial code generation. In Proceedings of the 7th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery, Atlanta, GA, USA, 29 October–1 November 2024; pp. 54–62. [Google Scholar]
  169. Koziolek, H.; Koziolek, A. Llm-based control code generation using image recognition. In Proceedings of the 1st International Workshop on Large Language Models for Code, Lisbon, Portugal, 20 April 2024; pp. 38–45. [Google Scholar]
  170. Ko, E.; Kang, P. Evaluating Coding Proficiency of Large Language Models: An Investigation Through Machine Learning Problems. IEEE Access 2025, 13, 52925–52938. [Google Scholar] [CrossRef]
  171. Bhatia, S.; Gandhi, T.; Kumar, D.; Jalote, P. Unit test generation using generative AI: A comparative performance analysis of autogeneration tools. In Proceedings of the 1st International Workshop on Large Language Models for Code, Lisbon, Portugal, 20 April 2024; pp. 54–61. [Google Scholar]
  172. Mathews, N.S.; Nagappan, M. Test-driven development and llm-based code generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1583–1594. [Google Scholar]
  173. Takerngsaksiri, W.; Charakorn, R.; Tantithamthavorn, C.; Li, Y.F. Pytester: Deep reinforcement learning for text-to-testcase generation. J. Syst. Softw. 2025, 224, 112381. [Google Scholar] [CrossRef]
  174. Lops, A.; Narducci, F.; Ragone, A.; Trizio, M. AgoneTest: Automated creation and assessment of Unit tests leveraging Large Language Models. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 2440–2441. [Google Scholar]
  175. Yang, L.; Yang, C.; Gao, S.; Wang, W.; Wang, B.; Zhu, Q.; Chu, X.; Zhou, J.; Liang, G.; Wang, Q.; et al. On the evaluation of large language models in unit test generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1607–1619. [Google Scholar]
  176. Wu, H.W.; Lee, S.J. Can Large Language Model Aid in Generating Properties for UPPAAL Timed Automata? A Case Study. In Proceedings of the 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), Osaka, Japan, 2–4 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 2248–2253. [Google Scholar]
  177. Ma, L.; Liu, S.; Li, Y.; Xie, X.; Bu, L. Specgen: Automated generation of formal program specifications via large language models. arXiv 2024, arXiv:2401.08807. [Google Scholar] [CrossRef]
  178. Yang, A.Z.; Le Goues, C.; Martins, R.; Hellendoorn, V. Large language models for test-free fault localization. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, Lisbon, Portugal, 14–20 April 2024; pp. 1–12. [Google Scholar]
  179. Ji, S.; Lee, S.; Lee, C.; Han, Y.S.; Im, H. Impact of Large Language Models of Code on Fault Localization. In Proceedings of the 2025 IEEE Conference on Software Testing, Verification and Validation (ICST), Naples, Italy, 31 March–4 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 302–313. [Google Scholar]
  180. Ji, Z.; Ma, P.; Li, Z.; Wang, Z.; Wang, S. Causality-Aided Evaluation and Explanation of Large Language Model-Based Code Generation. Proc. ACM Softw. Eng. 2025, 2, 1374–1397. [Google Scholar] [CrossRef]
  181. Kang, S.; An, G.; Yoo, S. A quantitative and qualitative evaluation of LLM-based explainable fault localization. Proc. ACM Softw. Eng. 2024, 1, 1424–1446. [Google Scholar] [CrossRef]
  182. Etemadi, K.; Mohammadi, B.; Su, Z.; Monperrus, M. Mokav: Execution-driven differential testing with llms. J. Syst. Softw. 2025, 230, 112571. [Google Scholar] [CrossRef]
  183. Feng, S.; Lu, H.; Jiang, J.; Xiong, T.; Huang, L.; Liang, Y.; Li, X.; Deng, Y.; Aleti, A. Enabling Cost-Effective UI Automation Testing with Retrieval-Based LLMs: A Case Study in WeChat. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1973–1978. [Google Scholar]
  184. Xue, Z.; Li, L.; Tian, S.; Chen, X.; Li, P.; Chen, L.; Jiang, T.; Zhang, M. Llm4fin: Fully automating llm-powered test case generation for fintech software acceptance testing. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, Vienna, Austria, 16–20 September 2024; pp. 1643–1655. [Google Scholar]
  185. Patel, S.; Yadavally, A.; Dhulipala, H.; Nguyen, T. Planning a Large Language Model for Static Detection of Runtime Errors in Code Snippets. In Proceedings of the 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), Ottawa, ON, Canada, 27 April–3 May 2025; IEEE Computer Society: Washington, DC, USA, 2025; p. 639. [Google Scholar]
186. Rong, G.; Yu, Y.; Liu, S.; Tan, X.; Zhang, T.; Shen, H.; Hu, J. Code Comment Inconsistency Detection and Rectification Using a Large Language Model. In Proceedings of the 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), Ottawa, ON, Canada, 27 April–3 May 2025; IEEE Computer Society: Washington, DC, USA, 2025; pp. 432–443. [Google Scholar]
  187. Wen, C.; Cai, Y.; Zhang, B.; Su, J.; Xu, Z.; Liu, D.; Qin, S.; Ming, Z.; Cong, T. Automatically inspecting thousands of static bug warnings with large language model: How far are we? ACM Trans. Knowl. Discov. Data 2024, 18, 168. [Google Scholar] [CrossRef]
  188. Cheng, B.; Zhang, C.; Wang, K.; Shi, L.; Liu, Y.; Wang, H.; Guo, Y.; Li, D.; Chen, X. Semantic-enhanced indirect call analysis with large language models. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 430–442. [Google Scholar]
  189. Wu, C.; Chen, J.; Wang, Z.; Liang, R.; Du, R. Semantic sleuth: Identifying ponzi contracts via large language models. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 582–593. [Google Scholar]
  190. Jiang, Z.; Wen, M.; Cao, J.; Shi, X.; Jin, H. Towards understanding the effectiveness of large language models on directed test input generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1408–1420. [Google Scholar]
  191. Ferrag, M.A.; Battah, A.; Tihanyi, N.; Jain, R.; Maimuţ, D.; Alwahedi, F.; Lestable, T.; Thandi, N.S.; Mechri, A.; Debbah, M.; et al. Securefalcon: Are we there yet in automated software vulnerability detection with llms? IEEE Trans. Softw. Eng. 2025, 51, 1248–1265. [Google Scholar] [CrossRef]
  192. Yang, S.; Lin, X.; Chen, J.; Zhong, Q.; Xiao, L.; Huang, R.; Wang, Y.; Zheng, Z. Hyperion: Unveiling dapp inconsistencies using llm and dataflow-guided symbolic execution. arXiv 2024, arXiv:2408.06037. [Google Scholar] [CrossRef]
  193. Wu, Y.; Xie, X.; Peng, C.; Liu, D.; Wu, H.; Fan, M.; Liu, T.; Wang, H. Advscanner: Generating adversarial smart contracts to exploit reentrancy vulnerabilities using llm and static analysis. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1019–1031. [Google Scholar]
  194. Wang, C.; Zhang, J.; Gao, J.; Xia, L.; Guan, Z.; Chen, Z. Contracttinker: Llm-empowered vulnerability repair for real-world smart contracts. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 2350–2353. [Google Scholar]
  195. Acharya, J.; Ginde, G. Graph neural network vs. large language model: A comparative analysis for bug report priority and severity prediction. In Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering, Porto de Galinhas, Brazil, 16 July 2024; pp. 2–11. [Google Scholar]
  196. Ramler, R.; Straubinger, P.; Plösch, R.; Winkler, D. Unit Testing Past vs. Present: Examining LLMs’ Impact on Defect Detection and Efficiency. arXiv 2025, arXiv:2502.09801. [Google Scholar] [CrossRef]
  197. Al-Turki, D.; Hettiarachchi, H.; Gaber, M.M.; Abdelsamea, M.M.; Basurra, S.; Iranmanesh, S.; Saadany, H.; Vakaj, E. Human-in-the-Loop learning with LLMs for efficient RASE tagging in building compliance regulations. IEEE Access 2024, 12, 185291–185306. [Google Scholar] [CrossRef]
  198. Zamfirescu-Pereira, J.; Jun, E.; Terry, M.; Yang, Q.; Hartmann, B. Beyond code generation: Llm-supported exploration of the program design space. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 26 April–1 May 2025; pp. 1–17. [Google Scholar]
  199. Sun, Y.; Hao, B.; Wang, X.; Zhao, C.; Zhao, Y.; Shi, B.; Zhang, S.; Ge, Q.; Li, W.; Wei, H.; et al. LLM-Augmented Ticket Aggregation for Low-cost Mobile OS Defect Resolution. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, Trondheim, Norway, 23–28 June 2025; pp. 215–226. [Google Scholar]
  200. Kang, S.; Yoon, J.; Yoo, S. Large language models are few-shot testers: Exploring llm-based general bug reproduction. In Proceedings of the 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 14–20 May 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2312–2323. [Google Scholar]
  201. Kang, S.; Chen, B.; Yoo, S.; Lou, J.G. Explainable automated debugging via large language model-driven scientific debugging. Empir. Softw. Eng. 2024, 30, 45. [Google Scholar] [CrossRef]
202. Zhang, Y.; Liu, Z.; Feng, Y.; Xu, B. Leveraging large language model to assist detecting rust code comment inconsistency. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 356–366. [Google Scholar]
  203. Mandal, U.; Shukla, S.; Rastogi, A.; Bhattacharya, S.; Mukhopadhyay, D. μLAM: A LLM-Powered Assistant for Real-Time Micro-architectural Attack Detection and Mitigation. In Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, New York, NY, USA, 1 October–1 November 2024; pp. 1–9. [Google Scholar]
  204. Xia, Y.; Shao, H.; Deng, X. Vulcobert: A codebert-based system for source code vulnerability detection. In Proceedings of the 2024 International Conference on Generative Artificial Intelligence and Information Security, Kuala Lumpur, Malaysia, 10–12 May 2024; pp. 249–252. [Google Scholar]
  205. Zhao, Y.; Gong, L.; Huang, Z.; Wang, Y.; Wei, M.; Wu, F. Coding-PTMs: How to Find Optimal Code Pre-trained Models for Code Embedding in Vulnerability Detection? In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; ACM: New York, NY, USA, 2024; pp. 1732–1744. [Google Scholar] [CrossRef]
  206. Gao, C.; Chen, X.; Zhang, G. SVA-ICL: Improving LLM-based software vulnerability assessment via in-context learning and information fusion. Inf. Softw. Technol. 2025, 186, 107803. [Google Scholar] [CrossRef]
  207. Yang, X.; Zhu, W.; Pacheco, M.; Zhou, J.; Wang, S.; Hu, X.; Liu, K. Code Change Intention, Development Artifact, and History Vulnerability: Putting Them Together for Vulnerability Fix Detection by LLM. Proc. ACM Softw. Eng. 2025, 2, 489–510. [Google Scholar] [CrossRef]
  208. Nangia, A.; Ayachitula, S.; Kundu, C. In-Context Vulnerability Propagation in LLMs [Work In Progress Paper]. In Proceedings of the 30th ACM Symposium on Access Control Models and Technologies, Stony Brook, NY, USA, 8–10 July 2025; pp. 169–174. [Google Scholar]
  209. Wu, Y.; Wen, M.; Yu, Z.; Guo, X.; Jin, H. Effective vulnerable function identification based on cve description empowered by large language models. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 393–405. [Google Scholar]
  210. Aljedaani, W.; Eler, M.M.; Parthasarathy, P. Enhancing accessibility in software engineering projects with large language models (llms). In Proceedings of the 56th ACM Technical Symposium on Computer Science Education, Pittsburgh, PA, USA, 26 February–1 March 2025; Volume 1, pp. 25–31. [Google Scholar]
211. Oertel, J.; Klünder, J.; Hebig, R. Don’t settle for the first! How many GitHub Copilot solutions should you check? Inf. Softw. Technol. 2025, 183, 107737. [Google Scholar] [CrossRef]
  212. Yang, R.; Fu, M.; Tantithamthavorn, C.; Arora, C.; Vandenhurk, L.; Chua, J. RAGVA: Engineering retrieval augmented generation-based virtual assistants in practice. arXiv 2025, arXiv:2502.14930. [Google Scholar] [CrossRef]
  213. Kulsum, U.; Zhu, H.; Xu, B.; d’Amorim, M. A case study of llm for automated vulnerability repair: Assessing impact of reasoning and patch validation feedback. In Proceedings of the 1st ACM International Conference on AI-Powered Software, Porto de Galinhas, Brazil, 15–16 July 2024; pp. 103–111. [Google Scholar]
  214. Yadav, D.; Mondal, S. Evaluating Pre-trained Large Language Models on Zero Shot Prompts for Parallelization of Source Code. J. Syst. Softw. 2025, 230, 112543. [Google Scholar] [CrossRef]
  215. Dong, J.; Sun, J.; Zhang, W.; Dong, J.S.; Hao, D. ConTested: Consistency-Aided Tested Code Generation with LLM. Proc. ACM Softw. Eng. 2025, 2, 596–617. [Google Scholar] [CrossRef]
  216. Evtikhiev, M.; Bogomolov, E.; Sokolov, Y.; Bryksin, T. Out of the BLEU: How should we assess quality of the Code Generation models? J. Syst. Softw. 2023, 203, 111741. [Google Scholar] [CrossRef]
217. Mansur, E.; Chen, J.; Raza, M.A.; Wardat, M. RAGFix: Enhancing LLM Code Repair Using RAG and Stack Overflow Posts. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 7491–7496. [Google Scholar]
  218. Tomic, S.; Alégroth, E.; Isaac, M. Evaluation of the Choice of LLM in a Multi-agent Solution for GUI-Test Generation. In Proceedings of the 2025 IEEE Conference on Software Testing, Verification and Validation (ICST), Naples, Italy, 31 March–4 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 487–497. [Google Scholar]
  219. Chapman, P.J.; Rubio-González, C.; Thakur, A.V. Interleaving static analysis and llm prompting. In Proceedings of the 13th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis, Copenhagen, Denmark, 25 June 2024; pp. 9–17. [Google Scholar]
220. Zhou, X.; Zhang, T.; Lo, D. Large language model for vulnerability detection: Emerging results and future directions. In Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results, Lisbon, Portugal, 14–20 April 2024; pp. 47–51. [Google Scholar]
  221. Sergeyuk, A.; Golubev, Y.; Bryksin, T.; Ahmed, I. Using AI-based coding assistants in practice: State of affairs, perceptions, and ways forward. Inf. Softw. Technol. 2025, 178, 107610. [Google Scholar] [CrossRef]
  222. Li, J.; Liu, S.; Jin, Z. Automated formal-specification-to-code trace links recovery using multi-dimensional similarity measures. J. Syst. Softw. 2025, 226, 112439. [Google Scholar] [CrossRef]
  223. Lai, C.; Zhou, Z.; Poptani, A.; Zhang, W. Lcm: Llm-focused hybrid spm-cache architecture with cache management for multi-core ai accelerators. In Proceedings of the 38th ACM International Conference on Supercomputing, Kyoto, Japan, 4–7 June 2024; pp. 62–73. [Google Scholar]
  224. Choudhuri, R.; Liu, D.; Steinmacher, I.; Gerosa, M.; Sarma, A. How far are we? the triumphs and trials of generative ai in learning software engineering. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, Portugal, 14–20 April 2024; pp. 1–13. [Google Scholar]
  225. Nguyen, P.T.; Di Rocco, J.; Di Sipio, C.; Rubei, R.; Di Ruscio, D.; Di Penta, M. GPTSniffer: A CodeBERT-based classifier to detect source code written by ChatGPT. J. Syst. Softw. 2024, 214, 112059. [Google Scholar] [CrossRef]
  226. Pedroso, D.F.; Almeida, L.; Pulcinelli, L.E.G.; Aisawa, W.A.A.; Dutra, I.; Bruschi, S.M. Anomaly Detection and Root Cause Analysis in Cloud-Native Environments Using Large Language Models and Bayesian Networks. IEEE Access 2025, 13, 77550–77564. [Google Scholar] [CrossRef]
  227. Han, Y.; Du, Q.; Huang, Y.; Wu, J.; Tian, F.; He, C. The potential of one-shot failure root cause analysis: Collaboration of the large language model and small classifier. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 931–943. [Google Scholar]
  228. North, M.; Atapour-Abarghouei, A.; Bencomo, N. Code gradients: Towards automated traceability of llm-generated code. In Proceedings of the 2024 IEEE 32nd International Requirements Engineering Conference (RE), Reykjavik, Iceland, 24–28 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 321–329. [Google Scholar]
  229. Ali, M.; Giallousi, N.; Melidis, A.; Alexopoulos, C.; Charalabidis, Y. GlossAPI: Architecturing the Greek Data Pile for LLM development. In Proceedings of the 28th Pan-Hellenic Conference on Progress in Computing and Informatics, Athens, Greece, 13–15 December 2024; pp. 16–25. [Google Scholar]
  230. Xu, Z.; Kong, D.; Liu, J.; Li, J.; Hou, J.; Dai, X.; Li, C.; Wei, S.; Hu, Y.; Yin, S. WSC-LLM: Efficient LLM Service and Architecture Co-exploration for Wafer-scale Chips. In Proceedings of the 52nd Annual International Symposium on Computer Architecture, Tokyo, Japan, 21–25 June 2025; pp. 1–17. [Google Scholar]
Figure 1. PRISMA flow of study selection: records identified (n = 535) → title/abstract screened (n = 535) with exclusions (n = 259) → full-text assessed (n = 276) with exclusions (n = 52) → studies included in the final review (n = 224).
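The selection arithmetic in the Figure 1 PRISMA flow can be cross-checked in a few lines. This is an illustrative sketch using only the counts stated in the caption; the variable names are ours, not from any screening tool:

```python
# Counts taken from the Figure 1 caption; variable names are illustrative.
identified = 535            # records identified across all sources
screening_exclusions = 259  # removed at title/abstract screening
full_text_exclusions = 52   # removed at full-text assessment

full_text_assessed = identified - screening_exclusions
included = full_text_assessed - full_text_exclusions

print(full_text_assessed, included)  # → 276 224
```

The result matches the caption: 276 full texts assessed and 224 studies in the final review.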
Figure 2. Heatmap of included studies across SE phases (rows) and FM capabilities (columns). Counts are shown per cell.
Table 1. Examples of FM applications across software engineering phases.
| SE Phase | Common FM Task | Example Studies and Applications |
|---|---|---|
| Requirements | Requirement extraction, summarization, and translation | FMs support writing and checking requirement documents, tracing links, and translating user stories across languages or project teams [21,23,27,30,101,102,103,104]. |
| Design/Architecture | Decision support, rationale explanation | Models help designers summarize design decisions, compare options, and explain architectural choices [3,32,33,105,106]. |
| Implementation/Coding | Code generation, completion, and refactoring | FMs produce working code, translate between languages such as C++ and JavaScript, and suggest small fixes or cleanups [6,35,40,107,108,109,110,111,112]. |
| Testing/QA | Test generation, bug detection, log summarization | LLMs create unit tests, locate bugs, and summarize failure reports for easier debugging [54,88,89,113,114,115,116,117,118,119,120,121,122,123,124]. |
| Maintenance/Evolution | Program repair, defect classification, refactoring | FMs suggest patches for faulty code, group related bug reports, and classify types of technical debt [41,53,125,126,127,128,129,130]. |
| Project/Process Management | Workflow support, prioritization, and coordination | Multi-agent LLM systems are used for agile planning, task triage, and summarizing project updates [131,132,133,134]. |
| Other/Cross-cutting | Summarization and linking across artifacts | FMs connect related items, such as linking requirements to commits or summarizing large logs [19,20,32,135,136]. |
Table 2. Summary of search results, inclusion, exclusion, and cumulative retained studies after each source.
| Source | Total Records | Included (%) | Cumulative Retained |
|---|---|---|---|
| IEEE Xplore | 185 | 102 (55.1%) | 102 |
| ACM Digital Library | 112 | 97 (86.6%) | 197 |
| ScienceDirect | 223 | 65 (29.1%) | 262 |
| SpringerLink | 15 | 14 (93.3%) | 276 |
| Total (before deduplication) | 535 | 278 | – |
| Total (after deduplication) | 276 | – | 276 |

– indicates not applicable.
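The per-source inclusion rates in Table 2 can be recomputed directly from the raw counts. The sketch below uses only numbers from the table; the two records removed by deduplication follow from the difference between the 278 included records and the 276 unique studies retained:

```python
# (source, records retrieved, records included) — values from Table 2.
sources = [
    ("IEEE Xplore", 185, 102),
    ("ACM Digital Library", 112, 97),
    ("ScienceDirect", 223, 65),
    ("SpringerLink", 15, 14),
]

# Inclusion rate per source, rounded to one decimal as in the table.
for name, total, included in sources:
    print(f"{name}: {included}/{total} = {100 * included / total:.1f}%")

before_dedup = sum(inc for _, _, inc in sources)  # 278 before deduplication
after_dedup = before_dedup - 2                    # 276 unique studies retained
print(before_dedup, after_dedup)  # → 278 276
```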
Table 3. Main reasons for exclusion during screening and full-text review. Detailed counts are provided in the replication package.
Exclusion Reason
- Not a software engineering task or artifact
- No empirical evaluation or validation
- Focuses on model development only (no SE task)
- Unclear or duplicated study
Table 4. Representative FM-based tools in SE.
| Tool | Description/Use | Ref. |
|---|---|---|
| Codex | Basis of Copilot; code and test generation. | [6] |
| StarCoder | Open BigCode model for code tasks. | [8] |
| GPT-4 | Proprietary LLM; broad SE applications. | [7] |
| CodeT5+ | Transformer for code intelligence tasks. | [10] |
| RepairAgent | Autonomous LLM-based program repair. | [151] |
| Code Llama | Open LLaMA2-based family for code. | [9] |
| AlphaCode | Competitive programming system. | [152] |
Table 5. Representative datasets/benchmarks in SE.
| Dataset/Benchmark | Focus/Usage | Ref. |
|---|---|---|
| HumanEval | Python code generation benchmark. | [6] |
| BigCodeBench | Large-scale eval suite for StarCoder. | [8] |
| Defects4J | Program repair and fault localization. | [10] |
| TFix | Code fix dataset (JavaScript). | [10] |
| ConDefects | Leakage-aware repair/localization data. | [96] |
| MBPP | 974 curated Python problems. | [4] |
| CodeXGLUE | Multi-task benchmark suite. | [153] |
| CodeContests | Competitive programming problems. | [152] |
| Android vuln. repair | Android security repair (Java/XML) with human-validated fixes. | [154] |
Table 6. Two-dimensional taxonomy (compact): count of included studies by primary SE phase (rows) and FM capability (columns).
| Primary SE Phase | Arch/Design | Bug/Defect | CodeGen | Summ. | Transl. | Repair | TestGen | Reqts | Other | Row Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Design/Arch | 3 | 0 | 2 | 0 | 1 | 0 | 0 | 2 | 4 | 12 |
| Impl/Coding | 0 | 1 | 43 | 5 | 4 | 0 | 0 | 0 | 5 | 58 |
| Maint/Evol | 0 | 10 | 0 | 8 | 0 | 17 | 0 | 0 | 9 | 44 |
| Other | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 9 | 13 |
| Process Mgmt | 1 | 0 | 1 | 2 | 0 | 0 | 0 | 1 | 9 | 14 |
| Requirements | 0 | 0 | 1 | 3 | 0 | 0 | 0 | 14 | 1 | 19 |
| Testing/QA | 0 | 28 | 6 | 2 | 1 | 2 | 19 | 0 | 6 | 64 |
| Col. Total | 4 | 39 | 57 | 20 | 6 | 19 | 19 | 17 | 43 | 224 |
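The marginals of the Table 6 cross-tabulation can be recomputed to confirm internal consistency: each row sums to its phase total, each column to its capability total, and everything to the 224 included studies. A minimal sketch with the cell counts transcribed from the table:

```python
# Cell counts per SE phase, in Table 6 column order:
# Arch/Design, Bug/Defect, CodeGen, Summ., Transl., Repair, TestGen, Reqts, Other
counts = {
    "Design/Arch":  [3, 0, 2, 0, 1, 0, 0, 2, 4],
    "Impl/Coding":  [0, 1, 43, 5, 4, 0, 0, 0, 5],
    "Maint/Evol":   [0, 10, 0, 8, 0, 17, 0, 0, 9],
    "Other":        [0, 0, 4, 0, 0, 0, 0, 0, 9],
    "Process Mgmt": [1, 0, 1, 2, 0, 0, 0, 1, 9],
    "Requirements": [0, 0, 1, 3, 0, 0, 0, 14, 1],
    "Testing/QA":   [0, 28, 6, 2, 1, 2, 19, 0, 6],
}

row_totals = {phase: sum(row) for phase, row in counts.items()}
col_totals = [sum(col) for col in zip(*counts.values())]
grand_total = sum(row_totals.values())

print(row_totals["Testing/QA"], grand_total)  # → 64 224
```

The recomputed column totals (4, 39, 57, 20, 6, 19, 19, 17, 43) agree with the table, and the Testing/QA row total of 64 matches the 64 studies analyzed in depth in the testing synthesis.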
Table 7. At-a-glance summary of FM-driven testing tasks, primary FM capability, exemplars, key strength, and key challenge.
| Task | FM Capability | Refs. | Key Strength | Key Challenge |
|---|---|---|---|---|
| Unit test gen. | Code/test generation | [75,76,88,171,172,173,174,175] | Higher line/branch coverage; strong pass rates | Prompt/seed variance; coverage vs. correctness |
| Property/oracle gen. | Spec drafting & repair | [176,177] | Automates formal specs; augments human oracles | Oracle cost; verifier latency |
| Fault localization | Explanation/ranking | [178,179,180,181] | Beats ML baselines; works without tests | Stability; leakage across datasets |
| Differential testing | Behavioral comparison | [182] | Iterative execution feedback; high difference-exposing tests | Runtime harness cost; flaky diffs |
| UI/acceptance | Planning + RAG | [183,184] | Cost-effective automation; high scenario/code coverage | Grounding; latency; cost |
| Static + semantics | Summarization | [185,186,187,188,189,190] | Handles partial context; strengthens static triage | Indirect calls; partial context |
| Security | Detection | [114,191,192,193,194] | Line-level vuln. detection; interpretable explanations | Precision/recall vs. deployability |
| Human factors | Assistance & triage | [195,196,197,198,199] | Improves developer productivity and triage confidence | Trust; false positives; workflow fit |
Table 8. Challenge → Opportunity map for FM-driven testing.
| Challenge (C) | Corresponding Opportunity (O) | References |
|---|---|---|
| C1—Prompt/seed/model variance | O1—Add structure (method slicing, structured seeds) and RAG; O2—Use execution/verification loops to stabilise outputs | [54,55,64,75,76,88,182] |
| C2—Oracle construction & verifier latency | O2—Verifier-/execution-in-the-loop controllers; cached checks; selective verification | [55,176,177,182,215] |
| C3—Data leakage & comparability | O4—Time-sliced/complementary corpora; full prompt/seed reporting; multi-signal metrics | [19,20,49,96,216] |
| C4—Grounding/scale for UI & acceptance | O1—RAG over screens/DOM and business rules; cost controllers; process artifacts | [183,184,217,218] |
| C5—Static semantic gaps (indirect calls, partial code) | O2—CFG planners and semantic summaries; integrate with static analyses | [185,186,188,219] |
| C6—Security deployability (accuracy vs. latency) | O3—Compact/task-adapted models; hybrid LLM + analysis pipelines; on-prem CPU paths | [191,192,193,220] |
| C7—Integration & developer trust (IDE/CI, false positives) | O5—Human–AI collaboration patterns: explain–edit–enforce; rationale summaries; CI risk gates | [196,198,211,221] |

Share and Cite

MDPI and ACS Style

Banitaan, S.; Daoud, M.; Alquran, H.; Akour, M. Foundation Models in Software Engineering: A Taxonomy, Systematic Review, and In-Depth Analysis of Testing Support. Information 2026, 17, 73. https://doi.org/10.3390/info17010073


