Foundation Models in Software Engineering: A Taxonomy, Systematic Review, and In-Depth Analysis of Testing Support
Abstract
1. Introduction
- RQ1: How are foundation models being used in the different phases of the software engineering process?
- RQ2: What specific capabilities do these models provide, and how are these capabilities applied within various SE phases?
- RQ3: What are the main strengths, limitations, and unresolved challenges when using foundation models for software testing?
- RQ4: Where do current research and tools fall short, and what future directions could help advance foundation model–driven software engineering?
- A phase–capability taxonomy of foundation model use in software engineering. We introduce a two-dimensional taxonomy that connects where foundation models are applied across the software engineering life cycle (for example, requirements, design, implementation, testing, maintenance, and project management) with what capability they provide (such as code generation, summarization, or defect repair). Earlier surveys typically focused on only one of these aspects. Our taxonomy combines both dimensions, allowing us to identify well-studied areas and those that remain underexplored.
- A systematic review of 224 recent studies with transparent selection and classification. We reviewed more than 500 papers and included 224 that demonstrate concrete uses of foundation models in software engineering. The review follows a reproducible protocol and maps each study to its corresponding life-cycle phase and model capability, offering a structured and up-to-date picture of current research activity.
- A detailed evidence map of how foundation models support software testing and quality assurance. We organize and analyze prior work into eight testing and QA task families: unit test generation, oracle creation, fault localization, regression testing, UI testing, bug triage, vulnerability detection, and human-in-the-loop QA. For each task, we summarize typical workflows, empirical strengths, and known limitations such as non-determinism, oracle cost, or benchmark leakage.
- A practitioner-oriented adoption agenda. Based on recurring strengths, limitations, and methodological patterns observed in the reviewed literature, we outline practical recommendations for integrating foundation models into software engineering workflows. These include retrieval-augmented prompting, execution- or verification-in-the-loop strategies, task-specific adaptation, and safeguards against data leakage and bias [68].
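To make the taxonomy concrete, the mapping of studies to (phase, capability) cells can be sketched as a simple cross-tabulation; the study records below are illustrative placeholders, not entries from the actual review.

```python
from collections import Counter

# Illustrative (phase, capability) records; the real review maps each
# of the 224 included studies to cells like these.
studies = [
    ("testing", "test generation"),
    ("testing", "summarization"),
    ("implementation", "code generation"),
    ("implementation", "code generation"),
    ("maintenance", "defect repair"),
]

def phase_capability_matrix(records):
    """Count studies per (phase, capability) cell of the taxonomy."""
    return Counter(records)

matrix = phase_capability_matrix(studies)
# Empty cells (count 0) flag underexplored phase-capability pairs.
print(matrix[("implementation", "code generation")])  # 2
print(matrix[("requirements", "summarization")])      # 0
```

Tallying cells this way is what lets the review separate well-studied combinations from gaps.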
2. Novelty and Significance
3. Background and Related Work
3.1. Foundation Models in Software Engineering
3.2. Software Engineering Phases
3.3. Existing Surveys and Gaps
- Phase–capability linkage. Prior work lacks a taxonomy that binds where in the lifecycle a contribution lands to what capability it exercises. We introduce such a two-dimensional taxonomy in Section 5 and use it to analyze 224 included studies.
- Leakage-aware, reproducible evaluation. Prior surveys highlight risks of data leakage and inconsistent reporting [19,20]. We foreground these issues in our analysis by synthesizing evidence on dataset comparability, prompt/seed sensitivity, and reporting practices, and by emphasizing the need for transparent reporting standards.
- Actionable challenges → opportunities. Beyond listing limitations, we map cross-cutting challenges (variance, oracle cost, grounding, deployability, trust) to concrete opportunities (structure+retrieval, execution/verification in the loop, task-specific adaptation, leakage-aware evaluation, collaboration patterns) with exemplars (Section 8).
4. Methodology
(“foundation model” OR “LLM” OR “large language model” OR “code generation” OR “generative AI”) AND (“software engineering” OR “software development” OR “requirements” OR “design” OR “implementation” OR “software testing” OR “maintenance”)
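Interpreted as code, the search string is a conjunction of two OR-groups; a minimal title/abstract filter might look like the sketch below (the function name is ours, and plain substring matching over-approximates what a database query with field and word-boundary restrictions would return).

```python
import re

# The two clauses of the search string, as regex alternations.
MODEL_TERMS = re.compile(
    r"foundation model|LLM|large language model|code generation|generative AI",
    re.IGNORECASE,
)
SE_TERMS = re.compile(
    r"software engineering|software development|requirements|design|"
    r"implementation|software testing|maintenance",
    re.IGNORECASE,
)

def matches_search_string(text: str) -> bool:
    """AND of the two OR-groups: a record must hit both term groups."""
    return bool(MODEL_TERMS.search(text)) and bool(SE_TERMS.search(text))

print(matches_search_string("An LLM approach to software testing"))  # True
print(matches_search_string("LLMs for legal document analysis"))     # False
```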
- The paper applies a foundation model/large-scale pretrained model (LLM/FM; encoder-only, encoder–decoder, or decoder-only) to a concrete SE task.
- The contribution maps to at least one SE lifecycle phase in our taxonomy (requirements, design/architecture, implementation, testing/QA, maintenance/evolution, or project/quality management).
- The paper reports sufficient methodological detail (task, data, procedure) and an empirical evaluation (e.g., on realistic benchmarks such as HumanEval, Defects4J, SF110, industrial logs, or comparable datasets) [149].
- Work that only improves or analyzes LLMs/FMs themselves (training tricks, neural architecture search, alignment, ethics) without an SE task or artifact.
- Out-of-scope domains with no SE activity (e.g., medical EEG/robot cognition, generic NLP classification, legal analysis without SE artifacts).
- Position/vision papers without empirical validation; non-English publications; and duplicates.
- Title/abstract screening against the criteria above.
- Full-text review for papers marked as potentially relevant.
- Deduplication across venues and years.
- Title/abstract screening: (almost perfect agreement).
- Full-text screening: (almost perfect agreement).
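Agreement levels like those reported above are commonly quantified with Cohen's kappa, where values above 0.8 are conventionally labeled almost perfect (following Landis and Koch); a minimal implementation over two raters' include/exclude labels:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    assert n == len(labels_b) and n > 0
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Two screeners label 10 papers; they disagree on exactly one.
rater_a = ["in"] * 5 + ["out"] * 5
rater_b = ["in"] * 4 + ["out"] * 6
print(round(cohens_kappa(rater_a, rater_b), 3))  # 0.8
```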
4.1. Quality Appraisal
4.2. Representative Tools and Datasets
5. A Two-Dimensional Taxonomy of FM Use in SE
5.1. Taxonomy Table and Summary
5.2. Insights and Gaps
- RQ3—Strengths, limitations, and open challenges in testing.
- RQ4—Gaps and future directions.
- Design/Architecture × Summarization/Translation: using FMs to summarize design rationales, migrate architecture documentation, and translate between modeling notations; related work also preserves consistency across heterogeneous models in low-code/model-driven settings and generates development artifacts such as traces [155,156].
- Project and Process Management: applying foundation models to project planning, risk assessment, and incident analysis, going beyond the ad hoc or miscellaneous “Other” applications noted in prior studies. One example is FAIL (Failure Analysis via Intelligent Learning), which uses LLMs to automatically collect, cluster, and analyze software failure and incident reports. The resulting postmortem-style corpora (structured summaries written after incidents occur) can be compared across organizations to identify recurring causes and improvement opportunities. Recent studies also use LLMs to analyze CI/CD pipelines, summarize process issues, and identify governance or planning problems [159,160].
- Testing/QA × Summarization: assisting test oracle explanation and failure report condensation for developer handoff. For example, LLMPrior clusters and prioritizes crowdsourced textual test reports to reduce reviewer reading load and speed triage [161].
- Domain-specific expansion: applying and tailoring FM techniques to specialized domains beyond general software, such as hardware design (e.g., VHDL [166], Verilog [167]), geospatial programming [168], and control code generation from images [169]. This expansion necessitates the creation of specialized benchmarks and prompts to handle domain-specific syntax, semantics, and constraints [170].
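As a toy illustration of the report-condensation idea behind tools like LLMPrior, the sketch below clusters near-duplicate textual test reports by token overlap and orders clusters by size; LLMPrior itself relies on an LLM rather than this lexical heuristic, and all names and reports here are invented.

```python
def tokenize(report: str) -> set[str]:
    return set(report.lower().split())

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)

def cluster_reports(reports, threshold=0.5):
    """Greedy single-pass clustering: attach each report to the first
    cluster whose representative is similar enough, else start a new one."""
    clusters = []  # list of lists of report indices
    for i, rep in enumerate(reports):
        toks = tokenize(rep)
        for cluster in clusters:
            if jaccard(toks, tokenize(reports[cluster[0]])) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    # Larger clusters first: reviewers read one representative per cluster.
    return sorted(clusters, key=len, reverse=True)

reports = [
    "login button crashes the app",
    "the app crashes after login button tap",
    "dark mode colors wrong on settings page",
]
print(cluster_reports(reports))  # [[0, 1], [2]]
```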
6. In-Depth Analysis: Foundation Models in Software Testing
6.1. Unit Test Generation
6.2. Property & Oracle Generation
6.3. Fault Localization (FL)
6.4. Differential/Regression Testing
6.5. System/UI Acceptance Testing
6.6. Static Analysis Triage & Semantic Assistance
6.7. Security Testing & Vulnerability Analysis
6.8. Human-in-the-Loop Testing Practice
7. Methods, Benchmarks, and Evidence
Cross-Cutting Challenges and Opportunities
- RQ1: Where are foundation models most applied across software testing?
- RQ2: What capabilities of foundation models are most leveraged in testing?
- RQ3: What are the key strengths and limitations of foundation models in software testing?
- RQ4: What research gaps and opportunities remain?
8. Challenges and Future Directions
8.1. Cross-Cutting Challenges
- C1—Prompt/seed/model variance and harness sensitivity.
- C2—Oracle construction and verification costs.
- C3—Data leakage and dataset comparability.
- C4—Grounding and scalability for system/UI testing.
- C5—Semantic gaps in static contexts.
- C6—Deployability in security pipelines.
- C7—Sociotechnical integration and trust.
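For C1 in particular, a standard mitigation is to sample many generations per task and report the unbiased pass@k estimator of Chen et al. instead of a single run's outcome: with n samples of which c pass, pass@k is the probability that a random size-k subset contains at least one passing sample.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, passes the tests (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 3 of 10 generations pass: pass@1 is simply the raw pass rate.
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```

Reporting pass@k over many seeds, rather than one prompt/seed combination, makes results far less sensitive to the harness variance described in C1.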
8.2. Future Research Opportunities
- O1—Structure over raw prompting.
- O2—Retrieval-grounded planning at scale.
- O3—Execution/verification-in-the-loop.
- O4—Task-specific adaptation with accountability.
- O5—Leakage-aware benchmarking and reporting.
- O6—Human–AI collaboration patterns.
- O7—Domain-specialised safety and governance.
8.3. Threats to Validity
- Construct validity:
- Internal validity:
- External validity:
- Conclusion validity:
- Reproducibility:
- Methodological Novelty.
- Impact and Research Implications.
8.4. Actionable Framework for FM Adoption in Software Engineering
- (1) Phase–Function Scoping.
- (2) Data and Representation Alignment.
- (3) Integration and Iterative Evaluation.
- (4) Cross-Phase Feedback.
- (5) Institutionalization and Benchmarking.
9. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the opportunities and risks of foundation models. arXiv 2021, arXiv:2108.07258. [Google Scholar] [CrossRef]
- Sauvola, J.; Tarkoma, S.; Klemettinen, M.; Riekki, J.; Doermann, D. Future of software development with generative AI. Autom. Softw. Eng. 2024, 31, 26. [Google Scholar] [CrossRef]
- Liu, Y.; Lo, S.K.; Lu, Q.; Zhu, L.; Zhao, D.; Xu, X.; Harrer, S.; Whittle, J. Agent design pattern catalogue: A collection of architectural patterns for foundation model based agents. J. Syst. Softw. 2025, 220, 112278. [Google Scholar] [CrossRef]
- Austin, J.; Odena, A.; Nye, M.; Bosma, M.; Michalewski, H.; Dohan, D.; Jiang, E.; Cai, C.; Terry, M.; Le, Q.; et al. Program synthesis with large language models. arXiv 2021, arXiv:2108.07732. [Google Scholar] [CrossRef]
- Yan, D.; Gao, Z.; Liu, Z. A closer look at different difficulty levels code generation abilities of ChatGPT. In Proceedings of the 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 11–15 September 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1887–1898. [Google Scholar]
- Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H.P.D.O.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating large language models trained on code. arXiv 2021, arXiv:2107.03374. [Google Scholar] [CrossRef]
- Liang, J.T.; Badea, C.; Bird, C.; DeLine, R.; Ford, D.; Forsgren, N.; Zimmermann, T. Can gpt-4 replicate empirical software engineering research? Proc. ACM Softw. Eng. 2024, 1, 1330–1353. [Google Scholar] [CrossRef]
- Li, R.; Allal, L.B.; Zi, Y.; Muennighoff, N.; Kocetkov, D.; Mou, C.; Marone, M.; Akiki, C.; Li, J.; Chim, J.; et al. StarCoder: May the source be with you! arXiv 2023, arXiv:2305.06161. [Google Scholar] [CrossRef]
- Roziere, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Sauvestre, R.; Remez, T.; et al. Code llama: Open foundation models for code. arXiv 2023, arXiv:2308.12950. [Google Scholar]
- Wang, W.; Wang, Y.; Joty, S.; Hoi, S.C. RAP-Gen: Retrieval-augmented patch generation with CodeT5 for automatic program repair. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, San Francisco, CA, USA, 5–7 December 2023; pp. 146–158. [Google Scholar]
- Li, J.; Tao, C.; Li, J.; Li, G.; Jin, Z.; Zhang, H.; Fang, Z.; Liu, F. Large language model-aware in-context learning for code generation. ACM Trans. Softw. Eng. Methodol. 2023, 34, 190. [Google Scholar] [CrossRef]
- Xu, W.; Gao, K.; He, H.; Zhou, M. LiCoEval: Evaluating LLMs on license compliance in code generation. arXiv 2024, arXiv:2408.02487. [Google Scholar]
- Bui, T.D.; Vu, T.T.; Nguyen, T.T.; Nguyen, S.; Vo, H.D. Correctness assessment of code generated by Large Language Models using internal representations. J. Syst. Softw. 2025, 230, 112570. [Google Scholar] [CrossRef]
- Banh, L.; Holldack, F.; Strobel, G. Copiloting the future: How generative AI transforms Software Engineering. Inf. Softw. Technol. 2025, 183, 107751. [Google Scholar] [CrossRef]
- Da Silva, L.; Samhi, J.; Khomh, F. LLMs and Stack Overflow discussions: Reliability, impact, and challenges. J. Syst. Softw. 2025, 230, 112541. [Google Scholar] [CrossRef]
- Yang, G.; Zhou, Y.; Chen, X.; Zhang, X.; Zhuo, T.Y.; Chen, T. Chain-of-thought in neural code generation: From and for lightweight language models. IEEE Trans. Softw. Eng. 2024, 50, 2437–2457. [Google Scholar] [CrossRef]
- Alagarsamy, S.; Tantithamthavorn, C.; Takerngsaksiri, W.; Arora, C.; Aleti, A. Enhancing large language models for text-to-testcase generation. J. Syst. Softw. 2025, 230, 112531. [Google Scholar] [CrossRef]
- Basha, M.; Rodríguez-Pérez, G. Trust, transparency, and adoption in generative AI for software engineering: Insights from Twitter discourse. Inf. Softw. Technol. 2025, 186, 107804. [Google Scholar] [CrossRef]
- Hou, X.; Zhao, Y.; Liu, Y.; Yang, Z.; Wang, K.; Li, L.; Luo, X.; Lo, D.; Grundy, J.; Wang, H. Large language models for software engineering: A systematic literature review. ACM Trans. Softw. Eng. Methodol. 2024, 33, 220. [Google Scholar] [CrossRef]
- Wang, J.; Huang, Y.; Chen, C.; Liu, Z.; Wang, S.; Wang, Q. Software testing with large language models: Survey, landscape, and vision. IEEE Trans. Softw. Eng. 2024, 50, 911–936. [Google Scholar] [CrossRef]
- Hemmat, A.; Sharbaf, M.; Kolahdouz-Rahimi, S.; Lano, K.; Tehrani, S.Y. Research directions for using LLM in software requirement engineering: A systematic review. Front. Comput. Sci. 2025, 7, 1519437. [Google Scholar] [CrossRef]
- Rasnayaka, S.; Wang, G.; Shariffdeen, R.; Iyer, G.N. An empirical study on usage and perceptions of LLMs in a software engineering project. In Proceedings of the 1st International Workshop on Large Language Models for Code, Lisbon, Portugal, 20 April 2024; pp. 111–118. [Google Scholar]
- Wei, B. Requirements are all you need: From requirements to code with LLMs. In Proceedings of the 2024 IEEE 32nd International Requirements Engineering Conference (RE), Reykjavik, Iceland, 24–28 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 416–422. [Google Scholar]
- Lubos, S.; Felfernig, A.; Tran, T.N.T.; Garber, D.; El Mansi, M.; Erdeniz, S.P.; Le, V.M. Leveraging LLMs for the quality assurance of software requirements. In Proceedings of the 2024 IEEE 32nd International Requirements Engineering Conference (RE), Reykjavik, Iceland, 24–28 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 389–397. [Google Scholar]
- Krishna, M.; Gaur, B.; Verma, A.; Jalote, P. Using llms in software requirements specifications: An empirical evaluation. In Proceedings of the 2024 IEEE 32nd International Requirements Engineering Conference (RE), Reykjavik, Iceland, 24–28 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 475–483. [Google Scholar]
- Feng, N.; Marsso, L.; Yaman, S.G.; Standen, I.; Baatartogtokh, Y.; Ayad, R.; De Mello, V.O.; Townsend, B.; Bartels, H.; Cavalcanti, A.; et al. Normative requirements operationalization with large language models. In Proceedings of the 2024 IEEE 32nd International Requirements Engineering Conference (RE), Reykjavik, Iceland, 24–28 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 129–141. [Google Scholar]
- Mu, F.; Shi, L.; Wang, S.; Yu, Z.; Zhang, B.; Wang, C.; Liu, S.; Wang, Q. ClarifyGPT: A framework for enhancing LLM-based code generation via requirements clarification. Proc. ACM Softw. Eng. 2024, 1, 2332–2354. [Google Scholar] [CrossRef]
- Ferrari, A.; Spoletini, P. Formal requirements engineering and large language models: A two-way roadmap. Inf. Softw. Technol. 2025, 181, 107697. [Google Scholar] [CrossRef]
- Dong, Y.; Kong, L.; Zhang, L.; Wang, S.; Liu, X.; Liu, S.; Chen, M. A search-and-fill strategy to code generation for complex software requirements. Inf. Softw. Technol. 2025, 177, 107584. [Google Scholar] [CrossRef]
- Hassani, S.; Sabetzadeh, M.; Amyot, D. An empirical study on LLM-based classification of requirements-related provisions in food-safety regulations. Empir. Softw. Eng. 2025, 30, 72. [Google Scholar] [CrossRef]
- Odu, O.; Belle, A.B.; Wang, S.; Kpodjedo, S.; Lethbridge, T.C.; Hemmati, H. Automatic instantiation of assurance cases from patterns using large language models. J. Syst. Softw. 2025, 222, 112353. [Google Scholar] [CrossRef]
- Maranhão, J.J.; Guerra, E.M. A prompt pattern sequence approach to apply generative AI in assisting software architecture decision-making. In Proceedings of the 29th European Conference on Pattern Languages of Programs, People, and Practices, Irsee, Germany, 3–7 July 2024; pp. 1–12. [Google Scholar]
- Zhao, J.; Yang, Z.; Zhang, L.; Lian, X.; Yang, D.; Tan, X. DRMiner: Extracting latent design rationale from Jira issue logs. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 468–480. [Google Scholar]
- Ahlgren, T.L.; Sunde, H.F.; Kemell, K.K.; Nguyen-Duc, A. Assisting early-stage software startups with LLMs: Effective prompt engineering and system instruction design. Inf. Softw. Technol. 2025, 187, 107832. [Google Scholar] [CrossRef]
- Cordeiro, J.; Noei, S.; Zou, Y. An empirical study on the code refactoring capability of large language models. arXiv 2024, arXiv:2411.02320. [Google Scholar] [CrossRef]
- Ishizue, R.; Sakamoto, K.; Washizaki, H.; Fukazawa, Y. Improved program repair methods using refactoring with GPT models. In Proceedings of the 55th ACM Technical Symposium on Computer Science Education, Portland, OR, USA, 20–23 March 2024; Volume 1, pp. 569–575. [Google Scholar]
- Pomian, D.; Bellur, A.; Dilhara, M.; Kurbatova, Z.; Bogomolov, E.; Sokolov, A.; Bryksin, T.; Dig, D. EM-Assist: Safe automated extract-method refactoring with LLMs. In Proceedings of the Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, Porto de Galinhas, Brazil, 15–19 July 2024; pp. 582–586. [Google Scholar]
- Wu, D.; Mu, F.; Shi, L.; Guo, Z.; Liu, K.; Zhuang, W.; Zhong, Y.; Zhang, L. iSMELL: Assembling LLMs with expert toolsets for code smell detection and refactoring. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1345–1357. [Google Scholar]
- Xu, K.; Zhang, G.L.; Yin, X.; Zhuo, C.; Schlichtmann, U.; Li, B. HLSRewriter: Efficient Refactoring and Optimization of C/C++ Code with LLMs for High-Level Synthesis. ACM Trans. Des. Autom. Electron. Syst. 2025. [Google Scholar] [CrossRef]
- Zhao, J.; Song, Y.; Cohen, E. Variational Prefix Tuning for diverse and accurate code summarization using pre-trained language models. J. Syst. Softw. 2025, 229, 112493. [Google Scholar] [CrossRef]
- Zubair, F.; Al-Hitmi, M.; Catal, C. The use of large language models for program repair. Comput. Stand. Interfaces 2025, 93, 103951. [Google Scholar] [CrossRef]
- Jin, M.; Shahriar, S.; Tufano, M.; Shi, X.; Lu, S.; Sundaresan, N.; Svyatkovskiy, A. InferFix: End-to-end program repair with LLMs. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, San Francisco, CA, USA, 5–7 December 2023; pp. 1646–1656. [Google Scholar]
- Luo, W.; Keung, J.; Yang, B.; Ye, H.; Le Goues, C.; Bissyande, T.F.; Tian, H.; Le, X.B.D. When Fine-Tuning LLMs Meets Data Privacy: An Empirical Study of Federated Learning in LLM-Based Program Repair. ACM Trans. Softw. Eng. Methodol. 2024. [Google Scholar] [CrossRef]
- Li, H.; Hao, Y.; Zhai, Y.; Qian, Z. Enhancing static analysis for practical bug detection: An LLM-integrated approach. Proc. ACM Program. Lang. 2024, 8, 474–499. [Google Scholar] [CrossRef]
- Guan, H.; Bai, G.; Liu, Y. CrossProbe: LLM-Empowered Cross-Project Bug Detection for Deep Learning Frameworks. Proc. ACM Softw. Eng. 2025, 2, 2430–2452. [Google Scholar] [CrossRef]
- Huang, K.; Meng, X.; Zhang, J.; Liu, Y.; Wang, W.; Li, S.; Zhang, Y. An empirical study on fine-tuning large language models of code for automated program repair. In Proceedings of the 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 11–15 September 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1162–1174. [Google Scholar]
- Huang, K.; Zhang, J.; Meng, X.; Liu, Y. Template-guided program repair in the era of large language models. In Proceedings of the 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), Ottawa, ON, Canada, 27 April–3 May 2025; IEEE Computer Society: Washington, DC, USA, 2025; pp. 367–379. [Google Scholar]
- Li, G.; Zhi, C.; Chen, J.; Han, J.; Deng, S. Exploring parameter-efficient fine-tuning of large language model on automated program repair. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 719–731. [Google Scholar]
- Kong, J.; Xie, X.; Liu, S. Demystifying Memorization in LLM-Based Program Repair via a General Hypothesis Testing Framework. Proc. ACM Softw. Eng. 2025, 2, 2712–2734. [Google Scholar] [CrossRef]
- Lajkó, M.; Csuvik, V.; Gyimothy, T.; Vidács, L. Automated program repair with the GPT family, including GPT-2, GPT-3 and Codex. In Proceedings of the 5th ACM/IEEE International Workshop on Automated Program Repair, Lisbon, Portugal, 20 April 2024; pp. 34–41. [Google Scholar]
- Xiao, J.; Xu, Z.; Chen, S.; Lei, G.; Fan, G.; Cao, Y.; Deng, S.; Feng, Z. Confix: Combining node-level fix templates and masked language model for automatic program repair. J. Syst. Softw. 2024, 216, 112116. [Google Scholar] [CrossRef]
- Zhang, Y.; Jin, Z.; Xing, Y.; Li, G.; Liu, F.; Zhu, J.; Dou, W.; Wei, J. PATCH: Empowering Large Language Model with Programmer-Intent Guidance and Collaborative-Behavior Simulation for Automatic Bug Fixing. ACM Trans. Softw. Eng. Methodol. 2025, 35, 3. [Google Scholar] [CrossRef]
- Shivashankar, K.; Orucevic, M.; Kruke, M.M.; Martini, A. BEACon-TD: Classifying Technical Debt and its types across diverse software projects issues using transformers. J. Syst. Softw. 2025, 226, 112435. [Google Scholar] [CrossRef]
- Ouédraogo, W.C.; Kaboré, K.; Li, Y.; Tian, H.; Koyuncu, A.; Klein, J.; Lo, D.; Bissyandé, T.F. Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation. arXiv 2024, arXiv:2407.00225. [Google Scholar] [CrossRef]
- Bose, D.B. From Prompts to Properties: Rethinking LLM Code Generation with Property-Based Testing. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, Trondheim, Norway, 23–28 June 2025; pp. 1660–1665. [Google Scholar]
- Huang, D.; Zhang, J.M.; Bu, Q.; Xie, X.; Chen, J.; Cui, H. Bias testing and mitigation in LLM-based code generation. ACM Trans. Softw. Eng. Methodol. 2024, 35, 5. [Google Scholar] [CrossRef]
- Boukhlif, M.; Kharmoum, N.; Hanine, M. LLMs for intelligent software testing: A comparative study. In Proceedings of the 7th International Conference on Networking, Intelligent Systems and Security, Meknes, Morocco, 18–19 April 2024; pp. 1–8. [Google Scholar]
- Liao, Y.; Zhang, J.; Keung, J.; Xiao, Y.; Dai, Y. Advancing autonomous driving system testing: Demands, challenges, and future directions. Inf. Softw. Technol. 2025, 187, 107859. [Google Scholar] [CrossRef]
- Dakhel, A.M.; Nikanjam, A.; Majdinasab, V.; Khomh, F.; Desmarais, M.C. Effective test generation using pre-trained Large Language Models and mutation testing. Inf. Softw. Technol. 2024, 171, 107468. [Google Scholar] [CrossRef]
- Ihalage, A.; Taheri, S.; Muhammad, F.; Al-Raweshidy, H. Convolutional Versus Large Language Models for Software Log Classification in Edge-Deployable Cellular Network Testing. IEEE Access 2025, 13, 134283–134296. [Google Scholar] [CrossRef]
- Zhang, Y.; Chen, T.Y.; Pike, M.; Towey, D.; Ying, Z.; Zhou, Z.Q. Enhancing autonomous driving simulations: A hybrid metamorphic testing framework with metamorphic relations generated by GPT. Inf. Softw. Technol. 2025, 187, 107828. [Google Scholar] [CrossRef]
- Altin, M.; Mutlu, B.; Kilinc, D.; Cakir, A. Automated Testing for Service-Oriented Architecture: Leveraging Large Language Models for Enhanced Service Composition. IEEE Access 2025, 13, 89627–89640. [Google Scholar] [CrossRef]
- De Siano, G.D.; Fasolino, A.R.; Sperlí, G.; Vignali, A. Translating code with Large Language Models and human-in-the-loop feedback. Inf. Softw. Technol. 2025, 186, 107785. [Google Scholar] [CrossRef]
- Sasaki, Y.; Washizaki, H.; Li, J.; Sander, D.; Yoshioka, N.; Fukazawa, Y. Systematic literature review of prompt engineering patterns in software engineering. In Proceedings of the 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), Osaka, Japan, 2–4 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 670–675. [Google Scholar]
- Felizardo, K.R.; Steinmacher, I.; Lima, M.S.; Deizepe, A.; Conte, T.U.; Barcellos, M.P. Data extraction for systematic mapping study using a large language model-a proof-of-concept study in software engineering. In Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, Barcelona, Spain, 24–25 October 2024; pp. 407–413. [Google Scholar]
- Khan, Z.U.; Nasim, B.; Rasheed, Z. Generative AI-based predictive maintenance in aviation: A systematic literature review. Ceas Aeronaut. J. 2025, 16, 537–555. [Google Scholar] [CrossRef]
- Garcia, M.B. Teaching and learning computer programming using ChatGPT: A rapid review of literature amid the rise of generative AI technologies. Educ. Inf. Technol. 2025, 30, 16721–16745. [Google Scholar] [CrossRef]
- Lin, F.; Kim, D.J. SOEN-101: Code generation by emulating software process models using large language model agents. arXiv 2024, arXiv:2403.15852. [Google Scholar]
- Li, X.; Yuan, S.; Gu, X.; Chen, Y.; Shen, B. Few-shot code translation via task-adapted prompt learning. J. Syst. Softw. 2024, 212, 112002. [Google Scholar] [CrossRef]
- Pornprasit, C.; Tantithamthavorn, C. Fine-tuning and prompt engineering for large language models-based code review automation. Inf. Softw. Technol. 2024, 175, 107523. [Google Scholar] [CrossRef]
- Yang, Z.; Keung, J.W.; Sun, Z.; Zhao, Y.; Li, G.; Jin, Z.; Liu, S.; Li, Y. Improving domain-specific neural code generation with few-shot meta-learning. Inf. Softw. Technol. 2024, 166, 107365. [Google Scholar] [CrossRef]
- Yun, S.; Lin, S.; Gu, X.; Shen, B. Project-specific code summarization with in-context learning. J. Syst. Softw. 2024, 216, 112149. [Google Scholar] [CrossRef]
- Eagal, A.; Stolee, K.T.; Ore, J.P. Analyzing the dependability of Large Language Models for code clone generation. J. Syst. Softw. 2025, 230, 112548. [Google Scholar] [CrossRef]
- Pan, Y.; Lyu, C.; Yang, Z.; Li, L.; Liu, Q.; Shao, X. E-code: Mastering efficient code generation through pretrained models and expert encoder group. Inf. Softw. Technol. 2025, 178, 107602. [Google Scholar] [CrossRef]
- Liu, J.; Li, C.; Chen, R.; Li, S.; Gu, B.; Yang, M. STRUT: Structured Seed Case Guided Unit Test Generation for C Programs using LLMs. Proc. ACM Softw. Eng. 2025, 2, 2113–2135. [Google Scholar] [CrossRef]
- Wang, Z.; Liu, K.; Li, G.; Jin, Z. HITS: High-coverage LLM-based unit test generation via method slicing. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1258–1268. [Google Scholar]
- Su, C.Y.; Bansal, A.; Huang, Y.; Li, T.J.J.; McMillan, C. Context-aware code summary generation. J. Syst. Softw. 2025, 231, 112580. [Google Scholar] [CrossRef]
- Kim, D.K.; Ming, H. Assessing output reliability and similarity of large language models in software development: A comparative case study approach. Inf. Softw. Technol. 2025, 185, 107787. [Google Scholar] [CrossRef]
- Zhang, Z.; Wang, C.; Wang, Y.; Shi, E.; Ma, Y.; Zhong, W.; Chen, J.; Mao, M.; Zheng, Z. LLM hallucinations in practical code generation: Phenomena, mechanism, and mitigation. Proc. ACM Softw. Eng. 2025, 2, 481–503. [Google Scholar] [CrossRef]
- Khanshan, A.; Van Gorp, P.; Markopoulos, P. Evaluation of Code Generation for Simulating Participant Behavior in Experience Sampling Method by Iterative In-Context Learning of a Large Language Model. Proc. ACM Hum.-Comput. Interact. 2024, 8, 255. [Google Scholar] [CrossRef]
- Wang, J.; Liu, S.; Xie, X.; Li, Y. An empirical study to evaluate AIGC detectors on code content. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 844–856. [Google Scholar]
- Firouzi, E.; Ghafari, M. Time to separate from StackOverflow and match with ChatGPT for encryption. J. Syst. Softw. 2024, 216, 112135. [Google Scholar] [CrossRef]
- Qu, Y.; Huang, S.; Chen, X.; Bai, T.; Yao, Y. An input-denoising-based defense against stealthy backdoor attacks in large language models for code. Inf. Softw. Technol. 2025, 180, 107661. [Google Scholar] [CrossRef]
- Moumoula, M.B.; Kabore, A.K.; Klein, J.; Bissyande, T.F. Cross-lingual Code Clone Detection: When LLMs Fail Short Against Embedding-based Classifier. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; ACM: New York, NY, USA, 2024; pp. 2474–2475. [Google Scholar] [CrossRef]
- Durán, F.; Martinez, M.; Lago, P.; Martínez-Fernández, S. Insights into resource utilization of code small language models serving with runtime engines and execution providers. J. Syst. Softw. 2025, 230, 112574. [Google Scholar] [CrossRef]
- Voria, G.; Casillo, F.; Gravino, C.; Catolino, G.; Palomba, F. RECOVER: Toward Requirements Generation from Stakeholders’ Conversations. IEEE Trans. Softw. Eng. 2025, 51, 1912–1933. [Google Scholar] [CrossRef]
- Nikolakopoulos, A.; Litke, A.; Psychas, A.; Veroni, E.; Varvarigou, T. Exploring the potential of offline LLMs in data science: A study on code generation for data analysis. IEEE Access 2025, 13, 64087–64114. [Google Scholar] [CrossRef]
- Schäfer, M.; Nadi, S.; Eghbali, A.; Tip, F. An empirical evaluation of using large language models for automated unit test generation. IEEE Trans. Softw. Eng. 2023, 50, 85–105. [Google Scholar] [CrossRef]
- Tang, Y.; Liu, Z.; Zhou, Z.; Luo, X. ChatGPT vs SBST: A comparative assessment of unit test suite generation. IEEE Trans. Softw. Eng. 2024, 50, 1340–1359. [Google Scholar] [CrossRef]
- Rahman, S.; Kuhar, S.; Cirisci, B.; Garg, P.; Wang, S.; Ma, X.; Deoras, A.; Ray, B. UTFix: Change aware unit test repairing using LLM. Proc. ACM Program. Lang. 2025, 9, 143–168. [Google Scholar] [CrossRef]
- Ardimento, P.; Capuzzimati, M.; Casalino, G.; Schicchi, D.; Taibi, D. A novel LLM-based classifier for predicting bug-fixing time in Bug Tracking Systems. J. Syst. Softw. 2025, 230, 112569. [Google Scholar] [CrossRef]
- Nguyen, T.T.; Vu, T.T.; Vo, H.D.; Nguyen, S. An empirical study on capability of Large Language Models in understanding code semantics. Inf. Softw. Technol. 2025, 185, 107780. [Google Scholar] [CrossRef]
- Cotroneo, D.; Foggia, A.; Improta, C.; Liguori, P.; Natella, R. Automating the correctness assessment of AI-generated code for security contexts. J. Syst. Softw. 2024, 216, 112113. [Google Scholar] [CrossRef]
- Moumoula, M.B.; Kaboré, A.K.; Klein, J.; Bissyandé, T.F. The Struggles of LLMs in Cross-Lingual Code Clone Detection. Proc. ACM Softw. Eng. 2025, 2, 1023–1045. [Google Scholar] [CrossRef]
- Dil, C.; Chen, H.; Damevski, K. Towards higher quality software vulnerability data using LLM-based patch filtering. J. Syst. Softw. 2025, 230, 112581. [Google Scholar] [CrossRef]
- Wu, Y.; Li, Z.; Zhang, J.M.; Liu, Y. ConDefects: A complementary dataset to address the data leakage concern for LLM-based fault localization and program repair. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, Porto de Galinhas, Brazil, 15–19 July 2024; pp. 642–646. [Google Scholar]
- Wang, R.; Guo, J.; Gao, C.; Fan, G.; Chong, C.Y.; Xia, X. Can LLMs replace human evaluators? An empirical study of LLM-as-a-judge in software engineering. Proc. ACM Softw. Eng. 2025, 2, 1955–1977. [Google Scholar] [CrossRef]
- Kalouptsoglou, I.; Siavvas, M.; Ampatzoglou, A.; Kehagias, D.; Chatzigeorgiou, A. Transfer learning for software vulnerability prediction using Transformer models. J. Syst. Softw. 2025, 227, 112448. [Google Scholar] [CrossRef]
- Xiong, H.; Yang, Y.; Wu, H.; Zhong, X.; Tang, Y.; Xia, Z.; Wang, X.; Yan, J. Reinvent the Operation not the Architecture: Quantum-inspired High-order Product for Compatible and Improved LLMs Training. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Toronto, ON, Canada, 3–7 August 2025; Volume 2, pp. 3356–3365. [Google Scholar]
- Tu, H.; Zhou, Z.; Jiang, H.; Yusuf, I.N.B.; Li, Y.; Jiang, L. Isolating compiler bugs by generating effective witness programs with large language models. IEEE Trans. Softw. Eng. 2024, 50, 1768–1788. [Google Scholar] [CrossRef]
- Ge, C.; Wang, T.; Yang, X.; Treude, C. Cross-Level Requirements Tracing Based on Large Language Models. IEEE Trans. Softw. Eng. 2025, 51, 2044–2066. [Google Scholar] [CrossRef]
- Fazelnia, M.; Mirakhorli, M.; Bagheri, H. Translation titans, reasoning challenges: Satisfiability-aided language models for detecting conflicting requirements. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 2294–2298. [Google Scholar]
- Hassani, S. Enhancing legal compliance and regulation analysis with large language models. In Proceedings of the 2024 IEEE 32nd International Requirements Engineering Conference (RE), Reykjavik, Iceland, 24–28 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 507–511. [Google Scholar]
- Wu, J.J.; Fard, F.H. HumanEvalComm: Benchmarking the communication competence of code generation for LLMs and LLM agent. arXiv 2024, arXiv:2406.00215. [Google Scholar] [CrossRef]
- Tagliaferro, A.; Corboe, S.; Guindani, B. Leveraging LLMs to Automate Software Architecture Design from Informal Specifications. In Proceedings of the 2025 IEEE 22nd International Conference on Software Architecture Companion (ICSA-C), Odense, Denmark, 31 March–4 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 291–299. [Google Scholar]
- Duarte, C.E. Automated Microservice Pattern Instance Detection Using Infrastructure-as-Code Artifacts and Large Language Models. In Proceedings of the 2025 IEEE 22nd International Conference on Software Architecture Companion (ICSA-C), Odense, Denmark, 31 March–4 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 161–166. [Google Scholar]
- Ou, Y.; Su, C.; Chen, L.; Li, Y.; Zhou, Y. Binding of C++ and JavaScript through automated glue code generation. J. Syst. Softw. 2025, 230, 112565. [Google Scholar] [CrossRef]
- Guo, L.; Wang, Y.; Shi, E.; Zhong, W.; Zhang, H.; Chen, J.; Zhang, R.; Ma, Y.; Zheng, Z. When to stop? Towards efficient code generation in LLMs with excess token prevention. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, Vienna, Austria, 16–20 September 2024; pp. 1073–1085. [Google Scholar]
- Yu, Z.; Li, C.; Zhang, Y.; Liu, M.; Pinckney, N.; Zhou, W.; Yang, H.; Liang, R.; Ren, H.; Lin, Y.C. LLM4HWDesign Contest: Constructing a Comprehensive Dataset for LLM-Assisted Hardware Code Generation with Community Efforts. In Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, New York, NY, USA, 1 October–1 November 2024; pp. 1–5. [Google Scholar]
- Birillo, A.; Artser, E.; Potriasaeva, A.; Vlasov, I.; Dzialets, K.; Golubev, Y.; Gerasimov, I.; Keuning, H.; Bryksin, T. One step at a time: Combining LLMs and static analysis to generate next-step hints for programming tasks. In Proceedings of the 24th Koli Calling International Conference on Computing Education Research, Koli, Finland, 12–17 November 2024; pp. 1–12. [Google Scholar]
- Almanasra, S.; Suwais, K. Analysis of ChatGPT-generated codes across multiple programming languages. IEEE Access 2025, 13, 23580–23596. [Google Scholar] [CrossRef]
- Luo, Y.; Yu, R.; Zhang, F.; Liang, L.; Xiong, Y. Bridging gaps in LLM code translation: Reducing errors with call graphs and bridged debuggers. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 2448–2449. [Google Scholar]
- Imran, M.M.; Chatterjee, P.; Damevski, K. Shedding light on software engineering-specific metaphors and idioms. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, Portugal, 14–20 April 2024; pp. 1–13. [Google Scholar]
- Yu, L.; Huang, Z.; Yuan, H.; Cheng, S.; Yang, L.; Zhang, F.; Shen, C.; Ma, J.; Zhang, J.; Lu, J.; et al. Smart-LLaMA-DPO: Reinforced Large Language Model for Explainable Smart Contract Vulnerability Detection. Proc. ACM Softw. Eng. 2025, 2, 182–205. [Google Scholar] [CrossRef]
- Zhu, Y.; Yu, S.; Zong, Z.; Wang, Y.; Zhao, Y.; Chen, Z. Text-image fusion template for large language model assisted crowdsourcing test aggregation. J. Syst. Softw. 2025, 228, 112478. [Google Scholar] [CrossRef]
- Bukhary, N.; Ahmad, M.; Rashad, K.; Rai, S.; Shapsough, S.; Kaddoura, Y.; Dghaym, D.; Zualkernan, I. Few-Shot Evaluation of Vision Language Models for Detecting Visual Defects in Autonomous Vehicle Software Requirement Specifications. IEEE Access 2025, 13, 117914–117942. [Google Scholar] [CrossRef]
- Xiang, B.; Shao, Y. SUMLLAMA: Efficient Contrastive Representations and Fine-Tuned Adapters for Bug Report Summarization. IEEE Access 2024, 12, 78562–78571. [Google Scholar] [CrossRef]
- Sun, T.; Xu, J.; Li, Y.; Yan, Z.; Zhang, G.; Xie, L.; Geng, L.; Wang, Z.; Chen, Y.; Lin, Q.; et al. BitsAI-CR: Automated code review via LLM in practice. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, Trondheim, Norway, 23–28 June 2025; pp. 274–285. [Google Scholar]
- Li, Y.; Liu, B.; Zhang, T.; Wang, Z.; Lo, D.; Yang, L.; Lyu, J.; Zhang, H. A Knowledge Enhanced Large Language Model for Bug Localization. Proc. ACM Softw. Eng. 2025, 2, 1914–1936. [Google Scholar] [CrossRef]
- Boi, B.; Esposito, C.; Lee, S. Smart contract vulnerability detection: The role of large language model (LLM). ACM SIGAPP Appl. Comput. Rev. 2024, 24, 19–29. [Google Scholar] [CrossRef]
- Kessel, M.; Atkinson, C. Promoting open science in test-driven software experiments. J. Syst. Softw. 2024, 212, 111971. [Google Scholar] [CrossRef]
- Bin Murtaza, S.; Mccoy, A.; Ren, Z.; Murphy, A.; Banzhaf, W. LLM fault localisation within evolutionary computation based automated program repair. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, Melbourne, Australia, 14–18 July 2024; pp. 1824–1829. [Google Scholar]
- Ouedraogo, W.C.; Kabore, K.; Tian, H.; Song, Y.; Koyuncu, A.; Klein, J.; Lo, D.; Bissyande, T.F. LLMs and prompting for unit test generation: A large-scale evaluation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 2464–2465. [Google Scholar]
- Eshghie, M.; Artho, C. Oracle-guided vulnerability diversity and exploit synthesis of smart contracts using LLMs. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 2240–2248. [Google Scholar]
- Huang, K.; Zhang, J.; Bao, X.; Wang, X.; Liu, Y. Comprehensive Fine-Tuning Large Language Models of Code for Automated Program Repair. IEEE Trans. Softw. Eng. 2025, 51, 904–928. [Google Scholar] [CrossRef]
- Soud, M.; Nuutinen, W.; Liebel, G. Sóley: Automated detection of logic vulnerabilities in Ethereum smart contracts using large language models. J. Syst. Softw. 2025, 226, 112406. [Google Scholar] [CrossRef]
- Li, X.; Wang, S.; Li, S.; Ma, J.; Yu, J.; Liu, X.; Wang, J.; Ji, B.; Zhang, W. Model editing for LLMs4Code: How far are we? arXiv 2024, arXiv:2411.06638. [Google Scholar] [CrossRef]
- Kumar, J.; Chimalakonda, S. Code summarization without direct access to code - towards exploring federated LLMs for software engineering. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, Salerno, Italy, 18–21 June 2024; pp. 100–109. [Google Scholar]
- Ahmed, T.; Devanbu, P. Few-shot training LLMs for project-specific code-summarization. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, Rochester, MI, USA, 10–14 October 2022; pp. 1–5. [Google Scholar]
- Yang, Y.; Zhou, X.; Mao, R.; Xu, J.; Yang, L.; Zhang, Y.; Shen, H.; Zhang, H. DLAP: A Deep Learning Augmented Large Language Model Prompting framework for software vulnerability detection. J. Syst. Softw. 2025, 219, 112234. [Google Scholar] [CrossRef]
- Cai, Z.; Chen, J.; Chen, W.; Wang, W.; Zhu, X.; Ouyang, A. F-CodeLLM: A federated learning framework for adapting large language models to practical software development. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, Lisbon, Portugal, 14–20 April 2024; pp. 416–417. [Google Scholar]
- Xia, C.S.; Deng, Y.; Dunn, S.; Zhang, L. Demystifying LLM-based software engineering agents. Proc. ACM Softw. Eng. 2025, 2, 801–824. [Google Scholar] [CrossRef]
- Alami, A.; Jensen, V.V.; Ernst, N.A. Accountability in code review: The role of intrinsic drivers and the impact of LLMs. ACM Trans. Softw. Eng. Methodol. 2025, 34, 233. [Google Scholar] [CrossRef]
- Cinkusz, K.; Chudziak, J.A. Towards LLM-augmented multiagent systems for agile software engineering. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 2476–2477. [Google Scholar]
- Husain, M.; Khan, M.S.; Khan, J.A.; Khan, N.D.; Khan, A.; Akbar, M.A. Exploring Developers Discussion Forums for Quantum Software Engineering: A Fine-Grained Classification Approach Using Large Language Model (ChatGPT). In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, Trondheim, Norway, 23–28 June 2025; pp. 1742–1755. [Google Scholar]
- Ahmed, T.; Pai, K.S.; Devanbu, P.; Barr, E.T. Automatic semantic augmentation of language model prompts (for code summarization). In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE), Lisbon, Portugal, 14–20 April 2024; IEEE Computer Society: Washington, DC, USA, 2024; p. 1004. [Google Scholar]
- Zhang, Y.; Qiu, Z.; Stol, K.J.; Zhu, W.; Zhu, J.; Tian, Y.; Liu, H. Automatic commit message generation: A critical review and directions for future work. IEEE Trans. Softw. Eng. 2024, 50, 816–835. [Google Scholar] [CrossRef]
- Tufano, R.; Dabić, O.; Mastropaolo, A.; Ciniselli, M.; Bavota, G. Code review automation: Strengths and weaknesses of the state of the art. IEEE Trans. Softw. Eng. 2024, 50, 338–353. [Google Scholar] [CrossRef]
- Estévez-Ayres, I.; Callejo, P.; Hombrados-Herrera, M.Á.; Alario-Hoyos, C.; Delgado Kloos, C. Evaluation of LLM Tools for Feedback Generation in a Course on Concurrent Programming. Int. J. Artif. Intell. Educ. 2024, 35, 774–790. [Google Scholar] [CrossRef]
- Choi, S.; Kim, H. The impact of a large language model-based programming learning environment on students’ motivation and programming ability. Educ. Inf. Technol. 2024, 30, 8109–8138. [Google Scholar] [CrossRef]
- Ahmed, T.; Devanbu, P. Better patching using LLM prompting, via self-consistency. In Proceedings of the 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 11–15 September 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1742–1746. [Google Scholar]
- Ságodi, Z.; Siket, I.; Ferenc, R. Methodology for code synthesis evaluation of LLMs presented by a case study of ChatGPT and Copilot. IEEE Access 2024, 12, 72303–72316. [Google Scholar] [CrossRef]
- Hassani, S.; Sabetzadeh, M.; Amyot, D.; Liao, J. Rethinking legal compliance automation: Opportunities with large language models. In Proceedings of the 2024 IEEE 32nd International Requirements Engineering Conference (RE), Reykjavik, Iceland, 24–28 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 432–440. [Google Scholar]
- Colavito, G.; Lanubile, F.; Novielli, N. Benchmarking large language models for automated labeling: The case of issue report classification. Inf. Softw. Technol. 2025, 184, 107758. [Google Scholar] [CrossRef]
- Cai, Y.; Liang, P.; Wang, Y.; Li, Z.; Shahin, M. Demystifying issues, causes and solutions in LLM open-source projects. J. Syst. Softw. 2025, 227, 112452. [Google Scholar] [CrossRef]
- Yan, M.; Chen, J.; Zhang, J.M.; Cao, X.; Yang, C.; Harman, M. Robustness evaluation of code generation systems via concretizing instructions. Inf. Softw. Technol. 2025, 179, 107645. [Google Scholar] [CrossRef]
- Ma, Q.; Peng, W.; Yang, C.; Shen, H.; Koedinger, K.; Wu, T. What should we engineer in prompts? Training humans in requirement-driven LLM use. ACM Trans. Comput.-Hum. Interact. 2025, 32, 41. [Google Scholar] [CrossRef]
- Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef]
- Xia, Y.; Xiao, Z.; Jazdi, N.; Weyrich, M. Generation of asset administration shell with large language model agents: Toward semantic interoperability in digital twins in the context of industry 4.0. IEEE Access 2024, 12, 84863–84877. [Google Scholar] [CrossRef]
- Kitchenham, B.; Charters, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering; EBSE Technical Report, ver. 2.3; Keele University: Keele, UK, 2007. Available online: https://www.researchgate.net/publication/302924724_Guidelines_for_performing_Systematic_Literature_Reviews_in_Software_Engineering (accessed on 4 December 2025).
- Bouzenia, I.; Devanbu, P.; Pradel, M. RepairAgent: An autonomous, LLM-based agent for program repair. arXiv 2024, arXiv:2403.17134. [Google Scholar]
- Li, Y.; Choi, D.; Chung, J.; Kushman, N.; Schrittwieser, J.; Leblond, R.; Eccles, T.; Keeling, J.; Gimeno, F.; Dal Lago, A.; et al. Competition-level code generation with AlphaCode. Science 2022, 378, 1092–1097. [Google Scholar] [CrossRef]
- Lu, S.; Guo, D.; Ren, S.; Huang, J.; Svyatkovskiy, A.; Blanco, A.; Clement, C.; Drain, D.; Jiang, D.; Tang, D.; et al. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv 2021, arXiv:2102.04664. [Google Scholar] [CrossRef]
- Braconaro, E.; Losiouk, E. A Dataset for Evaluating LLMs Vulnerability Repair Performance in Android Applications: Data/Toolset paper. In Proceedings of the Fifteenth ACM Conference on Data and Application Security and Privacy, Pittsburgh, PA, USA, 4–6 June 2025; pp. 353–358. [Google Scholar]
- Hagel, N.; Hili, N.; Bartel, A.; Koziolek, A. Towards LLM-Powered Consistency in Model-Based Low-Code Platforms. In Proceedings of the 2025 IEEE 22nd International Conference on Software Architecture Companion (ICSA-C), Odense, Denmark, 31 March–4 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 364–369. [Google Scholar]
- Muttillo, V.; Di Sipio, C.; Rubei, R.; Berardinelli, L.; Dehghani, M. Towards synthetic trace generation of modeling operations using in-context learning approach. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 619–630. [Google Scholar]
- van Can, A.T.; Dalpiaz, F. Locating requirements in backlog items: Content analysis and experiments with large language models. Inf. Softw. Technol. 2025, 179, 107644. [Google Scholar] [CrossRef]
- Hassine, J. An LLM-based approach to recover traceability links between security requirements and goal models. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, Salerno, Italy, 18–21 June 2024; pp. 643–651. [Google Scholar]
- Anandayuvaraj, D.; Campbell, M.; Tewari, A.; Davis, J.C. FAIL: Analyzing software failures from the news using LLMs. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 506–518. [Google Scholar]
- Chomątek, Ł.; Papuga, J.; Nowak, P.; Poniszewska-Marańda, A. Decoding CI/CD Practices in Open-Source Projects with LLM Insights. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, Trondheim, Norway, 23–28 June 2025; pp. 1638–1644. [Google Scholar]
- Ling, Y.; Yu, S.; Fang, C.; Pan, G.; Wang, J.; Liu, J. Redefining crowdsourced test report prioritization: An innovative approach with large language model. Inf. Softw. Technol. 2025, 179, 107629. [Google Scholar] [CrossRef]
- Almatrafi, A.A.; Eassa, F.A.; Sharaf, S.A. Code clone detection techniques based on large language models. IEEE Access 2025, 13, 46136–46146. [Google Scholar] [CrossRef]
- Nashaat, M.; Amin, R.; Eid, A.H.; Abdel-Kader, R.F. An enhanced transformer-based framework for interpretable code clone detection. J. Syst. Softw. 2025, 222, 112347. [Google Scholar] [CrossRef]
- Mandli, A.R.; Rajput, S.; Sharma, T. COMET: Generating commit messages using delta graph context representation. J. Syst. Softw. 2025, 222, 112307. [Google Scholar] [CrossRef]
- Kumar, A.; Sankar, S.; Das, P.P.; Chakrabarti, P.P. Using Large Language Models for multi-level commit message generation for large diffs. Inf. Softw. Technol. 2025, 187, 107831. [Google Scholar] [CrossRef]
- Vijayaraghavan, P.; Nitsure, A.; Mackin, C.; Shi, L.; Ambrogio, S.; Haran, A.; Paruthi, V.; Elzein, A.; Coops, D.; Beymer, D.; et al. Chain-of-Descriptions: Improving code LLMs for VHDL code generation and summarization. In Proceedings of the 2024 ACM/IEEE International Symposium on Machine Learning for CAD, Salt Lake City, UT, USA, 9–11 September 2024; pp. 1–10. [Google Scholar]
- Qayyum, K.; Jha, C.K.; Ahmadi-Pour, S.; Hassan, M.; Drechsler, R. LLM-assisted Bug Identification and Correction for Verilog HDL. ACM Trans. Des. Autom. Electron. Syst. 2025, 30, 101. [Google Scholar] [CrossRef]
- Gramacki, P.; Martins, B.; Szymański, P. Evaluation of code llms on geospatial code generation. In Proceedings of the 7th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery, Atlanta, GA, USA, 29 October–1 November 2024; pp. 54–62. [Google Scholar]
- Koziolek, H.; Koziolek, A. LLM-based control code generation using image recognition. In Proceedings of the 1st International Workshop on Large Language Models for Code, Lisbon, Portugal, 20 April 2024; pp. 38–45. [Google Scholar]
- Ko, E.; Kang, P. Evaluating Coding Proficiency of Large Language Models: An Investigation Through Machine Learning Problems. IEEE Access 2025, 13, 52925–52938. [Google Scholar] [CrossRef]
- Bhatia, S.; Gandhi, T.; Kumar, D.; Jalote, P. Unit test generation using generative AI: A comparative performance analysis of autogeneration tools. In Proceedings of the 1st International Workshop on Large Language Models for Code, Lisbon, Portugal, 20 April 2024; pp. 54–61. [Google Scholar]
- Mathews, N.S.; Nagappan, M. Test-driven development and LLM-based code generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1583–1594. [Google Scholar]
- Takerngsaksiri, W.; Charakorn, R.; Tantithamthavorn, C.; Li, Y.F. PyTester: Deep reinforcement learning for text-to-testcase generation. J. Syst. Softw. 2025, 224, 112381. [Google Scholar] [CrossRef]
- Lops, A.; Narducci, F.; Ragone, A.; Trizio, M. AgoneTest: Automated creation and assessment of Unit tests leveraging Large Language Models. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 2440–2441. [Google Scholar]
- Yang, L.; Yang, C.; Gao, S.; Wang, W.; Wang, B.; Zhu, Q.; Chu, X.; Zhou, J.; Liang, G.; Wang, Q.; et al. On the evaluation of large language models in unit test generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1607–1619. [Google Scholar]
- Wu, H.W.; Lee, S.J. Can Large Language Model Aid in Generating Properties for UPPAAL Timed Automata? A Case Study. In Proceedings of the 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), Osaka, Japan, 2–4 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 2248–2253. [Google Scholar]
- Ma, L.; Liu, S.; Li, Y.; Xie, X.; Bu, L. SpecGen: Automated generation of formal program specifications via large language models. arXiv 2024, arXiv:2401.08807. [Google Scholar] [CrossRef]
- Yang, A.Z.; Le Goues, C.; Martins, R.; Hellendoorn, V. Large language models for test-free fault localization. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, Lisbon, Portugal, 14–20 April 2024; pp. 1–12. [Google Scholar]
- Ji, S.; Lee, S.; Lee, C.; Han, Y.S.; Im, H. Impact of Large Language Models of Code on Fault Localization. In Proceedings of the 2025 IEEE Conference on Software Testing, Verification and Validation (ICST), Naples, Italy, 31 March–4 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 302–313. [Google Scholar]
- Ji, Z.; Ma, P.; Li, Z.; Wang, Z.; Wang, S. Causality-Aided Evaluation and Explanation of Large Language Model-Based Code Generation. Proc. ACM Softw. Eng. 2025, 2, 1374–1397. [Google Scholar] [CrossRef]
- Kang, S.; An, G.; Yoo, S. A quantitative and qualitative evaluation of LLM-based explainable fault localization. Proc. ACM Softw. Eng. 2024, 1, 1424–1446. [Google Scholar] [CrossRef]
- Etemadi, K.; Mohammadi, B.; Su, Z.; Monperrus, M. Mokav: Execution-driven differential testing with llms. J. Syst. Softw. 2025, 230, 112571. [Google Scholar] [CrossRef]
- Feng, S.; Lu, H.; Jiang, J.; Xiong, T.; Huang, L.; Liang, Y.; Li, X.; Deng, Y.; Aleti, A. Enabling Cost-Effective UI Automation Testing with Retrieval-Based LLMs: A Case Study in WeChat. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1973–1978. [Google Scholar]
- Xue, Z.; Li, L.; Tian, S.; Chen, X.; Li, P.; Chen, L.; Jiang, T.; Zhang, M. LLM4Fin: Fully automating LLM-powered test case generation for fintech software acceptance testing. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, Vienna, Austria, 16–20 September 2024; pp. 1643–1655. [Google Scholar]
- Patel, S.; Yadavally, A.; Dhulipala, H.; Nguyen, T. Planning a Large Language Model for Static Detection of Runtime Errors in Code Snippets. In Proceedings of the 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), Ottawa, ON, Canada, 27 April–3 May 2025; IEEE Computer Society: Washington, DC, USA, 2025; p. 639. [Google Scholar]
- Rong, G.; Yu, Y.; Liu, S.; Tan, X.; Zhang, T.; Shen, H.; Hu, J. Code Comment Inconsistency Detection and Rectification Using a Large Language Model. In Proceedings of the 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), Ottawa, ON, Canada, 27 April–3 May 2025; IEEE Computer Society: Washington, DC, USA, 2025; pp. 432–443. [Google Scholar]
- Wen, C.; Cai, Y.; Zhang, B.; Su, J.; Xu, Z.; Liu, D.; Qin, S.; Ming, Z.; Cong, T. Automatically inspecting thousands of static bug warnings with large language model: How far are we? ACM Trans. Knowl. Discov. Data 2024, 18, 168. [Google Scholar] [CrossRef]
- Cheng, B.; Zhang, C.; Wang, K.; Shi, L.; Liu, Y.; Wang, H.; Guo, Y.; Li, D.; Chen, X. Semantic-enhanced indirect call analysis with large language models. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 430–442. [Google Scholar]
- Wu, C.; Chen, J.; Wang, Z.; Liang, R.; Du, R. Semantic sleuth: Identifying Ponzi contracts via large language models. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 582–593. [Google Scholar]
- Jiang, Z.; Wen, M.; Cao, J.; Shi, X.; Jin, H. Towards understanding the effectiveness of large language models on directed test input generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1408–1420. [Google Scholar]
- Ferrag, M.A.; Battah, A.; Tihanyi, N.; Jain, R.; Maimuţ, D.; Alwahedi, F.; Lestable, T.; Thandi, N.S.; Mechri, A.; Debbah, M.; et al. SecureFalcon: Are we there yet in automated software vulnerability detection with LLMs? IEEE Trans. Softw. Eng. 2025, 51, 1248–1265. [Google Scholar] [CrossRef]
- Yang, S.; Lin, X.; Chen, J.; Zhong, Q.; Xiao, L.; Huang, R.; Wang, Y.; Zheng, Z. Hyperion: Unveiling DApp inconsistencies using LLM and dataflow-guided symbolic execution. arXiv 2024, arXiv:2408.06037. [Google Scholar] [CrossRef]
- Wu, Y.; Xie, X.; Peng, C.; Liu, D.; Wu, H.; Fan, M.; Liu, T.; Wang, H. AdvSCanner: Generating adversarial smart contracts to exploit reentrancy vulnerabilities using LLM and static analysis. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1019–1031. [Google Scholar]
- Wang, C.; Zhang, J.; Gao, J.; Xia, L.; Guan, Z.; Chen, Z. ContractTinker: LLM-empowered vulnerability repair for real-world smart contracts. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 2350–2353. [Google Scholar]
- Acharya, J.; Ginde, G. Graph neural network vs. large language model: A comparative analysis for bug report priority and severity prediction. In Proceedings of the 20th International Conference on Predictive Models and Data Analytics in Software Engineering, Porto de Galinhas, Brazil, 16 July 2024; pp. 2–11. [Google Scholar]
- Ramler, R.; Straubinger, P.; Plösch, R.; Winkler, D. Unit Testing Past vs. Present: Examining LLMs’ Impact on Defect Detection and Efficiency. arXiv 2025, arXiv:2502.09801. [Google Scholar] [CrossRef]
- Al-Turki, D.; Hettiarachchi, H.; Gaber, M.M.; Abdelsamea, M.M.; Basurra, S.; Iranmanesh, S.; Saadany, H.; Vakaj, E. Human-in-the-Loop learning with LLMs for efficient RASE tagging in building compliance regulations. IEEE Access 2024, 12, 185291–185306. [Google Scholar] [CrossRef]
- Zamfirescu-Pereira, J.; Jun, E.; Terry, M.; Yang, Q.; Hartmann, B. Beyond code generation: LLM-supported exploration of the program design space. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 26 April–1 May 2025; pp. 1–17. [Google Scholar]
- Sun, Y.; Hao, B.; Wang, X.; Zhao, C.; Zhao, Y.; Shi, B.; Zhang, S.; Ge, Q.; Li, W.; Wei, H.; et al. LLM-Augmented Ticket Aggregation for Low-cost Mobile OS Defect Resolution. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, Trondheim, Norway, 23–28 June 2025; pp. 215–226. [Google Scholar]
- Kang, S.; Yoon, J.; Yoo, S. Large language models are few-shot testers: Exploring LLM-based general bug reproduction. In Proceedings of the 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 14–20 May 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2312–2323. [Google Scholar]
- Kang, S.; Chen, B.; Yoo, S.; Lou, J.G. Explainable automated debugging via large language model-driven scientific debugging. Empir. Softw. Eng. 2024, 30, 45. [Google Scholar] [CrossRef]
- Zhang, Y.; Liu, Z.; Feng, Y.; Xu, B. Leveraging large language model to assist detecting Rust code comment inconsistency. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 356–366. [Google Scholar]
- Mandal, U.; Shukla, S.; Rastogi, A.; Bhattacharya, S.; Mukhopadhyay, D. μLAM: A LLM-Powered Assistant for Real-Time Micro-architectural Attack Detection and Mitigation. In Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, New York, NY, USA, 1 October–1 November 2024; pp. 1–9. [Google Scholar]
- Xia, Y.; Shao, H.; Deng, X. VulCoBERT: A CodeBERT-based system for source code vulnerability detection. In Proceedings of the 2024 International Conference on Generative Artificial Intelligence and Information Security, Kuala Lumpur, Malaysia, 10–12 May 2024; pp. 249–252. [Google Scholar]
- Zhao, Y.; Gong, L.; Huang, Z.; Wang, Y.; Wei, M.; Wu, F. Coding-PTMs: How to Find Optimal Code Pre-trained Models for Code Embedding in Vulnerability Detection? In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; ACM: New York, NY, USA, 2024; pp. 1732–1744. [Google Scholar] [CrossRef]
- Gao, C.; Chen, X.; Zhang, G. SVA-ICL: Improving LLM-based software vulnerability assessment via in-context learning and information fusion. Inf. Softw. Technol. 2025, 186, 107803. [Google Scholar] [CrossRef]
- Yang, X.; Zhu, W.; Pacheco, M.; Zhou, J.; Wang, S.; Hu, X.; Liu, K. Code Change Intention, Development Artifact, and History Vulnerability: Putting Them Together for Vulnerability Fix Detection by LLM. Proc. ACM Softw. Eng. 2025, 2, 489–510. [Google Scholar] [CrossRef]
- Nangia, A.; Ayachitula, S.; Kundu, C. In-Context Vulnerability Propagation in LLMs [Work In Progress Paper]. In Proceedings of the 30th ACM Symposium on Access Control Models and Technologies, Stony Brook, NY, USA, 8–10 July 2025; pp. 169–174. [Google Scholar]
- Wu, Y.; Wen, M.; Yu, Z.; Guo, X.; Jin, H. Effective vulnerable function identification based on cve description empowered by large language models. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 393–405. [Google Scholar]
- Aljedaani, W.; Eler, M.M.; Parthasarathy, P. Enhancing accessibility in software engineering projects with large language models (LLMs). In Proceedings of the 56th ACM Technical Symposium on Computer Science Education, Pittsburgh, PA, USA, 26 February–1 March 2025; Volume 1, pp. 25–31. [Google Scholar]
- Oertel, J.; Klünder, J.; Hebig, R. Don’t settle for the first! How many GitHub Copilot solutions should you check? Inf. Softw. Technol. 2025, 183, 107737. [Google Scholar] [CrossRef]
- Yang, R.; Fu, M.; Tantithamthavorn, C.; Arora, C.; Vandenhurk, L.; Chua, J. RAGVA: Engineering retrieval augmented generation-based virtual assistants in practice. arXiv 2025, arXiv:2502.14930. [Google Scholar] [CrossRef]
- Kulsum, U.; Zhu, H.; Xu, B.; d’Amorim, M. A case study of LLM for automated vulnerability repair: Assessing impact of reasoning and patch validation feedback. In Proceedings of the 1st ACM International Conference on AI-Powered Software, Porto de Galinhas, Brazil, 15–16 July 2024; pp. 103–111. [Google Scholar]
- Yadav, D.; Mondal, S. Evaluating Pre-trained Large Language Models on Zero Shot Prompts for Parallelization of Source Code. J. Syst. Softw. 2025, 230, 112543. [Google Scholar] [CrossRef]
- Dong, J.; Sun, J.; Zhang, W.; Dong, J.S.; Hao, D. ConTested: Consistency-Aided Tested Code Generation with LLM. Proc. ACM Softw. Eng. 2025, 2, 596–617. [Google Scholar] [CrossRef]
- Evtikhiev, M.; Bogomolov, E.; Sokolov, Y.; Bryksin, T. Out of the BLEU: How should we assess quality of the Code Generation models? J. Syst. Softw. 2023, 203, 111741. [Google Scholar] [CrossRef]
- Mansur, E.; Chen, J.; Raza, M.A.; Wardat, M. RAGFix: Enhancing LLM Code Repair Using RAG and Stack Overflow Posts. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 7491–7496. [Google Scholar]
- Tomic, S.; Alégroth, E.; Isaac, M. Evaluation of the Choice of LLM in a Multi-agent Solution for GUI-Test Generation. In Proceedings of the 2025 IEEE Conference on Software Testing, Verification and Validation (ICST), Naples, Italy, 31 March–4 April 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 487–497. [Google Scholar]
- Chapman, P.J.; Rubio-González, C.; Thakur, A.V. Interleaving static analysis and LLM prompting. In Proceedings of the 13th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis, Copenhagen, Denmark, 25 June 2024; pp. 9–17. [Google Scholar]
- Zhou, X.; Zhang, T.; Lo, D. Large language model for vulnerability detection: Emerging results and future directions. In Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results, Lisbon, Portugal, 14–20 April 2024; pp. 47–51. [Google Scholar]
- Sergeyuk, A.; Golubev, Y.; Bryksin, T.; Ahmed, I. Using AI-based coding assistants in practice: State of affairs, perceptions, and ways forward. Inf. Softw. Technol. 2025, 178, 107610. [Google Scholar] [CrossRef]
- Li, J.; Liu, S.; Jin, Z. Automated formal-specification-to-code trace links recovery using multi-dimensional similarity measures. J. Syst. Softw. 2025, 226, 112439. [Google Scholar] [CrossRef]
- Lai, C.; Zhou, Z.; Poptani, A.; Zhang, W. LCM: LLM-focused hybrid SPM-cache architecture with cache management for multi-core AI accelerators. In Proceedings of the 38th ACM International Conference on Supercomputing, Kyoto, Japan, 4–7 June 2024; pp. 62–73. [Google Scholar]
- Choudhuri, R.; Liu, D.; Steinmacher, I.; Gerosa, M.; Sarma, A. How far are we? The triumphs and trials of generative AI in learning software engineering. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, Portugal, 14–20 April 2024; pp. 1–13. [Google Scholar]
- Nguyen, P.T.; Di Rocco, J.; Di Sipio, C.; Rubei, R.; Di Ruscio, D.; Di Penta, M. GPTSniffer: A CodeBERT-based classifier to detect source code written by ChatGPT. J. Syst. Softw. 2024, 214, 112059. [Google Scholar] [CrossRef]
- Pedroso, D.F.; Almeida, L.; Pulcinelli, L.E.G.; Aisawa, W.A.A.; Dutra, I.; Bruschi, S.M. Anomaly Detection and Root Cause Analysis in Cloud-Native Environments Using Large Language Models and Bayesian Networks. IEEE Access 2025, 13, 77550–77564. [Google Scholar] [CrossRef]
- Han, Y.; Du, Q.; Huang, Y.; Wu, J.; Tian, F.; He, C. The potential of one-shot failure root cause analysis: Collaboration of the large language model and small classifier. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 931–943. [Google Scholar]
- North, M.; Atapour-Abarghouei, A.; Bencomo, N. Code gradients: Towards automated traceability of llm-generated code. In Proceedings of the 2024 IEEE 32nd International Requirements Engineering Conference (RE), Reykjavik, Iceland, 24–28 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 321–329. [Google Scholar]
- Ali, M.; Giallousi, N.; Melidis, A.; Alexopoulos, C.; Charalabidis, Y. GlossAPI: Architecturing the Greek Data Pile for LLM development. In Proceedings of the 28th Pan-Hellenic Conference on Progress in Computing and Informatics, Athens, Greece, 13–15 December 2024; pp. 16–25. [Google Scholar]
- Xu, Z.; Kong, D.; Liu, J.; Li, J.; Hou, J.; Dai, X.; Li, C.; Wei, S.; Hu, Y.; Yin, S. WSC-LLM: Efficient LLM Service and Architecture Co-exploration for Wafer-scale Chips. In Proceedings of the 52nd Annual International Symposium on Computer Architecture, Tokyo, Japan, 21–25 June 2025; pp. 1–17. [Google Scholar]


| SE Phase | Common FM Task | Example Studies and Applications |
|---|---|---|
| Requirements | Requirement extraction, summarization, and translation | FMs support writing and checking requirement documents, tracing links, and translating user stories across languages or project teams [21,23,27,30,101,102,103,104]. |
| Design/Architecture | Decision support, rationale explanation | Models help designers summarize design decisions, compare options, and explain architectural choices [3,32,33,105,106]. |
| Implementation/Coding | Code generation, completion, and refactoring | FMs produce working code, translate between languages such as C++ and JavaScript, and suggest small fixes or cleanups [6,35,40,107,108,109,110,111,112]. |
| Testing/QA | Test generation, bug detection, log summarization | LLMs create unit tests, locate bugs, and summarize failure reports for easier debugging [54,88,89,113,114,115,116,117,118,119,120,121,122,123,124]. |
| Maintenance/Evolution | Program repair, defect classification, refactoring | FMs suggest patches for faulty code, group related bug reports, and classify types of technical debt [41,53,125,126,127,128,129,130]. |
| Project/Process Management | Workflow support, prioritization, and coordination | Multi-agent LLM systems are used for agile planning, task triage, and summarizing project updates [131,132,133,134]. |
| Other/Cross-cutting | Summarization and linking across artifacts | FMs connect related items, such as linking requirements to commits or summarizing large logs [19,20,32,135,136]. |
| Source | Total Records | Included (%) | Cumulative Retained |
|---|---|---|---|
| IEEE Xplore | 185 | 102 (55.1%) | 102 |
| ACM Digital Library | 112 | 97 (86.6%) | 197 |
| ScienceDirect | 223 | 65 (29.1%) | 262 |
| SpringerLink | 15 | 14 (93.3%) | 276 |
| Total (before deduplication) | 535 | 278 | – |
| Total (after deduplication) | – | 276 | 276 |
| Exclusion Reason |
|---|
| Not a software engineering task or artifact |
| No empirical evaluation or validation |
| Focuses on model development only (no SE task) |
| Unclear or duplicated study |
| Tool | Description/Use | Ref. |
|---|---|---|
| Codex | Basis of Copilot; code and test generation. | [6] |
| StarCoder | Open BigCode model for code tasks. | [8] |
| GPT-4 | Proprietary LLM; broad SE applications. | [7] |
| CodeT5+ | Transformer for code intelligence tasks. | [10] |
| RepairAgent | Autonomous LLM-based program repair. | [151] |
| Code Llama | Open LLaMA2-based family for code. | [9] |
| AlphaCode | Competitive programming system. | [152] |
| Dataset/Benchmark | Focus/Usage | Ref. |
|---|---|---|
| HumanEval | Python code generation benchmark. | [6] |
| BigCodeBench | Large-scale eval suite for StarCoder. | [8] |
| Defects4J | Program repair and fault localization. | [10] |
| TFix | Code fix dataset (JavaScript). | [10] |
| ConDefects | Leakage-aware repair/localization data. | [96] |
| MBPP | 974 curated Python problems. | [4] |
| CodeXGLUE | Multi-task benchmark suite. | [153] |
| CodeContests | Competitive programming problems. | [152] |
| Android vuln. repair | Android security repair (Java/XML) with human-validated fixes. | [154] |
| Primary SE Phase × FM Capability | Arch/Design | Bug/Defect | CodeGen | Summ. | Transl. | Repair | TestGen | Reqts | Other | Row Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Design/Arch | 3 | 0 | 2 | 0 | 1 | 0 | 0 | 2 | 4 | 12 |
| Impl/Coding | 0 | 1 | 43 | 5 | 4 | 0 | 0 | 0 | 5 | 58 |
| Maint/Evol | 0 | 10 | 0 | 8 | 0 | 17 | 0 | 0 | 9 | 44 |
| Other | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 9 | 13 |
| Process Mgmt | 1 | 0 | 1 | 2 | 0 | 0 | 0 | 1 | 9 | 14 |
| Requirements | 0 | 0 | 1 | 3 | 0 | 0 | 0 | 14 | 1 | 19 |
| Testing/QA | 0 | 28 | 6 | 2 | 1 | 2 | 19 | 0 | 6 | 64 |
| Column Total | 4 | 39 | 57 | 20 | 6 | 19 | 19 | 17 | 43 | 224 |
| Task | FM Capability | Refs. | Key Strength | Key Challenge |
|---|---|---|---|---|
| Unit test gen. | Code/test generation | [75,76,88,171,172,173,174,175] | Higher line/branch coverage; strong pass rates | Prompt/seed variance; coverage vs. correctness |
| Property/oracle gen. | Spec drafting & repair | [176,177] | Automates formal specs; augments human oracles | Oracle cost; verifier latency |
| Fault localization | Explanation/ranking | [178,179,180,181] | Beats ML baselines; works without tests | Stability; leakage across datasets |
| Differential testing | Behavioral comparison | [182] | Iterative execution feedback; many difference-exposing tests | Runtime harness cost; flaky diffs |
| UI/acceptance | Planning + RAG | [183,184] | Cost-effective automation; high scenario/code coverage | Grounding; latency; cost |
| Static + semantics | Summarization | [185,186,187,188,189,190] | Handles partial context; strengthens static triage | Indirect calls; partial context |
| Security | Detection | [114,191,192,193,194] | Line-level vuln. detection; interpretable explanations | Precision/recall vs. deployability |
| Human factors | Assistance & triage | [195,196,197,198,199] | Improves developer productivity and triage confidence | Trust; false positives; workflow fit |
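Several of the strengths and challenges above hinge on execution feedback: candidate tests are actually run, and failure information is fed back to the model to stabilize its output. The following minimal sketch illustrates that control loop. It is an assumption-laden illustration, not any surveyed tool: `propose_tests` is a hypothetical stand-in for a real model call, and the retry budget and feedback format are invented for the example.

```python
# Sketch of an execution-in-the-loop test-generation controller.
# ASSUMPTIONS: `propose_tests` fakes a model that first emits a wrong
# oracle and then "repairs" it after feedback; real systems would call
# an LLM API here and parse a full test file, not a single assertion.

def add(a, b):
    """Toy function under test."""
    return a + b

def propose_tests(feedback=None):
    # Hypothetical model: the first draft has a wrong expected value;
    # given failure feedback, it proposes a corrected oracle.
    if feedback is None:
        return "assert add(2, 2) == 5"
    return "assert add(2, 2) == 4"

def run_candidate(test_code, env):
    """Execute a candidate test; return (passed, feedback)."""
    try:
        exec(test_code, env)
        return True, None
    except AssertionError:
        return False, f"assertion failed: {test_code}"
    except Exception as e:
        return False, repr(e)

def generate_tests(max_rounds=3):
    """Propose-execute-refine loop; keep only tests that run cleanly."""
    env = {"add": add}
    feedback = None
    for _ in range(max_rounds):
        candidate = propose_tests(feedback)
        ok, feedback = run_candidate(candidate, env)
        if ok:
            return candidate
    return None  # budget exhausted; surface nothing rather than a flaky test
```

Discarding candidates that fail to execute is what trades raw generation volume for the "coverage vs. correctness" balance noted in the table: the loop filters hallucinated oracles at the cost of extra runtime.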
| Challenge (C) | Corresponding Opportunity (O) | References |
|---|---|---|
| C1—Prompt/seed/model variance | O1—Add structure (method slicing, structured seeds) and RAG; O2—Use execution/verification loops to stabilize outputs | [54,55,64,75,76,88,182] |
| C2—Oracle construction & verifier latency | O2—Verifier-/execution-in-the-loop controllers; cached checks; selective verification | [55,176,177,182,215] |
| C3—Data leakage & comparability | O4—Time-sliced/complementary corpora; full prompt/seed reporting; multi-signal metrics | [19,20,49,96,216] |
| C4—Grounding/scale for UI & acceptance | O1—RAG over screens/DOM and business rules; cost controllers; process artifacts | [183,184,217,218] |
| C5—Static semantic gaps (indirect calls, partial code) | O2—CFG planners and semantic summaries; integrate with static analyses | [185,186,188,219] |
| C6—Security deployability (accuracy vs. latency) | O3—Compact/task-adapted models; hybrid LLM + analysis pipelines; on-prem CPU paths | [191,192,193,220] |
| C7—Integration & developer trust (IDE/CI, false positives) | O5—Human–AI collaboration patterns: explain–edit–enforce; rationale summaries; CI risk gates | [196,198,211,221] |
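A concrete instance of O4's leakage mitigation for C3 is a time-sliced split: evaluate only on artifacts created after the model's assumed training cutoff, so "solved" bugs cannot simply have been memorized. The sketch below shows the idea on toy records; the cutoff date, record shape, and bug IDs are illustrative assumptions, not values from any surveyed corpus.

```python
# Sketch of a time-sliced corpus split for leakage-aware evaluation.
# ASSUMPTIONS: records are (bug_id, fix_date) pairs and the cutoff is
# a guessed training-data cutoff; real studies use full metadata and a
# documented, model-specific cutoff.
from datetime import date

CUTOFF = date(2023, 9, 1)  # assumed model training-data cutoff

def time_sliced_split(records, cutoff=CUTOFF):
    """Partition records into leak-prone (pre-cutoff) and clean (post-cutoff) sets."""
    pre = [r for r in records if r[1] < cutoff]
    post = [r for r in records if r[1] >= cutoff]
    return pre, post

records = [
    ("BUG-101", date(2021, 5, 3)),   # predates cutoff: may be in training data
    ("BUG-207", date(2024, 1, 12)),  # after cutoff: safe for evaluation
    ("BUG-318", date(2023, 9, 1)),   # boundary case: kept in the clean set
]
pre, post = time_sliced_split(records)
```

Reporting results separately for the two slices (alongside full prompt/seed disclosure) makes cross-study comparisons meaningful even when training corpora differ.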
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Banitaan, S.; Daoud, M.; Alquran, H.; Akour, M. Foundation Models in Software Engineering: A Taxonomy, Systematic Review, and In-Depth Analysis of Testing Support. Information 2026, 17, 73. https://doi.org/10.3390/info17010073

