Review

A Review of Large Language Models for Automated Test Case Generation

Department of Electrical, Computer and Software Engineering, Ontario Tech University, Oshawa, ON L1G 0C5, Canada
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2025, 7(3), 97; https://doi.org/10.3390/make7030097
Submission received: 13 June 2025 / Revised: 13 July 2025 / Accepted: 4 September 2025 / Published: 9 September 2025

Abstract

Automated test case generation aims to improve software testing by reducing the manual effort required to create test cases. Recent advancements in large language models (LLMs), with their ability to understand natural language and generate code, have opened new opportunities to enhance this process. This review focuses on the use of LLMs in test case generation, assessing the effectiveness of the proposed methods compared with existing tools and identifying potential directions for future research. A literature search was conducted using online resources, and studies were filtered based on defined inclusion and exclusion criteria. The findings from the selected studies are presented according to three research questions and further categorized by common themes. These findings highlight the opportunities and challenges associated with the use of LLMs in this domain. Although improvements were observed in metrics such as test coverage, usability, and correctness, limitations such as inconsistent performance and compilation errors were also noted. This paper provides a state-of-the-art review of LLM-based test case generation, emphasizing the potential of LLMs to improve automated testing while identifying areas for further advancement.


1. Introduction

Software testing is a process that focuses on evaluating the implementation of a system to identify potential failures and ensure reliability [1]. Testing has been a critical activity since the early days of computer systems, directly influencing the quality and dependability of software products and services [2]. With technological advancements driving the complexity of software systems across various domains, the need for robust testing has become even more evident [3]. Although testing is the most costly and time-consuming phase of software development [2], insufficient testing significantly increases the risk of producing low-quality software, leading to financial losses, negative user experiences, and reputational damage to organizations [4]. To address these challenges, comprehensive testing approaches, including unit, integration, and end-to-end testing, have been developed to ensure that modern software systems satisfy their requirements and remain defect-free [5]. In response to the need for greater efficiency, automated testing has garnered significant interest from the research community [6] and the industry [7]. With the widespread adoption of agile methodologies, automated unit testing has become standard practice among developers [7]. Furthermore, advancements in Generative AI are expected to enhance and evolve automated testing processes, offering new opportunities to improve software quality and streamline development workflows [6].
Large language models (LLMs) have sparked significant interest in recent years and have been seen as promising coding assistants, especially in automated test case generation, because of their ability to understand natural language requirements and generate relevant code [8,9]. With their continuous evolution, LLMs have demonstrated exceptional capabilities in generating functional code from natural language descriptions, significantly enhancing software development efficiency by automating tasks, reducing human error, and allowing developers to focus on more complex and creative programming aspects [10].
This paper presents a review of the use of LLMs in automated test case generation. The main contributions of this study are as follows.
  • Analysis of proposed methods: A categorization of methods for using LLMs in test case generation is provided. These methods are classified into prompt design and engineering, feedback-driven approaches, model fine-tuning and pre-training, and hybrid approaches, which offer insights into their strengths and limitations.
  • Evaluation of effectiveness: An assessment of the effectiveness of LLMs relative to existing testing tools, emphasizing the instances where LLMs excel and where they fall short.
  • Future research directions: Identification of key areas for improvement, including expanding the applicability of LLMs to multiple programming languages, integrating domain-specific knowledge, and leveraging hybrid techniques to address current challenges.
The remainder of this paper is organized as follows. Section 2 provides the necessary background for software testing and LLMs. Section 3 describes the review process used in this study. Section 4 presents the results of the research questions and discusses the limitations of existing methods and opportunities for future research. Finally, Section 5 concludes the paper.

2. Background

2.1. Large Language Models (LLMs)

LLMs are neural networks with a very large number of parameters, often numbering in the billions. They can be trained in a self-supervised manner, and their pre-training can draw on large corpora from the Web. Through their training methodologies, they learn complex patterns, language semantics, and nuances. These abilities make them strong candidates for tackling various language-related tasks such as translation, text summarization, and sentiment analysis. Importantly, they can be fine-tuned for specialized tasks, a key feature that allows them to achieve improved results [11].
LLMs have significantly impacted the field of language modeling. They evolved from early language models and neural networks. In the past, statistical approaches and n-gram models were used, but they struggled to capture long-term dependencies and context in language. Advancements in neural networks then led to the development of recurrent neural networks (RNNs), which can model sequential data. However, RNNs have limitations owing to vanishing gradients and difficulty capturing long-term dependencies. A more recent breakthrough was marked by the development of the Transformer architecture, which efficiently manages long-range dependencies and captures word relationships through a self-attention mechanism [11]. This development opened the door to the generative pre-trained transformer (GPT) model and attracted significant interest from industry, academia, and the general public [12].
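For reference, the self-attention mechanism relates every token to every other token through the standard scaled dot-product formulation from the original Transformer work: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where Q, K, and V are the query, key, and value matrices derived from the input representations and d_k is the key dimension. This formulation is general background and is not tied to any particular study reviewed here.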
Although LLMs have been widely applied to and found to be compelling in traditional language tasks, their application in software engineering has also gained significant attention. LLMs have been explored and studied at all stages of the software development lifecycle, including requirements engineering, design, development, testing, and maintenance [12].

2.2. Software Testing

According to ISO/IEC/IEEE 24765:2017, Systems and software engineering—Vocabulary [13], a “test” is defined as “an activity in which a system is executed under specified conditions, the results are observed or recorded, and an evaluation is made of some aspect of the system or component.” With the rapid development of the software industry, the demand to improve this activity using effective techniques and reduce the risk of costly catastrophes that could harm the public has become a prominent focus [14].
Testing can be performed at different stages of a software product’s lifecycle and can address the entire system or part of it. Unit, integration, system, and regression testing are well-known testing strategies. Whichever strategy is employed at any given time, test cases will likely need to be generated, which is arguably one of the most labor-intensive software testing activities [15]. Consequently, test case generation has been extensively studied, leading to the development of various techniques and tools for the automatic or semi-automatic generation of test suites [16].
Test case generation is the process of producing a test suite from an information artifact using a generation mechanism that is typically guided by a test data adequacy criterion. After a test case is generated, it is executed on the system under test and evaluated using a test oracle to determine whether it passes or fails. An information artifact can take several forms, such as a formal specification, a design document, or source code, and serves as the source of meaningful test cases. The generation mechanism is an algorithm or strategy that produces test cases based on the information artifact. The test data adequacy criterion measures and guides the quality of the generated test cases. Finally, the test oracle is used to evaluate the correctness of the outcome of a given test execution on the system under test [16].
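To make these components concrete, the following minimal Python sketch pairs a hypothetical function under test (whose source code plays the role of the information artifact) with generated test cases whose assertions act as the test oracle; the function, tests, and names are purely illustrative and are not drawn from any reviewed study.

import pytest

# Hypothetical system under test; its source code serves as the information artifact.
def apply_discount(price: float, rate: float) -> float:
    """Return the price after applying a discount rate between 0 and 1."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    return round(price * (1.0 - rate), 2)

# Generated test cases: the assertions act as the test oracle, while an adequacy
# criterion such as branch coverage would guide how many such cases are needed.
def test_apply_discount_valid_rate():
    assert apply_discount(100.0, 0.25) == 75.0

def test_apply_discount_invalid_rate():
    with pytest.raises(ValueError):
        apply_discount(100.0, 1.5)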
Many studies have explored the intersection between LLMs and test case generation, where LLMs often serve as the primary generation mechanism. While this might appear to be a straightforward code generation task, in the context of software testing, generating a diverse set of meaningful test inputs for enhanced code coverage and the need to validate the generated test cases present unique characteristics, which have led to innovative hybrid approaches in which LLMs are combined with various testing tools and methodologies [9].

2.3. Applications of LLMs in Test Case Generation

The integration of LLMs into software quality assurance has expanded the possibilities for automating tasks, such as test generation, offering significant improvements in efficiency and reliability [17]. One of the main focuses of software testing research has been on automated test suite generation, which establishes valuable benchmarks for comparing novel LLM-based and hybrid techniques [12]. In the realm of test case generation, traditional methods such as search-based, constraint-based, and random-based strategies aim to maximize coverage but often produce tests lacking diversity and meaningfulness. LLMs, with their demonstrated success in code generation, offer a promising alternative by enabling the creation of diverse test cases, improving coverage, and fostering collaboration through natural language-based test case generation [9].
For example, the role of prompt design has been explored in studies such as [18], which investigated the capabilities of ChatGPT (GPT-3.5) in generating unit test suites using simple prompts. This study emphasizes the model’s out-of-the-box performance without incorporating feedback or adjustments, shedding light on its potential and limitations in test case coverage, correctness, and bug detection. Beyond prompt design, feedback-driven approaches, such as ChatUniTest [19], incorporate validation and repair, which enhance the quality of test cases by dynamically refining them based on detected errors.
Other approaches leverage model fine-tuning and pre-training to align LLMs with software testing requirements. One example is CAT-LM [20], a specialized LLM pretrained on Python and Java projects that captures the relationship between the code and its associated tests, which enables the generation of highly contextualized and precise test cases. Additionally, hybrid methods such as CODAMOSA [21] combine the strengths of LLMs and Search-Based Software Testing (SBST), using LLMs to generate test cases when tests generated by SBST reach a coverage plateau, which demonstrates how these two paradigms can work synergistically. These diverse strategies illustrate how LLMs are utilized to address the challenges in test case generation while also opening avenues for integrating them with existing tools and workflows.
Reflecting this growing momentum, survey efforts have also emerged to consolidate research in this area. Wang et al. [9] provide a comprehensive review of 102 studies covering a wide range of software testing tasks, from unit test generation to program repair. More recently, Zhang et al. [22] conducted a systematic review focused on unit testing, analyzing 105 studies up to March 2025 (arXiv:2506.15227). The emergence of such surveys underscores both the rapid growth of this domain and the increasing need for synthesized perspectives to guide future research. Collectively, these developments point to a field that is rapidly evolving but continues to face open challenges.
Despite their versatility and promise, the current research has several limitations. For instance, Yi et al. [23] found that ChatGPT-generated tests often exhibit improved readability but do not always surpass EvoSuite-generated tests in evaluation. In contrast, Elvira et al. [24] reported achieving 100% coverage using ChatGPT; however, the evaluation lacked comprehensiveness, making it difficult to generalize the findings across scenarios. Additionally, Chen et al. [25] and Yuan et al. [26] underscored concerns regarding computational costs and performance, which impact the practicality of using LLMs for test case generation. These challenges highlight the need for further refinement and tailored strategies, motivating a review to consolidate existing research, identify trends, and guide future work in this emerging field.

3. Methodology

To conduct and report this review, the guidelines for systematic literature reviews proposed by Kitchenham et al. [27] were followed. This framework ensures a comprehensive and structured approach for identifying and analyzing research on test case generation using LLMs. By applying this framework, diverse strategies, tools, and trends were investigated. In addition, this approach facilitates the identification of research gaps and opportunities for further exploration.

3.1. Research Questions

This review was structured around the following three research questions.
  • RQ-1: What methods have been proposed for using LLMs in automated test case generation?
  • RQ-2: How effective are LLMs in improving the quality and efficiency of test case generation compared to traditional methods?
  • RQ-3: What future directions have been identified for using LLMs in automated test case generation?

3.2. Information Sources

This review focuses primarily on studies that utilize LLMs to generate test cases. To identify the relevant studies at the intersection of these domains, experiments were conducted using various search strings. One of the primary sources, IEEE Xplore, provides a “command search” functionality that enables the combination of terms with logical operators such as AND and OR. Using this feature, the following search string was constructed: ((“test generation” OR “automated unit test generation” OR “automated test case generation” OR “software testing”) AND (“LLM” OR “large language models” OR “GPT” OR “ChatGPT” OR “T5”)).
The development of this search string was informed by experimental searches conducted in IEEE Xplore. These experimental searches involved testing a range of common terms, acronyms, and specific keywords related to LLMs and test case generation. This iterative process ensured that the search string comprehensively captured relevant literature without omitting key studies that might have employed alternative terminologies. This refinement enabled the systematic identification and review of studies pertinent to test case generation using LLMs, providing a robust foundation for the review.
Specific LLM names, such as GPT, ChatGPT, and T5, were included because experimentation revealed that some papers explicitly reference these terms in their titles. Without these inclusion criteria, key studies might have been overlooked. Quotation marks were deliberately used to ensure that searches returned exact matches for phrases such as “software testing,” thereby avoiding irrelevant results that merely contain the words “software” and “testing” separately.
In addition to the structured database search, 20 papers identified by Wang et al. [9] under the category of “Unit test case generation” were reviewed, and 14 were selected for inclusion in the primary studies due to their relevance. Several of these were sourced from arXiv and overlapped with results retrieved from IEEE Xplore. Studies identified through the main search string were excluded from this supplementary set to avoid duplication.
Litmaps was then used to expand the set of relevant studies. The initial 26 studies identified through IEEE Xplore and manual screening were imported into Litmaps. The “Explore Related Articles” feature was leveraged to discover additional papers based on citation relationships and thematic similarity. To refine the search further, certain articles were marked as “More Like This,” prompting Litmaps to prioritize similar studies. Each newly surfaced paper was evaluated using the same inclusion and exclusion criteria as those applied to the initial set. This process was repeated iteratively until no additional relevant studies were identified, resulting in a set of 76 primary studies. Litmaps was used to supplement, rather than replace, the structured search strategy to ensure transparency and methodological consistency.
To ensure additional coverage of recent work, a targeted supplemental search was conducted in ACM Digital Library and Scopus, restricted to papers published in 2025. The query used for ACM DL was (“llm” AND “test case”), returning 261 records, and the query for Scopus was “llm test case generation,” returning 77 records. Titles and abstracts were screened for relevance, followed by full-text review when appropriate. After applying the same inclusion and exclusion criteria as in the original workflow and removing duplicates, 8 unique studies were identified and added to the primary set. This brought the final number of included studies to 84.

3.3. Eligibility Criteria

The eligibility criteria in this review were established to select studies that corresponded to the research goals, incorporating both the inclusion and exclusion criteria.

3.3.1. Inclusion Criteria

  • Publication type: Accepted publication types included conference papers, journal articles, and preprints.
  • Language: Only studies written in English were considered.
  • Relevance: Studies needed to explicitly focus on the application of LLMs in test case generation.
  • Time Frame: The review was restricted to studies published between 2022 and 2025. One study from 2021 [28] was also included due to its particular relevance to the topic.

3.3.2. Exclusion Criteria

  • Irrelevance: Studies that did not focus primarily on test case generation using LLMs were excluded.

3.4. Data Collection Procedure

Following the execution of the defined search string, data filtering was applied to refine the search results. Only studies published between the years 2022 and 2025 were considered for inclusion. This time frame was selected because, while LLMs have been studied for some time, their popularity has surged following the release of ChatGPT in November 2022. This trend is reflected in the literature, as the majority of relevant papers were published within this period. To ensure the review captured the latest developments, studies published between January 2022 and July 2025 were included. An earlier study from 2021 by Tufano et al. [28] was also incorporated due to its significant relevance to the topic, despite falling outside the specified timeframe.
Applying the year filter reduced the number of results to 38. This set was further narrowed by reviewing the titles, abstracts, and key sections of each study to assess their relevance. Through this process, 12 studies that directly aligned with the inclusion criteria were identified and selected as primary studies.
Fourteen additional studies were then incorporated from the work of Wang et al. [9], who categorized 20 studies under the topic of “Unit test case generation.” Studies already identified in the IEEE Xplore search were excluded to avoid duplication.
To expand the initial set of 26 studies, Litmaps was used. The studies were imported into the platform, and its “Explore Related Articles” and “More Like This” features were employed to identify additional relevant papers based on citation and topical similarity. Each recommended study was evaluated against the predefined inclusion and exclusion criteria. This iterative process continued until no further eligible studies were identified. As a result, the total number of included primary studies expanded to 76.
To complement this set with the most recent work, a targeted supplemental search was conducted in the ACM Digital Library and Scopus databases, focusing exclusively on studies published in 2025. The query used for ACM was (“llm” AND “test case”), which returned 261 results. The query used for Scopus was “llm test case generation,” yielding 77 results. After applying the same inclusion and exclusion criteria, and removing duplicates, 8 additional studies were selected and incorporated into the final set.
The final result consisted of 84 primary studies. Figure 1 provides an overview of the literature search and inclusion process.

3.5. Results

The search across multiple databases led to the identification of 84 primary studies relevant to LLM-based test case generation. These studies have addressed various aspects of LLM utilization, including prompt engineering, iterative refinement, fine-tuning, and hybrid approaches. To enhance clarity, the selected studies were classified based on their methodological focus.
Figure 2 illustrates the distribution of the selected studies from 2021 to 2025, reflecting a noticeable surge in research activity following the release of ChatGPT. This visualization highlights the growing interest in LLM-based test generation and reveals the key research trends in the field.
Figure 3 categorizes the selected studies into four major groups: prompt engineering, feedback-driven approaches, model fine-tuning and pre-training, and hybrid approaches. This classification provides a structured perspective on the research landscape and enables more focused analysis using different methodological approaches.
  • Prompt Engineering: Studies in this category explore techniques for constructing and refining LLM prompts to improve test generation outcomes.
  • Feedback-driven Approaches: This category includes research that incorporates iterative interactions and mechanisms to further refine and enhance the relevance and quality of the generated test cases.
  • Model Fine-tuning and Pre-training: Studies in this group investigate fine-tuning and pre-training approaches aimed at optimizing LLM performance for test generation tasks.
  • Hybrid Approaches: Research in this category examines how LLM-based test generation can be combined with traditional testing tools and methodologies.
Figure 3. Categorization of LLM-based test case generation methods identified in primary studies.
Figure 4 shows the most commonly used datasets in the reviewed studies. Variants of the datasets (e.g., enhanced or language-specific versions of HumanEval) were grouped under a unified label for consistency. Proprietary and custom-built datasets, as well as those that appeared in fewer than four studies, were excluded to emphasize the benchmarks that are more widely adopted. Although not exhaustive, the chart highlights datasets that have become standard in training and evaluating LLM-based test generation approaches.
Figure 5 presents a condensed overview of the most frequently used models across the five major LLM families identified in the primary studies: GPT (OpenAI), LLaMA, DeepSeek, CodeGen, and Codex. To reduce the visual complexity and emphasize key trends, the chart displays the top three models from each family. OpenAI’s GPT family clearly dominates the distribution, with gpt-3.5-turbo being the most frequently used individual model, followed by GPT-4 and GPT-4o. The LLaMA family ranks second in frequency, with notable usage of both general-purpose LLaMA models and code-specialized variants, such as CodeLLaMA. DeepSeek follows with several of its Coder models included, while CodeGen and Codex appear with lower but comparable levels of usage. Among the Codex variants, only code-davinci-002 was included in the chart, as its frequency surpassed that of the other variants, including code-cushman-001, code-cushman-002, and code-davinci-001. Although other models, such as Gemini, Claude, BART, and T5, were also used in the studies, they were less frequently adopted.
It is important to note that the frequency of model usage does not imply a superior performance. For example, gpt-3.5-turbo appears more frequently than GPT-4o, but this may reflect its earlier availability and broader API access at the time many studies were conducted. As newer and more capable models become more widely accessible, the frequency distribution can be expected to shift in future research.

4. Findings

The application of LLMs to automated test case generation is an emerging area of research that leverages their advanced natural language processing capabilities to address challenges in traditional software testing methods. The reviewed studies highlight a variety of approaches, including prompt design, feedback-driven methods, fine-tuning and pre-training, and hybrid techniques that combine LLMs with established software testing tools. These methods aim to improve the efficiency, scalability, and quality of test generation across diverse scenarios, from fully automated unit test generation to LLMs that act as coding assistants.
While the potential of LLMs in this domain is evident, most studies have focused on specific use cases or benchmarks such as HumanEval, SF110, or Quixbugs, which may limit the generalizability of their findings across broader software engineering practices. Moreover, the evaluation criteria for LLM-generated tests, including metrics such as code coverage, mutation score, and compilability, vary across studies, posing challenges for consistent comparison. Prompts also differ significantly, with some studies extensively focusing on optimizing prompts for maximum performance, while others test the out-of-the-box capabilities of LLMs. Another common theme is the need to address computational inefficiencies, generalizability across tools and languages, and the handling of complex program logic or large codebases.
This section synthesizes key insights from the reviewed studies, addressing the methods proposed for using LLMs in test case generation (RQ-1), their effectiveness compared to existing testing techniques (RQ-2), and future directions for advancing the integration and application of LLMs in automated software testing (RQ-3).

4.1. RQ-1: What Methods Have Been Proposed for Using LLMs in Automated Test Case Generation?

The reviewed studies presented a diverse range of methods for employing LLMs in automated test case generation, categorized into four main approaches: Prompt Design and Engineering, Feedback-driven Approaches, Model Fine-tuning and Pre-training, and Hybrid Approaches.

4.1.1. Prompt Design and Engineering

This category emphasizes the impact of prompt structure and content on the effectiveness of test case generation. Studies have demonstrated that carefully tailoring prompts and embedding domain-specific information, such as bug report details or code context, significantly enhances test case quality and relevance. For instance, Li et al. [29] experimented with varying prompt structures and parameters to optimize the test outputs, whereas Zhang et al. [30] embedded domain knowledge into prompts to improve security test generation. Some studies have tested generic untuned prompts to evaluate the baseline capabilities of LLMs, contrasting simplicity with specificity. A consistent finding across these studies is that well-designed prompts improve performance metrics such as accuracy, test coverage, and fault detection, underscoring the centrality of prompt engineering in leveraging LLMs for software testing.
Table 1 summarizes the key studies that have investigated the role of prompt engineering in test generation. The grouped findings highlight how strategies such as prompt chaining, few-shot examples, domain-specific tailoring, and structured context inputs enhance test coverage, readability, and semantic correctness. The reported limitations reveal common challenges, including hallucinated code, compilation errors, and limited scalability to real-world environments, particularly in settings where the project context is sparse or incomplete.
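As an illustration of the kind of prompt construction these studies investigate, the following Python sketch assembles a structured unit-test prompt from a focal class and its surrounding context; the template wording and the call_llm() helper are assumptions made for illustration rather than the prompt of any specific study.

def build_unit_test_prompt(focal_code: str, class_name: str, extra_context: str = "") -> str:
    # Illustrative template; real studies vary the instructions, examples, and context supplied.
    return (
        "You are an expert Python test engineer.\n"
        f"Write pytest unit tests for the class `{class_name}` shown below.\n"
        "Cover normal behaviour, boundary values, and expected exceptions.\n"
        "Return only runnable Python code.\n\n"
        f"### Focal code\n{focal_code}\n\n"
        f"### Additional context (signatures, docstrings, dependencies)\n{extra_context}\n"
    )

# call_llm() is a hypothetical wrapper around whichever model is being evaluated.
# generated_tests = call_llm(build_unit_test_prompt(source_code, "ShoppingCart", dependency_info))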

4.1.2. Feedback-Driven Approaches

Feedback-driven methods focus on iterative refinement to enhance the quality and reliability of LLM-generated test cases. These approaches incorporate structured prompting, error analysis, and repair mechanisms to address the shortcomings of initial outputs. For example, LIBRO [53] uses feedback loops and stack traces to refine bug reproduction tests. Similarly, ChatUniTest [19] uses a generation-validation-repair cycle to address errors and improve coverage. Such iterative strategies effectively integrate user or automated feedback, ensuring that test cases align with program requirements and provide meaningful coverage.
Table 2 presents studies that employ feedback loops, symbolic execution guidance, program analysis, and hybrid prompting strategies to iteratively improve test quality. These studies demonstrate how LLMs can be steered through runtime insights, static analysis, or user-driven correction mechanisms to produce more accurate, executable, and coverage-optimizing test cases. The table also highlights common limitations, such as sensitivity to noisy feedback, challenges in symbolic constraint solving, and the cognitive gap between simulated and real-world user interactions, as exemplified in studies such as that by Lahiri et al. [54].
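A minimal sketch of such a generation-validation-repair loop is shown below; it assumes a pytest-based project and a hypothetical call_llm() helper, and is a simplified illustration rather than the pipeline of ChatUniTest or LIBRO.

import pathlib
import subprocess
import tempfile

def generate_validate_repair(focal_code: str, max_rounds: int = 3):
    """Simplified generation-validation-repair loop; call_llm() is a hypothetical helper."""
    prompt = f"Write pytest unit tests for the following code:\n{focal_code}"
    for _ in range(max_rounds):
        candidate = call_llm(prompt)  # generation
        test_file = pathlib.Path(tempfile.mkdtemp()) / "test_candidate.py"
        test_file.write_text(candidate)
        result = subprocess.run(["pytest", str(test_file), "-q"],  # validation
                                capture_output=True, text=True)
        if result.returncode == 0:
            return candidate  # tests compile and pass
        # Repair: feed the error output back to the model and retry.
        prompt = ("The tests below failed to run or pass. Fix them and return only code.\n\n"
                  f"### Tests\n{candidate}\n\n### Error output\n{result.stdout}{result.stderr}")
    return None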

4.1.3. Model Fine-Tuning and Pre-Training

The research conducted in this category focuses on optimizing LLMs for test case generation through tailored training approaches. These studies combine pre-training on large datasets with fine-tuning on domain-specific data to enhance performance and contextual relevance. For example, the development of ATHENATEST [28] involved pre-training a sequence-to-sequence model on English and source code and then progressively incorporating contextual details for unit test generation, with subsequent work refining assert statement generation. Similarly, A3Test [78] uses domain adaptation, pre-training on assertions, and fine-tuning for test case generation, while ensuring naming consistency and test signature verification. Rao et al. [20] and Shin et al. [79] further extended these techniques by leveraging project-level adaptation and expanding context windows to capture code-test relationships. These approaches underscore the value of customized training techniques in improving LLM efficacy for domain-specific and complex tasks.
Table 3 consolidates studies that leverage fine-tuning and pre-training to improve LLM test generation capabilities. The grouped findings demonstrate gains in test quality, assertion accuracy, and coverage when LLMs are adapted using project-specific datasets, assertion knowledge, or multilingual corpora. However, the limitations point to recurring concerns, such as high dependency on focal context, reliance on heuristics for alignment, and reduced applicability in settings with missing test data or incompatible infrastructure.
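For orientation, the sketch below shows how a small causal code model could be fine-tuned on code-test pairs using the Hugging Face transformers and datasets libraries; the checkpoint, prompt format, hyperparameters, and the load_code_test_pairs() helper are illustrative assumptions and do not reproduce the training pipelines of ATHENATEST, A3Test, or CAT-LM.

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# load_code_test_pairs() is an assumed helper returning (focal_method, unit_test) string pairs.
pairs = [{"text": f"### Code\n{code}\n### Test\n{test}"}
         for code, test in load_code_test_pairs()]

checkpoint = "Salesforce/codegen-350M-mono"  # illustrative small code model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(checkpoint)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

dataset = Dataset.from_list(pairs).map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="test-gen-finetuned",
                           per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()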

4.1.4. Hybrid Approaches

The methods in this category combine the strengths of LLMs with established methodologies to address the limitations of standalone tools. Techniques such as search-based testing, mutation testing, differential testing, and reinforcement learning are included in this category. For example, mutation testing has been employed in systems such as MuTAP [88], which refines test cases by incorporating surviving mutants into augmented prompts to improve bug detection. Similarly, reinforcement learning enhances LLM-generated test cases, with reward-based feedback systems reducing test smells and promoting adherence to best practices [89]. Tools such as CODAMOSA [21] combine search-based software testing with LLMs to address stalling in coverage, iteratively optimizing test generation through integration with mutation frameworks. These hybrid strategies underscore the potential of combining LLMs with established methods to create robust and efficient test generation tools.
Table 4 highlights the hybrid techniques that integrate LLMs with reinforcement learning, search-based testing, symbolic execution, and other structured testing methodologies. The findings show that LLMs, when guided by mutation testing, coverage heuristics, or ensemble filtering, can surpass one-shot performance by better aligning with domain-specific testing goals. These limitations emphasize challenges in generalizing hybrid frameworks, managing performance bottlenecks, and scaling interactive or multi-agent strategies across broader software domains.
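The sketch below illustrates the general idea of mutation-guided prompt augmentation in Python; call_llm() and run_tests_against() are hypothetical helpers, and the loop is a conceptual simplification inspired by, but not reproducing, tools such as MuTAP.

def mutation_guided_generation(focal_code: str, mutants: list[str], max_rounds: int = 3) -> str:
    """Conceptual mutation-guided refinement loop; the helpers are hypothetical."""
    tests = call_llm(f"Write pytest unit tests for:\n{focal_code}")
    for _ in range(max_rounds):
        # A mutant "survives" if the current tests still pass when run against it.
        surviving = [m for m in mutants if run_tests_against(tests, m) == "passed"]
        if not surviving:
            break  # every mutant is killed; no further refinement needed
        # Augment the prompt with one surviving mutant so the model targets the missed fault.
        tests = call_llm(
            "The tests below do not distinguish the original code from the mutant shown.\n"
            "Add or strengthen test cases so that the mutant fails.\n\n"
            f"### Original code\n{focal_code}\n\n"
            f"### Surviving mutant\n{surviving[0]}\n\n"
            f"### Current tests\n{tests}"
        )
    return tests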

4.2. RQ-2: How Effective Are LLMs in Improving the Quality and Efficiency of Test Case Generation Compared to Traditional Methods?

To evaluate the effectiveness of LLMs in test case generation, the studies were grouped into three categories: Improvement over Existing Tools, No Clear Improvement over Existing Tools, and Mixed or Context-Dependent Outcomes. Some studies document advancements in metrics such as code coverage, test correctness, and usability, demonstrating the potential of LLMs to enhance test generation workflows. Conversely, other studies have revealed performance gaps, with traditional tools such as EvoSuite outperforming LLMs in areas such as compilation success and assertion precision. The mixed-outcomes category further underscores the variability in LLM performance, showing that their effectiveness often hinges on factors such as prompt design, integration strategies, and the specific context of their application. Overall, these categories provide a view of the current state of LLM-driven test case generation, offering insights into their strengths, limitations, and opportunities for further improvement.

4.2.1. Improvement over Existing Tools

This category highlights studies in which LLM-based approaches demonstrate clear advantages over traditional methods and benchmarks in test case generation. For example, [82] demonstrated the effectiveness of pre-training and fine-tuning techniques, reporting an 80% improvement over ATLAS and a 33% improvement over T5 in the top-1 accuracy for test generation tasks. Similarly, [88] emphasized the strength of mutation testing in refining LLM-generated test cases, with MuTAP outperforming Pynguin and the standalone LLMs in bug detection capabilities. CAT-LM [20] also achieved notable gains, specifically in CodeBLEU scores and exact match accuracy, significantly outperforming traditional tools, such as TeCo. Moreover, many studies have reported qualitative benefits such as increased test readability and usability. Developers often preferred LLM-generated test cases, citing their clarity and contextual relevance, as observed in [23,26]. Techniques such as reinforcement learning [89] and differential prompting [90] further demonstrate the potential of LLMs to generate tests that closely align with best practices and achieve superior quality metrics. These studies document noteworthy advancements in metrics, such as code coverage and test correctness, showcasing the potential of LLMs in generating high-quality test cases.
A study by Chowdhury et al. [52] further reinforces this trend by demonstrating that augmenting LLM prompts with static program analysis can lead to substantial performance improvements. In their evaluation on a commercial Java project, test generation success with LLaMA 7B rose from 36% using a baseline prompt to 99% using their static analysis–guided prompt, representing a 175% improvement. Similar gains were observed across other models, underscoring how prompt refinement through structural code analysis can dramatically enhance generation rates while reducing average input length by 90%, from 5295 to 559 tokens.
In a comparative evaluation focused on JavaScript, Godage et al. [51] demonstrated that Claude 3.5 outperformed other LLMs and traditional tools by achieving a test success rate of 93.33%, statement coverage of 98.01%, and a mutation score of 89.23%. These results highlight Claude 3.5’s ability to deliver highly effective and robust test cases in real-world scenarios, surpassing the performance of GPT-4o and other state-of-the-art models.
In addition to these gains, several hybrid approaches also show marked improvements over traditional tools. For example, Zhang et al. [92] demonstrated that combining chain-of-thought prompting with reinforcement learning feedback led to an 88% increase in both line and branch coverage, and more than double the number of bugs detected compared to GPT-3.5 and StarCoder. Similarly, Yang et al. [100] showed that augmenting evolutionary testing with LLM-generated seed inputs and repair strategies improved test coverage on 26% of Python modules and 8% of HumanEval tasks, while resolving a high percentage of test failures. Additionally, the test suite repair process successfully resolved 67.69% of errors in the Python module dataset and 82.32% in HumanEval after at most five LLM-based repair attempts. Ouédraogo et al. [103] introduced BRMiner, which extracted inputs from bug reports to enhance tools like EvoSuite and Randoop, resulting in a 13 percentage point increase in branch coverage and the detection of 58 previously missed bugs across Defects4J projects, while also reducing the test case count by 21%. These results collectively underscore the potential of hybrid LLM techniques to enhance both the effectiveness and efficiency of test generation.

4.2.2. No Clear Improvement over Existing Tools

This category highlights studies in which LLM-generated test cases did not demonstrate significant advantages compared to traditional methods, such as EvoSuite. For instance, Siddiq et al. [32] found that EvoSuite consistently outperformed LLMs in metrics such as code coverage and correctness, generating 160 compilable unit tests compared to 130 produced by GPT-3.5-Turbo. Similarly, Yi et al. [23] reported that ChatGPT’s performance in mutation analysis metrics did not provide a significant advantage over EvoSuite, underscoring the limitations in fault detection capabilities. Additionally, Tang et al. [18] observed that, while ChatGPT-generated tests were highly readable, EvoSuite demonstrated greater precision in assertion correctness and achieved superior code coverage. Although these studies acknowledge certain benefits of using LLMs in test generation, such as improved test readability, they also report critical issues including poor code coverage, lower compilability, and reduced correctness. These findings reveal the limitations of LLM-generated test cases and underscore the need for further refinement, enhanced methodologies, and better integration with traditional testing tools to ensure a consistent and reliable performance. Even studies showing high performance, such as the work by Godage et al. [51] on Claude 3.5, acknowledge the limitation of being restricted to JavaScript, suggesting that further validation across other programming languages is needed to confirm generalizability.

4.2.3. Mixed/Context-Dependent Outcomes

The Mixed/Context-dependent Outcomes category highlights studies in which the effectiveness of LLM-generated test cases varies based on the specific context and implementation. While these studies demonstrate strengths such as enhanced readability, the ability to understand business logic, and improved collaboration in certain workflows, their performance is not consistently superior to traditional methods across all benchmarks. Some studies have also noted challenges, such as inconsistent test coverage, compilation issues, and the lack of comprehensive evaluations, emphasizing the need for tailored approaches and further empirical validation. As demonstrated by Zhang et al., ChatGPT outperformed traditional security testing tools, such as SIEGE and TRANSFER, in detecting vulnerabilities, although it faced challenges in generating executable tests in certain cases [30]. Similarly, Bhatia et al. observed that ChatGPT’s performance varied across code types, with iterative refinements significantly improving statement coverage for function- and class-based code but yielding less consistent results for procedural scripts [60]. Expanding on this variability, Yu et al. highlighted the strengths of LLMs in understanding business logic and facilitating collaboration within teams, while also noting limitations such as difficulties in retaining contextual memory and generating scripts for complex scenarios [55]. Li et al. [50] further emphasize this variability by introducing the follow-up question method, where LLMs are prompted to re-evaluate the code after an initial bug detection attempt. This simple interaction boosted bug localization capabilities in models such as GPT-3.5 and GPT-4, revealing a form of self-correction. However, not all models benefited: ERNIE Bot showed no gain due to an already high baseline, and SparkDesk and New Bing exhibited limited responsiveness to iterative prompts. Additionally, despite these improvements, test case quality across all models remained modest, with the best-performing model achieving only 40% on the test case quality composite metric and some models, such as New Bing, generating non-executable tests for all 12 software packages. These results underscore how model architecture, interaction style, and prompt design can significantly influence LLM performance in practice. Together, these studies illustrate the context-dependent nature of LLM performance, influenced by factors such as training, the lack of feedback mechanisms, code structure, prompt design, and application domain.

4.3. RQ-3: What Future Directions Are Suggested in the Literature for Using LLMs in Test Case Generation?

The future directions identified in this review stem from the recurring themes and challenges observed in the examined studies. While some studies explicitly propose actionable recommendations, others focus on addressing the limitations or gaps encountered in their methodologies. These future directions are organized into five categories, each representing a critical dimension for enhancing LLM-driven test generation. Collectively, they provide a cohesive framework for advancing this field, with the aim of broadening its applicability, overcoming practical constraints, and inspiring innovative methodologies.

4.3.1. Hybrid Approaches and Integration with Existing Tools

Studies have highlighted the potential of combining LLM-based techniques with traditional software testing tools to create more robust and practical solutions. For example, Lemieux et al. [21] proposed integrating LLMs with search-based software testing frameworks, as demonstrated in CODAMOSA, which dynamically prompts LLMs during stall phases to improve test coverage. Similarly, Tufano et al. [82] suggested embedding LLMs as IDE plugins and combining them with existing test case generation tools, such as EvoSuite, to collaboratively generate or refine test cases. These hybrid approaches aim to leverage the strengths of both paradigms to optimize coverage, streamline workflows and improve the overall practicality of test generation.

4.3.2. Prompt Engineering

Future research in prompt engineering points toward refining input prompts to enhance LLM performance. As demonstrated by Differential Prompting [90], iterative adjustments improve the identification of failure-inducing test cases. Meanwhile, [54] highlighted the value of scenario-based test generation, suggesting that tailored prompts could address challenges such as variability in outputs and alignment with user expectations. Advanced prompting strategies, such as dynamic prompts that adapt based on intermediate outputs or contextual changes, are also promising for improving LLM-generated tests.

4.3.3. Project-Specific Knowledge

Integrating domain-specific or project-level knowledge into LLM workflows is a key strategy to improve their relevance and effectiveness. The authors of [19] proposed augmenting ChatUniTest with static analysis to extract project-specific details, thereby enhancing the context of the generated tests. Similarly, the authors of [79] emphasized the importance of domain adaptation by tailoring CodeT5 to specific projects, which significantly improved the test coverage metrics. Incorporating fine-tuning with application-specific data, such as bug reports and usage logs, was also highlighted in [53] as a means of generating tests that align better with project requirements.

4.3.4. Scalability and Performance

Addressing computational inefficiencies and scalability issues is critical for LLM-driven test case generation. Alagarsamy et al. [78] demonstrated significant efficiency gains with their A3Test framework, reducing test generation time by 97.2% (to 2.9 h) compared to ATHENATEST, showcasing LLMs’ potential to accelerate testing workflows. Similarly, Bayri and Demirel [37] found that ChatGPT-powered test generation streamlined the software development lifecycle by offloading test creation tasks, enabling developers to focus on design and user experience. However, challenges persist, as Chen et al. [25] identified increased computational costs due to iterative test case generation and refinement in their CODET methodology. Yuan et al. [26] proposed mitigations, such as integrating locally deployed models into their ChatTester framework and compiling only generated tests, reducing API call overheads by up to 30% in large projects. These findings highlight that while LLMs enhance efficiency in automated testing by reducing manual effort, optimizing deployment is essential to address computational bottlenecks for scalable, practical applications.

4.3.5. Expanding Applicability: Languages, Models, Metrics and Benchmarks

Expanding the scope of LLM-based test generation involves broadening its applicability to additional programming languages, exploring alternative LLM architectures, and developing more comprehensive benchmarks. Rao et al. [20] proposed extending CAT-LM to support more programming languages and improve integrability across diverse projects. Dakhel et al. [88] emphasized the need to evaluate MuTAP using larger datasets and a wider range of programming languages. Additionally, Steenhoek et al. [89] highlighted the importance of exploring dynamic metrics such as code coverage and conducting evaluations on complex testing scenarios to better reflect real-world conditions.

5. Discussion

The rapid growth of LLM-based test generation research has led to a wide range of approaches and innovations; however, the field remains fragmented in key areas. To provide a clearer picture of the progress made and the significant challenges that remain, this section synthesizes insights from reviewed studies. It identifies emerging themes, persistent gaps, and concrete directions for future work that are often not explicitly stated in the primary literature but are essential for guiding the next phase of research and development.

5.1. Synergistic Potential of Combined Approaches

Across the four methodological categories, no single approach consistently achieved the best results across all the tasks, benchmarks, or domains. Instead, the most substantial performance gains emerge when these techniques are used together. For example, structured prompts often show improved effectiveness when paired with iterative feedback loops, and fine-tuned models perform more reliably when embedded within established testing workflows that use mutation- or search-based guidance. This suggests that the key to further advancement lies in building systems that orchestrate multiple strategies, rather than relying on isolated innovations. Future work should explore how to design modular test generation pipelines that combine these strengths in a flexible and extensible manner.

5.2. Fragmentation in Evaluation Practices

Although the field is evolving rapidly, the lack of standardized evaluation practices remains a significant barrier to progress. Studies have reported a range of performance metrics, including test coverage, mutation score, syntactic correctness, and qualitative preferences by developers. These metrics are often applied to narrow datasets, such as HumanEval, QuixBugs, or SF110, which vary in structure, scope, and difficulty. Without shared benchmarks and evaluation protocols, it is difficult to compare studies directly or conduct meta-analyses that lead to generalizable conclusions. A critical next step for the community is to develop unified evaluation frameworks that include diverse datasets, multiple programming languages, and comprehensive metrics that reflect both the test quality and practical relevance.

5.3. The Role and Risks of Contextual Information

The findings across studies consistently point to the value of providing contextual information to LLMs. This context may include test scaffolding, developer-written documentation, configuration files, or code dependency structures. When this information is available, LLMs tend to produce more accurate, complete, and executable test cases. However, the availability of this context varies significantly across projects, especially outside of controlled academic settings. In many real-world systems, particularly legacy or modular architectures, such contextual signals may be missing or difficult to infer. Future research could explore ways to make LLMs more resilient in low-context environments by enabling them to infer or request missing information. Integration with agentic tools that can access codebases, parse documentation, and dynamically retrieve relevant data may help bridge this gap and improve the reliability across a wider range of software projects.
One promising direction involves enhancing the ability of LLM-powered tools to parse and interpret complex software projects effectively. If these tools can extract relevant contexts with greater precision and efficiency, they can reduce the number of redundant or extraneous queries, ultimately decreasing token consumption. This improvement would lower the cost of usage and make agentic tools more accessible for frequent use. Reducing credit consumption is particularly important for teams with budget constraints or usage caps, and it would encourage the wider integration of LLM-based assistance within real-world development pipelines.

5.4. Real-World Applications and Industrial Evaluation

Despite the growing body of academic research, very few studies have assessed LLM-based test generation methods in real industrial environments. The few that have attempted this transition often report that the performance of these systems deteriorates when confronted with challenges typical of production environments. These challenges include outdated libraries, proprietary dependencies, unstructured legacy code, and limited documentation. Therefore, the current reliance on simplified benchmarks may overestimate the practical value of the proposed methods. To address this, future studies should include comprehensive industrial case studies that evaluate LLMs within actual development pipelines. Partnerships with industry could also play a crucial role in understanding the true potential and limitations of these methods.

5.5. Gaps in Tool Ecosystem and Language Coverage

Although large language models can generate code in a wide range of programming languages, the tooling used to validate, execute, or refine LLM-generated test cases is often limited to a small subset of languages, particularly Java and Python. For example, many studies use tools such as EvoSuite or Pynguin, which are well established for Java and Python, respectively. However, developers working in other ecosystems, such as TypeScript and C++, may find it difficult to apply these techniques without equivalent tools. This disparity restricts the generalizability and usefulness of hybrid approaches that rely on language-specific infrastructures. An important direction for future work is to expand the tool support across languages and platforms. This may involve creating new static or dynamic analysis tools or developing intermediate representations that are language-agnostic and can bridge the gap between LLM-generated outputs and existing testing frameworks.

5.6. Beyond Functional Correctness: Addressing Non-Functional Requirements

While much of the current research on LLM-based test generation centers on functional correctness and code coverage, modern software systems demand assurance across a broader set of quality attributes. These include performance under load, resistance to security vulnerabilities, adherence to accessibility standards and long-term maintainability. Traditional automated testing tools often struggle to account for these nonfunctional requirements because of their complex, context-dependent, and sometimes subjective nature.
However, LLMs offer unique advantages. Their ability to understand natural language and reason over both code and descriptive documentation enables them to generate test cases that address performance bottlenecks, simulate security threat scenarios, or validate accessibility rules based on standard guidelines. For example, LLMs could be prompted to create tests that explore edge cases under heavy usage, check for the presence of insecure patterns, and verify compliance with accessibility labels and keyboard navigation support.
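As a purely illustrative example of such prompting, the sketch below asks a model for security- and performance-oriented tests of a hypothetical endpoint; the wording, the endpoint, and the call_llm() helper are assumptions rather than a technique evaluated in the reviewed studies.

non_functional_prompt = (
    "You are a security- and performance-minded test engineer.\n"
    "For the HTTP endpoint described below, write pytest tests that:\n"
    "1. send oversized and malformed payloads and assert the service rejects them safely,\n"
    "2. probe common injection patterns in query parameters,\n"
    "3. issue 100 concurrent requests and assert the 95th-percentile latency stays under 500 ms.\n\n"
    "### Endpoint description\n"
    "POST /api/v1/orders accepts a JSON body with fields 'item_id' (int) and "
    "'quantity' (1-100) and returns 201 on success.\n"
)
# suggested_tests = call_llm(non_functional_prompt)  # call_llm() is a hypothetical wrapper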
As future systems increasingly operate in high-assurance and user-centric domains, the ability of LLMs to support testing for non-functional properties may become as important as their support for basic correctness. This opens a new frontier for LLM-based testing research, where test generation is not only about detecting logic errors but also about ensuring robust software design.
While this review focuses on general software test case generation, recent studies demonstrate LLMs’ potential in generating critical and environmentally realistic test scenarios for autonomous systems. For instance, Xu et al. [104] propose LLMTester, which uses a “generate-test-feedback” pipeline to create failure-inducing scenarios for decision-making policies, addressing the challenge of extreme test case generation such as edge cases in autonomous driving. Similarly, Duvvuru et al. [105] introduce AUTOSIMTEST, which generates small Uncrewed Aerial Systems (sUAS) scenarios with realistic environmental details like fog and wind, ensuring consistency with real-world conditions. In autonomous driving, Petrovic et al. [106] leverage LLMs to automate scenario generation in CARLA, handling complex automotive scenarios, including sensor data processing and actuator activation, with a time budget parameter to ensure timely task completion. These studies suggest that LLMs can extend beyond code-based testing to address complex, domain-specific challenges.
Collectively, these observations reveal that the field is undergoing a transition. It is evolving from isolated experimentation to more integrated, practical, and comprehensive approaches for test generation. As LLMs become more deeply embedded in development workflows, collaboration between industry and academia could enable faster progress by addressing foundational limitations, embracing standardization, and broadening the scope of testing to include real-world contexts and quality attributes. The insights outlined in this discussion offer a roadmap for moving beyond proof-of-concept studies and toward the development of robust LLM-driven testing workflows and ecosystems.

6. Threats to Validity

6.1. Threats to Internal Validity

Despite adhering to a structured and well-defined methodology for this review, certain limitations may have affected the study’s internal validity. In particular, the use of Litmaps, a citation network exploration tool, introduced an element of subjectivity in the literature discovery process. After identifying an initial set of 26 primary studies through manual searches on databases such as IEEE Xplore, these studies were imported into Litmaps to identify additional relevant work using its “Explore Related Articles” functionality, and the search was further refined by marking selected papers as “More Like This” to guide subsequent recommendations. Although this approach was effective in expanding the corpus to 76 primary studies, it relies on a semi-manual process that is not entirely reproducible. The identification of similar papers within Litmaps is influenced by user interaction and recommendation outputs that are not fully controlled through transparent or standardized query parameters.
This methodology introduces a potential threat to internal validity as the process of selecting follow-up papers can vary across researchers and sessions. To mitigate this risk, all studies identified through Litmaps were subjected to the same predefined inclusion and exclusion criteria as those found through structured database searches. Furthermore, the role of Litmaps in supplementing the search strategy is clearly documented to maintain transparency.
Another important source of internal validity threat lies in the classification of studies. The review grouped primary studies into four methodological categories: prompt design and engineering, feedback-driven approaches, model fine-tuning and pre-training, and hybrid approaches. It also grouped studies into three effectiveness categories. However, such categorizations inherently require interpretation and abstraction. Other researchers may opt for more granular or broader classifications depending on how they weigh overlapping contributions. In several cases, studies spanned multiple techniques but were placed in the category that best captured their primary contribution. Similarly, the classification of effectiveness, such as improvement over existing tools or mixed outcomes, was based on empirical results reported in the studies. These results were often contextual and varied in scope. These decisions, while made transparently and supported with evidence, introduce an element of subjectivity that could influence replicability.
Additionally, the review began with IEEE Xplore as the primary search database, with supplementary searches in other databases such as ACM Digital Library and Scopus conducted later to capture recent publications. While this approach was sufficient to identify a comprehensive set of studies, it may have influenced the topical emphasis of the review. Alternative starting points that placed equal initial weight on ACM, Scopus, or Google Scholar might have surfaced a different distribution of studies, particularly from domains or venues less represented in IEEE. Although steps were taken to broaden coverage through Litmaps and supplemental searches, the initial reliance on a single database introduces potential bias in the literature collection process.
Finally, the review did not apply a formal quality assessment protocol to evaluate or rank the included studies. All papers were included based on relevance and inclusion criteria without formally scoring their methodological rigor or publication type. As a result, conference papers, journal articles, and preprints were considered equally in the synthesis, despite differences in peer review and evaluation depth. While we highlighted major differences in study design and reporting throughout the discussion, the absence of a structured quality appraisal may affect the reliability of certain aggregated conclusions. Future reviews should consider using standardized quality assessment checklists to improve the transparency and validity of the study selection and interpretation process.

6.2. Threats to External Validity

The extent to which the findings of this review can be generalized beyond the included studies is limited by several factors. Most of the primary studies evaluated LLM-based test case generation on academic or benchmark datasets such as HumanEval, QuixBugs, and SF110, which do not fully represent the complexity of real-world software projects. Consequently, the effectiveness, usability, and scalability reported in these settings may not translate directly to production environments, especially those involving legacy codebases, domain-specific architectures, or limited documentation.
Furthermore, the reviewed literature predominantly focuses on Python and Java, largely due to the availability of testing tools and evaluation benchmarks in these languages. While this focus allows for consistent comparison across studies, it limits the applicability of the conclusions to less represented programming languages such as C++, Rust, TypeScript, or domain-specific languages, where tooling and LLM support may be less mature. As a result, the generalizability of findings across diverse language ecosystems and industrial software contexts remains an open question. Future studies should evaluate LLM-based test generation in broader settings to improve external validity and increase practical relevance.

7. Conclusions and Future Work

This review examines the use of LLMs in automated test case generation, categorizing the proposed methods into four key areas: prompt design and engineering, feedback-driven approaches, model fine-tuning and pre-training, and hybrid approaches. These methods harness LLM capabilities to automate test generation, optimize test coverage, and improve usability, leveraging strategies such as tailored prompts, iterative refinement, and integration with traditional testing tools.
The findings highlight promising advancements, with LLMs demonstrating improvements in metrics such as code coverage, usability, and correctness. However, challenges such as inconsistent performance, compilation errors, and high computational demands persist and necessitate further refinement. These limitations underscore the importance of ensuring that LLM-driven approaches are scalable, reliable, and practical for diverse testing scenarios.
Future research should focus on expanding LLM applicability across multiple programming languages, enhancing domain-specific knowledge integration, and exploring hybrid approaches to maximize their potential. Addressing computational inefficiencies and scalability concerns is also critical. Additionally, ethical considerations, such as data privacy and bias mitigation, must be prioritized to ensure the fair and responsible deployment of LLM-driven test generation systems.
In conclusion, LLMs hold significant promise for transforming software testing by automating test generation, improving test quality, and reducing the manual effort traditionally required in software development workflows. Although their potential has been demonstrated in various studies, challenges such as scalability, computational demands, and integration with existing tools highlight the need for continued research and innovation. Bridging the gap between academic advancements and industrial practices is crucial for validating and refining these methods for real-world applications. By addressing these challenges and building on the strengths of current approaches, LLMs can evolve into indispensable tools that enhance the efficiency and reliability of software testing processes.

Author Contributions

Writing—Original Draft Preparation, A.C.; Supervision and Writing—Review and Editing, Q.H.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Putra, S.J.; Sugiarti, Y.; Prayoga, B.Y.; Samudera, D.W.; Khairani, D. Analysis of Strengths and Weaknesses of Software Testing Strategies: Systematic Literature Review. In Proceedings of the 2023 11th International Conference on Cyber and IT Service Management (CITSM), Makassar, Indonesia, 10–11 November 2023; pp. 1–5. [Google Scholar]
  2. Gurcan, F.; Dalveren, G.G.M.; Cagiltay, N.E.; Roman, D.; Soylu, A. Evolution of Software Testing Strategies and Trends: Semantic Content Analysis of Software Research Corpus of the Last 40 Years. IEEE Access 2022, 10, 106093–106109. [Google Scholar] [CrossRef]
  3. Pudlitz, F.; Brokhausen, F.; Vogelsang, A. What Am I Testing and Where? Comparing Testing Procedures Based on Lightweight Requirements Annotations. Empir. Softw. Eng. 2020, 25, 2809–2843. [Google Scholar] [CrossRef]
  4. Kassab, M.; Laplante, P.; Defranco, J.; Neto, V.V.G.; Destefanis, G. Exploring the Profiles of Software Testing Jobs in the United States. IEEE Access 2021, 9, 68905–68916. [Google Scholar] [CrossRef]
  5. De Silva, D.; Hewawasam, L. The Impact of Software Testing on Serverless Applications. IEEE Access 2024, 12, 51086–51099. [Google Scholar] [CrossRef]
  6. Alshahwan, N.; Harman, M.; Marginean, A. Software Testing Research Challenges: An Industrial Perspective. In Proceedings of the 2023 IEEE Conference on Software Testing, Verification and Validation (ICST), Dublin, Ireland, 16–20 April 2023; pp. 1–10. [Google Scholar]
  7. Aniche, M.; Treude, C.; Zaidman, A. How Developers Engineer Test Cases: An Observational Study. IEEE Trans. Softw. Eng. 2021, 48, 4925–4946. [Google Scholar] [CrossRef]
  8. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2024, arXiv:2303.18223. [Google Scholar] [CrossRef]
  9. Wang, J.; Huang, Y.; Chen, C.; Liu, Z.; Wang, S.; Wang, Q. Software Testing With Large Language Models: Survey, Landscape, and Vision. IEEE Trans. Softw. Eng. 2024, 50, 911–936. [Google Scholar] [CrossRef]
  10. Chen, L.; Guo, Q.; Jia, H.; Zeng, Z.; Wang, X.; Xu, Y.; Wu, J.; Wang, Y.; Gao, Q.; Wang, J.; et al. A Survey on Evaluating Large Language Models in Code Generation Tasks. arXiv 2024, arXiv:2408.16498. [Google Scholar] [CrossRef]
  11. Raiaan, M.A.K.; Mukta, M.d.S.H.; Fatema, K.; Fahad, N.M.; Sakib, S.; Mim, M.M.J.; Ahmad, J.; Ali, M.E.; Azam, S. A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges. IEEE Access 2024, 12, 26839–26874. [Google Scholar] [CrossRef]
  12. Fan, A.; Gokkaya, B.; Harman, M.; Lyubarskiy, M.; Sengupta, S.; Yoo, S.; Zhang, J.M. Large Language Models for Software Engineering: Survey and Open Problems. In Proceedings of the 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), Melbourne, Australia, 14–20 May 2023; pp. 31–53. [Google Scholar]
  13. ISO/IEC/IEEE 24765:2017(E); ISO/IEC/IEEE International Standard—Systems and Software Engineering—Vocabulary. IEEE: New York, NY, USA, 2017; pp. 1–541. [CrossRef]
  14. Mayeda, M.; Andrews, A. Evaluating Software Testing Techniques: A Systematic Mapping Study. In Advances in Computers; Missouri University of Science and Technology: Rolla, MO, USA, 2021; ISBN 978-0-12-824121-9. [Google Scholar]
  15. Lonetti, F.; Marchetti, E. Emerging Software Testing Technologies. In Advances in Computers; Elsevier: Amsterdam, The Netherlands, 2018; Volume 108, pp. 91–143. ISBN 978-0-12-815119-8. [Google Scholar]
  16. Clark, A.G.; Walkinshaw, N.; Hierons, R.M. Test Case Generation for Agent-Based Models: A Systematic Literature Review. Inf. Softw. Technol. 2021, 135, 106567. [Google Scholar] [CrossRef]
  17. Hou, X.; Zhao, Y.; Liu, Y.; Yang, Z.; Wang, K.; Li, L.; Luo, X.; Lo, D.; Grundy, J.; Wang, H. Large Language Models for Software Engineering: A Systematic Literature Review. arXiv 2024, arXiv:2308.10620. [Google Scholar] [CrossRef]
  18. Tang, Y.; Liu, Z.; Zhou, Z.; Luo, X. ChatGPT vs. SBST: A Comparative Assessment of Unit Test Suite Generation. IEEE Trans. Softw. Eng. 2024, 50, 1340–1359. [Google Scholar] [CrossRef]
  19. Chen, Y.; Hu, Z.; Zhi, C.; Han, J.; Deng, S.; Yin, J. ChatUniTest: A Framework for LLM-Based Test Generation. arXiv 2024, arXiv:2305.04764. [Google Scholar] [CrossRef]
  20. Rao, N.; Jain, K.; Alon, U.; Goues, C.L.; Hellendoorn, V.J. CAT-LM Training Language Models on Aligned Code And Tests. In Proceedings of the 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 11–15 September 2023; pp. 409–420. [Google Scholar] [CrossRef]
  21. Lemieux, C.; Inala, J.P.; Lahiri, S.K.; Sen, S. CodaMosa: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models. In Proceedings of the 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 14–20 May 2023; pp. 919–931. [Google Scholar] [CrossRef]
  22. Zhang, Q.; Fang, C.; Gu, S.; Shang, Y.; Chen, Z.; Xiao, L. Large Language Models for Unit Testing: A Systematic Literature Review. arXiv 2025, arXiv:2506.15227. [Google Scholar] [CrossRef]
  23. Yi, G.; Chen, Z.; Chen, Z.; Wong, W.E.; Chau, N. Exploring the Capability of ChatGPT in Test Generation. In Proceedings of the 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security Companion (QRS-C), Chiang Mai, Thailand, 22–26 October 2023; pp. 72–80. [Google Scholar] [CrossRef]
  24. Elvira, T.; Procko, T.T.; Couder, J.O.; Ochoa, O. Digital Rubber Duck: Leveraging Large Language Models for Extreme Programming. In Proceedings of the 2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE), Las Vegas, NV, USA, 24–27 July 2023; pp. 295–304. [Google Scholar] [CrossRef]
  25. Chen, B.; Zhang, F.; Nguyen, A.; Zan, D.; Lin, Z.; Lou, J.-G.; Chen, W. CodeT: Code Generation with Generated Tests. arXiv 2022, arXiv:2207.10397. [Google Scholar] [CrossRef]
  26. Yuan, Z.; Lou, Y.; Liu, M.; Ding, S.; Wang, K.; Chen, Y.; Peng, X. No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation. arXiv 2024, arXiv:2305.04207. [Google Scholar] [CrossRef]
  27. Kitchenham, B.; Charters, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering; EBSE: Keele, UK, 2007. [Google Scholar]
  28. Tufano, M.; Drain, D.; Svyatkovskiy, A.; Deng, S.K.; Sundaresan, N. Unit Test Case Generation with Transformers and Focal Context. arXiv 2021, arXiv:2009.05617. [Google Scholar] [CrossRef]
  29. Li, V.; Doiron, N. Prompting Code Interpreter to Write Better Unit Tests on Quixbugs Functions. arXiv 2023, arXiv:2310.00483. [Google Scholar] [CrossRef]
  30. Zhang, Y.; Song, W.; Ji, Z.; Yao, D.; Meng, N. How well does LLM generate security tests? arXiv 2023, arXiv:2310.00710. [Google Scholar] [CrossRef]
  31. Guilherme, V.; Vincenzi, A. An Initial Investigation of ChatGPT Unit Test Generation Capability. In Proceedings of the 8th Brazilian Symposium on Systematic and Automated Software Testing, Campo Grande, MS, Brazil, 25–29 September 2023; pp. 15–24. [Google Scholar] [CrossRef]
  32. Siddiq, M.L.; Da Silva Santos, J.C.; Tanvir, R.H.; Ulfat, N.; Al Rifat, F.; Lopes, V.C. Using Large Language Models to Generate JUnit Tests: An Empirical Study. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, Salerno, Italy, 18–21 June 2024; pp. 313–322. [CrossRef]
  33. Yang, L.; Yang, C.; Gao, S.; Wang, W.; Wang, B.; Zhu, Q.; Chu, X.; Zhou, J.; Liang, G.; Wang, Q.; et al. On the Evaluation of Large Language Models in Unit Test Generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1607–1619. [Google Scholar] [CrossRef]
  34. Chang, H.-F.; Shirazi, M.S. A Systematic Approach for Assessing Large Language Models’ Test Case Generation Capability. arXiv 2025, arXiv:2502.02866. [Google Scholar] [CrossRef]
  35. Xu, J.; Pang, B.; Qu, J.; Hayashi, H.; Xiong, C.; Zhou, Y. CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification. arXiv 2025, arXiv:2502.08806. [Google Scholar] [CrossRef]
  36. Wang, Y.; Xia, C.; Zhao, W.; Du, J.; Miao, C.; Deng, Z.; Yu, P.S.; Xing, C. ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms. arXiv 2025, arXiv:2502.06556. [Google Scholar] [CrossRef]
  37. Bayrı, V.; Demirel, E. AI-Powered Software Testing: The Impact of Large Language Models on Testing Methodologies. In Proceedings of the 2023 4th International Informatics and Software Engineering Conference (IISEC), Ankara, Türkiye, 21–22 December 2023; pp. 1–4. [Google Scholar] [CrossRef]
  38. Plein, L.; Ouédraogo, W.C.; Klein, J.; Bissyandé, T.F. Automatic Generation of Test Cases based on Bug Reports: A Feasibility Study with Large Language Models. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, Lisbon, Portugal, 14–20 April 2024; pp. 360–361. [Google Scholar] [CrossRef]
  39. Koziolek, H.; Ashiwal, V.; Bandyopadhyay, S.; Chandrika, K.R. Automated Control Logic Test Case Generation using Large Language Models. arXiv 2024, arXiv:2405.01874. [Google Scholar] [CrossRef]
  40. Yin, H.; Mohammed, H.; Boyapati, S. Leveraging Pre-Trained Large Language Models (LLMs) for On-Premises Comprehensive Automated Test Case Generation: An Empirical Study. In Proceedings of the 2024 9th International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), Okinawa, Japan, 21–23 November 2024; pp. 597–607. [Google Scholar] [CrossRef]
  41. Rao, N.; Gilbert, E.; Green, H.; Ramananandro, T.; Swamy, N.; Le Goues, C.; Fakhoury, S. DiffSpec: Differential Testing with LLMs using Natural Language Specifications and Code Artifacts. arXiv 2025, arXiv:2410.04249. [Google Scholar] [CrossRef]
  42. Zhang, Q.; Shang, Y.; Fang, C.; Gu, S.; Zhou, J.; Chen, Z. TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models. arXiv 2024, arXiv:2409.17561. [Google Scholar] [CrossRef]
  43. Jiri, M.; Emese, B.; Medlen, P. Leveraging Large Language Models for Python Unit Test. In Proceedings of the 2024 IEEE International Conference on Artificial Intelligence Testing (AITest), Shanghai, China, 15–18 July 2024; pp. 95–100. [Google Scholar] [CrossRef]
  44. Ryan, G.; Jain, S.; Shang, M.; Wang, S.; Ma, X.; Ramanathan, M.K.; Ray, B. Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM. arXiv 2024, arXiv:2402.00097. [Google Scholar] [CrossRef]
  45. Gao, S.; Wang, C.; Gao, C.; Jiao, X.; Chong, C.Y.; Gao, S.; Lyu, M. The Prompt Alchemist: Automated LLM-Tailored Prompt Optimization for Test Case Generation. arXiv 2025, arXiv:2501.01329. [Google Scholar] [CrossRef]
  46. Wang, W.; Yang, C.; Wang, Z.; Huang, Y.; Chu, Z.; Song, D.; Zhang, L.; Chen, A.R.; Ma, L. TESTEVAL: Benchmarking Large Language Models for Test Case Generation. arXiv 2025, arXiv:2406.04531. [Google Scholar] [CrossRef]
  47. Ouédraogo, W.C.; Kaboré, K.; Li, Y.; Tian, H.; Koyuncu, A.; Klein, J.; Lo, D.; Bissyandé, T.F. Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation. arXiv 2024, arXiv:2407.00225. [Google Scholar] [CrossRef]
  48. Khelladi, D.E.; Reux, C.; Acher, M. Unify and Triumph: Polyglot, Diverse, and Self-Consistent Generation of Unit Tests with LLMs. arXiv 2025, arXiv:2503.16144. [Google Scholar] [CrossRef]
  49. Sharma, R.K.; Halleux, J.D.; Barke, S.; Zorn, B. PromptPex: Automatic Test Generation for Language Model Prompts. arXiv 2025, arXiv:2503.05070. [Google Scholar] [CrossRef]
  50. Li, Y.; Liu, P.; Wang, H.; Chu, J.; Wong, W.E. Evaluating large language models for software testing. Comput. Stand. Interfaces 2025, 93, 103942. [Google Scholar] [CrossRef]
  51. Godage, T.; Nimishan, S.; Vasanthapriyan, S.; Palanisamy, V.; Joseph, C.; Thuseethan, S. Evaluating the Effectiveness of Large Language Models in Automated Unit Test Generation. In Proceedings of the 2025 5th International Conference on Advanced Research in Computing (ICARC), Belihuloya, Sri Lanka, 19–20 February 2025; pp. 1–6. [Google Scholar] [CrossRef]
  52. Roy Chowdhury, S.; Sridhara, G.; Raghavan, A.K.; Bose, J.; Mazumdar, S.; Singh, H.; Sugumaran, S.B.; Britto, R. Static Program Analysis Guided LLM Based Unit Test Generation. In Proceedings of the 8th International Conference on Data Science and Management of Data (12th ACM IKDD CODS and 30th COMAD), Jodhpur, India, 18–21 December 2024; pp. 279–283. [Google Scholar] [CrossRef]
  53. Kang, S.; Yoon, J.; Yoo, S. Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction. In Proceedings of the 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 14–20 May 2023; pp. 2312–2323. [Google Scholar] [CrossRef]
  54. Lahiri, S.K.; Fakhoury, S.; Naik, A.; Sakkas, G.; Chakraborty, S.; Musuvathi, M.; Choudhury, P.; von Veh, C.; Inala, J.P.; Wang, C.; et al. Interactive Code Generation via Test-Driven User-Intent Formalization. arXiv 2023, arXiv:2208.05950. [Google Scholar] [CrossRef]
  55. Yu, S.; Fang, C.; Ling, Y.; Wu, C.; Chen, Z. LLM for Test Script Generation and Migration: Challenges, Capabilities, and Opportunities. In Proceedings of the 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security (QRS), Chiang Mai, Thailand, 22–26 October 2023; pp. 206–217. [Google Scholar] [CrossRef]
  56. Nashid, N.; Bouzenia, I.; Pradel, M.; Mesbah, A. Issue2Test: Generating Reproducing Test Cases from Issue Reports. arXiv 2025, arXiv:2503.16320. [Google Scholar] [CrossRef]
  57. Chen, M.; Liu, Z.; Tao, H.; Hong, Y.; Lo, D.; Xia, X.; Sun, J. B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1693–1705. [Google Scholar] [CrossRef]
  58. Ni, C.; Wang, X.; Chen, L.; Zhao, D.; Cai, Z.; Wang, S.; Yang, X. CasModaTest: A Cascaded and Model-agnostic Self-directed Framework for Unit Test Generation. arXiv 2024, arXiv:2406.15743. [Google Scholar] [CrossRef]
  59. Liu, R.; Zhang, Z.; Hu, Y.; Lin, Y.; Gao, X.; Sun, H. LLM-based Unit Test Generation for Dynamically-Typed Programs. arXiv 2025, arXiv:2503.14000. [Google Scholar] [CrossRef]
  60. Bhatia, S.; Gandhi, T.; Kumar, D.; Jalote, P. Unit Test Generation using Generative AI: A Comparative Performance Analysis of Autogeneration Tools. In Proceedings of the 1st International Workshop on Large Language Models for Code, Lisbon, Portugal, 20 April 2024; pp. 54–61. [Google Scholar] [CrossRef]
  61. Schäfer, M.; Nadi, S.; Eghbali, A.; Tip, F. An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation. IEEE Trans. Softw. Eng. 2024, 50, 85–105. [Google Scholar] [CrossRef]
  62. Kumar, N.A.; Lan, A. Using Large Language Models for Student-Code Guided Test Case Generation in Computer Science Education. arXiv 2024, arXiv:2402.07081. [Google Scholar] [CrossRef]
  63. Wang, Z.; Liu, K.; Li, G.; Jin, Z. HITS: High-coverage LLM-based Unit Test Generation via Method Slicing. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 1258–1268. [Google Scholar] [CrossRef]
  64. Etemadi, K.; Mohammadi, B.; Su, Z.; Monperrus, M. Mokav: Execution-driven Differential Testing with LLMs. arXiv 2024, arXiv:2406.10375. [Google Scholar] [CrossRef]
  65. Yang, C.; Chen, J.; Lin, B.; Zhou, J.; Wang, Z. Enhancing LLM-based Test Generation for Hard-to-Cover Branches via Program Analysis. arXiv 2024, arXiv:2404.04966. [Google Scholar] [CrossRef]
  66. Alshahwan, N.; Chheda, J.; Finegenova, A.; Gokkaya, B.; Harman, M.; Harper, I.; Marginean, A.; Sengupta, S.; Wang, E. Automated Unit Test Improvement using Large Language Models at Meta. arXiv 2024, arXiv:2402.09171. [Google Scholar] [CrossRef]
  67. Pizzorno, J.A.; Berger, E.D. CoverUp: Effective High Coverage Test Generation for Python. Proc. ACM Softw. Eng. 2025, 2, 2897–2919. [Google Scholar] [CrossRef]
  68. Jain, K.; Goues, C.L. TestForge: Feedback-Driven, Agentic Test Suite Generation. arXiv 2025, arXiv:2503.14713. [Google Scholar] [CrossRef]
  69. Gu, S.; Nashid, N.; Mesbah, A. LLM Test Generation via Iterative Hybrid Program Analysis. arXiv 2025, arXiv:2503.13580. [Google Scholar] [CrossRef]
  70. Straubinger, P.; Kreis, M.; Lukasczyk, S.; Fraser, G. Mutation Testing via Iterative Large Language Model-Driven Scientific Debugging. arXiv 2025, arXiv:2503.08182. [Google Scholar] [CrossRef]
  71. Zhang, Z.; Liu, X.; Lin, Y.; Gao, X.; Sun, H.; Yuan, Y. LLM-based Unit Test Generation via Property Retrieval. arXiv 2024, arXiv:2410.13542. [Google Scholar] [CrossRef]
  72. Zhong, Z.; Wang, S.; Wang, H.; Wen, S.; Guan, H.; Tao, Y.; Liu, Y. Advancing Bug Detection in Fastjson2 with Large Language Models Driven Unit Test Generation. arXiv 2024, arXiv:2410.09414. [Google Scholar] [CrossRef]
  73. Pan, R.; Kim, M.; Krishna, R.; Pavuluri, R.; Sinha, S. ASTER: Natural and Multi-language Unit Test Generation with LLMs. arXiv 2025, arXiv:2409.03093. [Google Scholar] [CrossRef]
  74. Gu, S.; Zhang, Q.; Li, K.; Fang, C.; Tian, F.; Zhu, L.; Zhou, J.; Chen, Z. TestART: Improving LLM-based Unit Testing via Co-evolution of Automated Generation and Repair Iteration. arXiv 2025, arXiv:2408.03095. [Google Scholar] [CrossRef]
  75. Li, K.; Yu, H.; Guo, T.; Cao, S.; Yuan, Y. CoCoEvo: Co-Evolution of Programs and Test Cases to Enhance Code Generation. arXiv 2025, arXiv:2502.10802. [Google Scholar] [CrossRef]
  76. Cheng, R.; Tufano, M.; Cito, J.; Cambronero, J.; Rondon, P.; Wei, R.; Sun, A.; Chandra, S. Agentic Bug Reproduction for Effective Automated Program Repair at Google. arXiv 2025, arXiv:2502.01821. [Google Scholar] [CrossRef]
  77. Liu, J.; Li, C.; Chen, R.; Li, S.; Gu, B.; Yang, M. STRUT: Structured Seed Case Guided Unit Test Generation for C Programs using LLMs. Proc. ACM Softw. Eng. 2025, 2, 2113–2135. [Google Scholar] [CrossRef]
  78. Alagarsamy, S.; Tantithamthavorn, C.; Aleti, A. A3Test: Assertion-Augmented Automated Test Case Generation. arXiv 2023, arXiv:2302.10352. [Google Scholar] [CrossRef]
  79. Shin, J.; Hashtroudi, S.; Hemmati, H.; Wang, S. Domain Adaptation for Code Model-Based Unit Test Case Generation. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, Vienna, Austria, 16–20 September 2024; pp. 1211–1222. [Google Scholar] [CrossRef]
  80. Rehan, S.; Al-Bander, B.; Ahmad, A.A.-S. Harnessing Large Language Models for Automated Software Testing: A Leap Towards Scalable Test Case Generation. Electronics 2025, 14, 1463. [Google Scholar] [CrossRef]
  81. He, Y.; Huang, J.; Rong, Y.; Guo, Y.; Wang, E.; Chen, H. UniTSyn: A Large-Scale Dataset Capable of Enhancing the Prowess of Large Language Models for Program Testing. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, Vienna, Austria, 16–20 September 2024; pp. 1061–1072. [Google Scholar] [CrossRef]
  82. Tufano, M.; Drain, D.; Svyatkovskiy, A.; Sundaresan, N. Generating accurate assert statements for unit test cases using pretrained transformers. In Proceedings of the 3rd ACM/IEEE International Conference on Automation of Software Test, Pittsburgh, PA, USA, 17–18 May 2022; pp. 54–64. [Google Scholar] [CrossRef]
  83. Zhang, Q.; Fang, C.; Zheng, Y.; Zhang, Y.; Zhao, Y.; Huang, R.; Zhou, J.; Yang, Y.; Zheng, T.; Chen, Z. Improving Deep Assertion Generation via Fine-Tuning Retrieval-Augmented Pre-trained Language Models. ACM Trans. Softw. Eng. Methodol. 2025, 34, 3721128. [Google Scholar] [CrossRef]
  84. Primbs, S.; Fein, B.; Fraser, G. AsserT5: Test Assertion Generation Using a Fine-Tuned Code Language Model. In Proceedings of the 2025 IEEE/ACM International Conference on Automation of Software Test (AST), Ottawa, ON, Canada, 28–29 April 2025. [Google Scholar] [CrossRef]
  85. Storhaug, A.; Li, J. Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study. arXiv 2024, arXiv:2411.02462. [Google Scholar] [CrossRef]
  86. Alagarsamy, S.; Tantithamthavorn, C.; Takerngsaksiri, W.; Arora, C.; Aleti, A. Enhancing Large Language Models for Text-to-Testcase Generation. arXiv 2025, arXiv:2402.11910. [Google Scholar] [CrossRef]
  87. Shang, Y.; Zhang, Q.; Fang, C.; Gu, S.; Zhou, J.; Chen, Z. A Large-Scale Empirical Study on Fine-Tuning Large Language Models for Unit Testing. Proc. ACM Softw. Eng. 2025, 2, 1678–1700. [Google Scholar] [CrossRef]
  88. Dakhel, A.M.; Nikanjam, A.; Majdinasab, V.; Khomh, F.; Desmarais, M.C. Effective test generation using pre-trained Large Language Models and mutation testing. Inf. Softw. Technol. 2024, 171, 107468. [Google Scholar] [CrossRef]
  89. Steenhoek, B.; Tufano, M.; Sundaresan, N.; Svyatkovskiy, A. Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation. arXiv 2025, arXiv:2310.02368. [Google Scholar] [CrossRef]
  90. Li, T.-O.; Zong, W.; Wang, Y.; Tian, H.; Cheung, S.-C. Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting. In Proceedings of the 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 11–15 September 2023; pp. 14–26. [Google Scholar] [CrossRef]
  91. Liu, K.; Chen, Z.; Liu, Y.; Zhang, J.M.; Harman, M.; Han, Y.; Ma, Y.; Dong, Y.; Li, G.; Huang, G. LLM-Powered Test Case Generation for Detecting Tricky Bugs. arXiv 2024, arXiv:2404.10304. [Google Scholar] [CrossRef]
  92. Zhang, J.; Hu, X.; Xia, X.; Cheung, S.-C.; Li, S. Automated Unit Test Generation via Chain of Thought Prompt and Reinforcement Learning from Coverage Feedback. ACM Trans. Softw. Eng. Methodol. 2025. [Google Scholar] [CrossRef]
  93. Sapozhnikov, A.; Olsthoorn, M.; Panichella, A.; Kovalenko, V.; Derakhshanfar, P. TestSpark: IntelliJ IDEA’s Ultimate Test Generation Companion. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, Lisbon, Portugal, 14–20 April 2024; pp. 30–34. [Google Scholar] [CrossRef]
  94. Li, J.; Shen, J.; Su, Y.; Lyu, M.R. LLM-assisted Mutation for Whitebox API Testing. arXiv 2025, arXiv:2504.05738. [Google Scholar] [CrossRef]
  95. Li, K.; Yuan, Y. Large Language Models as Test Case Generators: Performance Evaluation and Enhancement. arXiv 2024, arXiv:2404.13340. [Google Scholar] [CrossRef]
  96. Huang, D.; Zhang, J.M.; Luck, M.; Bu, Q.; Qing, Y.; Cui, H. AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation. arXiv 2024, arXiv:2312.13010. [Google Scholar] [CrossRef]
  97. Mündler, N.; Müller, M.N.; He, J.; Vechev, M. SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents. arXiv 2025, arXiv:2406.12952. [Google Scholar] [CrossRef]
  98. Taherkhani, H.; Hemmati, H. VALTEST: Automated Validation of Language Model Generated Test Cases. arXiv 2024, arXiv:2411.08254. [Google Scholar] [CrossRef]
  99. Lops, A.; Narducci, F.; Ragone, A.; Trizio, M.; Bartolini, C. A System for Automated Unit Test Generation Using Large Language Models and Assessment of Generated Test Suites. arXiv 2024, arXiv:2408.07846. [Google Scholar] [CrossRef]
  100. Yang, R.; Xu, X.; Wang, R. LLM-enhanced evolutionary test generation for untyped languages. Autom. Softw. Eng. 2025, 32, 20. [Google Scholar] [CrossRef]
  101. Xu, J.; Xu, J.; Chen, T.; Ma, X. Symbolic Execution with Test Cases Generated by Large Language Models. In Proceedings of the 2024 IEEE 24th International Conference on Software Quality, Reliability and Security (QRS), Cambridge, UK, 1–5 July 2024; pp. 228–237. [Google Scholar] [CrossRef]
  102. Zhang, Y.; Lu, Q.; Liu, K.; Dou, W.; Zhu, J.; Qian, L.; Zhang, C.; Lin, Z.; Wei, J. CITYWALK: Enhancing LLM-Based C++ Unit Test Generation via Project-Dependency Awareness and Language-Specific Knowledge. arXiv 2025, arXiv:2501.16155. [Google Scholar] [CrossRef]
  103. Ouédraogo, W.C.; Plein, L.; Kaboré, K.; Habib, A.; Klein, J.; Lo, D.; Bissyandé, T.F. Enriching automatic test case generation by extracting relevant test inputs from bug reports. Empir. Softw. Eng. 2025, 30, 85. [Google Scholar] [CrossRef]
  104. Xu, W.; Pei, H.; Yang, J.; Shi, Y.; Zhang, Y.; Zhao, Q. Exploring Critical Testing Scenarios for Decision-Making Policies: An LLM Approach. arXiv 2024, arXiv:2412.06684. [Google Scholar] [CrossRef]
  105. Duvvuru, V.S.A.; Zhang, B.; Vierhauser, M.; Agrawal, A. LLM-Agents Driven Automated Simulation Testing and Analysis of small Uncrewed Aerial Systems. arXiv 2025, arXiv:2501.11864. [Google Scholar] [CrossRef]
  106. Petrovic, N.; Lebioda, K.; Zolfaghari, V.; Schamschurko, A.; Kirchner, S.; Purschke, N. LLM-Driven Testing for Autonomous Driving Scenarios. In Proceedings of the 2024 2nd International Conference on Foundation and Large Language Models (FLLM), Dubai, United Arab Emirates, 26–29 November 2024; pp. 173–178. [Google Scholar] [CrossRef]
Figure 1. Collection procedure.
Figure 2. Distribution of selected studies over the years.
Figure 4. Frequency of commonly used datasets in LLM-based test case generation studies.
Figure 5. Most frequently used models within major LLM families (percentages are rounded to one decimal place).
Table 1. Overview of primary studies on prompt design and engineering.
Study | Year | Finding | Limitation
Tang, Y. [18] | 2023 | Language models such as GPT-3.5-Turbo, GPT-4, and their variants generate unit tests that are readable and syntactically valid. Some models achieve competitive mutation scores under prompt tuning. | Many tests generated by these models fail due to compilation errors, hallucinated symbols, and incorrect logic. Performance drops on arithmetic, loops, and long-context tasks. Token constraints further reduce effectiveness in large-scale or complex systems.
Yi, G. [23] | 2023 | |
Guilherme, V. [31] | 2023 | |
Siddiq, M. [32] | 2023 | |
Yang, L. [33] | 2024 | |
Chang, H. [34] | 2025 | |
Xu, J. [35] | 2025 | |
Wang, Y. [36] | 2025 | |
Elvira, T. [24] | 2023 | Providing examples, basic context, or integrating LLMs into workflows enables the generation of effective and human-readable tests. Some models support collaboration through logic explanations and bug reproduction. | The use of single-shot prompting, small datasets, or simplified workflows limits real-world applicability. Incorrect or incomplete assertions often require manual adjustments. Generalization to more complex systems remains limited.
Li, V. [29] | 2024 | |
Zhang, Y. [30] | 2023 | |
Bayri, V. [37] | 2023 | |
Plein, L. [38] | 2023 | |
Koziolek, H. [39] | 2024 | |
Yin, H. [40] | 2024 | Structured prompting techniques, such as few-shot chaining and diversity-guided optimization, boost test quality and coverage. Enhanced performance is observed when prompt inputs are rich in context or aligned with domain-specific expectations. | Test generation quality varies depending on model size and context window. Studies often assume accurate focal methods or omit dynamic refinement steps. Manual evaluation introduces subjectivity and reduces scalability.
Rao, N. [41] | 2024 | |
Zhang, Q. [42] | 2024 | |
Jiri, M. [43] | 2024 | |
Ryan, G. [44] | 2024 | |
Gao, S. [45] | 2025 | |
Wang, W. [46] | 2024 | Language models can produce highly executable and diverse tests, but face limitations in reasoning over code paths and logic. | Evaluations are based on Python LeetCode problems, which may not reflect the complexity or diversity of real-world systems.
Ouédraogo, W. [47] | 2024 | Tree-of-Thought and Chain-of-Thought strategies enhance the readability and maintainability of test cases while increasing partial coverage. | Despite syntactic correctness, most tests fail to compile successfully. Low compilability rates and test smells persist across generated outputs.
Khelladi, D. [48] | 2025 | By combining cross-language input and high-temperature sampling, LLM-generated test cases can be unified into comprehensive test suites without execution-based feedback. | Experiments focus on self-contained problems. Generalization is difficult for codebases with interdependent inputs or methods.
Sharma, R. [49] | 2025 | Extracting concrete input and output specifications from prompts enables more effective detection of specification-violating outputs across multiple models. | The method does not support multi-turn interactions or structured prompts, limiting its application in more complex testing workflows.
Li, Y. [50] | 2025 | Prompting LLMs with follow-up questions enhances bug localization by triggering self-correction and uncovering previously missed errors. This iterative engagement mirrors human review processes and increases detection beyond initial outputs. | Hallucinations occur across test generation, error tracing, and bug localization, including non-executable outputs and incorrect diagnoses. Performance varies widely across models, with some (e.g., New Bing) failing to detect any bugs. Low overall test quality and model inconsistency limit standalone applicability in real-world testing workflows.
Godage, T. [51] | 2025 | Claude 3.5 achieved the highest performance with a 93.33% success rate, 98.01% statement coverage, and an 89.23% mutation score, outperforming GPT-4o and others. | Results are limited to JavaScript, and generalizability to other languages remains untested.
Chowdhury, S. [52] | 2025 | Static analysis-guided prompts enabled LLaMA 7B to generate tests for 99% of 103 methods, a 175% improvement. Similar gains were observed for LLaMA 70B and CodeLLaMA 34B. | The study only evaluated whether a test was generated, without assessing its correctness, coverage, or runtime behavior. It also lacks dynamic context, which limits test accuracy.
Table 2. Overview of primary studies on feedback-driven approaches.
Study | Year | Finding | Limitation
Kang, S. [53] | 2022 | Incorporating natural language prompts, issue reports, or scenario understanding enables LLMs to generate test cases tailored to real-world failures or UI contexts. | Effectiveness declines in scenarios involving external resources, project-specific configurations, or undocumented dependencies, often necessitating manual intervention.
Yu, S. [55] | 2023 | |
Nashid, N. [56] | 2025 | |
Chen, Y. [19] | 2023 | Breaking down test generation into structured subtasks such as focal context handling, intention inference, oracle refinement, and type correction leads to improved test quality, correctness, and coverage. | Multi-stage generation strategies encounter scalability issues due to LLM token limits, dependence on curated demonstration examples, and difficulties in retrieving relevant context within large codebases.
Yuan, Z. [26] | 2023 | |
Chen, M. [57] | 2024 | |
Ni, C. [58] | 2024 | |
Liu, R. [59] | 2025 | |
Bhatia, S. [60] | 2023 | Iterative prompting and feedback loops significantly improve test coverage and assertion quality, often rivaling or exceeding the quality of traditional tools. | Performance is constrained when dealing with ambiguous documentation, less modular code, or limited and fixed code examples that reduce generalizability.
Schäfer, M. [61] | 2023 | |
Kumar, N. [62] | 2024 | |
Wang, Z. [63] | 2024 | |
Lahiri, S. [54] | 2022 | Execution feedback, control flow analysis, and hypothesis-driven iteration guide LLMs to generate more accurate, high-coverage, and fault-detecting tests. | Limitations include invalid test cases, misclassified outputs, and reduced performance on complex features such as inheritance, behavioral equivalence, or low-confidence assertions.
Etemadi, K. [64] | 2024 | |
Yang, C. [65] | 2024 | |
Alshahwan, N. [66] | 2024 | |
Pizzorno, J. [67] | 2025 | |
Jain, K. [68] | 2025 | |
Gu, S. [69] | 2025 | |
Straubinger, P. [70] | 2025 | |
Chen, B. [25] | 2022 | Combining LLM-based generation with dual execution strategies, property-based retrieval, or static analysis improves correctness and adaptability across applications. | Large-scale executions, dependency on existing test-rich repositories, and the high cost of repeated LLM invocations pose significant barriers to scaling and full automation.
Zhang, Z. [71] | 2024 | |
Zhong, Z. [72] | 2024 | |
Pan, R. [73] | 2025 | |
Gu, S. [74] | 2025 | Employing evolutionary algorithms, template-guided repair, and task-planning strategies powered by LLMs improves the reliability and executability of generated test cases across both open-source and industrial software systems. | Generalizability remains limited due to dataset constraints, potential accuracy decline in long evolution loops, and dependence on high-quality training signals.
Li, K. [75] | 2025 | |
Cheng, R. [76] | 2025 | |
Liu, J. [77] | 2025 | STRUT improves test generation for C programs by using structured seed-case prompts with feedback optimization. It achieves a 96.01% test execution pass rate, 51.83% oracle correctness, 77.67% line coverage, and 63.60% branch coverage. Compared to GPT-4o, it improves line coverage by 37.34%, branch coverage by 33.02%, test pass rate by 42.81%, and oracle correctness by 32.90%. | STRUT struggles with complex conditional branches and suffers from test redundancy. Despite generating 21% more test cases than SunwiseAUnit, it offers only ~6% coverage improvement, indicating inefficiencies in pruning duplicates and handling nuanced logic.
Table 3. Overview of primary studies on model fine-tuning and pre-training.
Study | Year | Finding | Limitation
Tufano, M. [28] | 2021 | Fine-tuning or adapting LLMs with project-specific or focal context information significantly improves test case quality, including accuracy, readability, and coverage. For example, Rehan et al. [80] fine-tuned LLaMA-2 using QLoRA and achieved an F1 score of 33.95%, with human validation confirming clarity and edge case handling. | Reliance on contextual signals or heuristics such as focal function matching or the availability of developer-written tests can undermine output reliability when these elements are missing, incomplete, or inconsistent across projects.
Shin, J. [79] | 2023 | |
He, Y. [81] | 2024 | |
Rehan, S. [80] | 2025 | |
Tufano, M. [82] | 2022 | Pre-training followed by fine-tuning, particularly for assertion generation, improves syntactic and semantic quality of test cases and assertions. | Common challenges include limited generalizability due to restricted datasets, difficulties in reproducing results, and integration issues that arise from incomplete context, missing helper functions, or truncated outputs.
Zhang, Q. [83] | 2025 | |
Primbs, S. [84] | 2025 | |
Rao, N. [20] | 2023 | Incorporating domain knowledge or verification strategies into fine-tuned models yields more readable, accurate, and coverage-enhancing test cases. | Limitations include redundancy in generated outputs or compatibility issues due to deprecated APIs or private access restrictions.
Alagarsamy, S. [78] | 2023 | |
Storhaug, A. [85] | 2024 | Parameter-efficient or large-scale fine-tuning with effective prompts can match or surpass full fine-tuning in generating high-quality unit tests with reduced cost. | Limitations stem from high variance in model performance and underexplored tuning configurations, which may affect consistency and full potential.
Alagarsamy, S. [86] | 2025 | |
Shang, Y. [87] | 2025 | Fine-tuned LLMs outperformed traditional tools across tasks. DeepSeek-Coder-6b achieved a 107.77% improvement over AthenaTest (33.68% vs. 16.21%) in test generation. CodeT5p-220m reached a CodeBLEU score of 88.25%, outperforming ATLAS (63.60%). | Bug detection remains limited. DeepSeek-Coder-6b found only 8/163 bugs with 0.74% precision. Over 90% of test failures were build errors, showing weak alignment between syntactic and semantic correctness.
Table 4. Overview of primary studies on hybrid approaches.
Study | Year | Finding | Limitation
Dakhel, A. [88] | 2023 | Improving fault-detection and test robustness through reinforcement learning, mutation-based prompting, and differential testing strategies that target semantic bugs and promote behavioral diversity. | Effectiveness relies on subtle indicators such as surviving mutants or inferred program intentions, but these signals can become unreliable in complex or noisy codebases, leading to redundant test cases or false positives.
Steenhoek, B. [89] | 2023 | |
Li, T. [90] | 2023 | |
Liu, K. [91] | 2024 | |
Zhang, J. [92] | 2025 | |
Lemieux, C. [21] | 2023 | When integrated into search-based or hybrid testing workflows such as those used within development environments or automated service testing, LLMs enhance usability and increase test coverage by enabling guided exploration alongside feedback from code execution. | Key challenges include slow response times during LLM queries, limited large-scale empirical validation, and restricted applicability to environments such as open-source systems or narrowly scoped testing domains.
Sapozhnikov, A. [93] | 2025 | |
Li, J. [94] | 2024 | |
Li, K. [95] | 2024 | Employing multi-agent systems or execution-aware test generation pipelines significantly improves test accuracy, reliability, and fail-to-pass effectiveness by structuring test and code generation workflows. | The dependency on powerful language models and customized toolchains restricts the adaptability of these approaches across different model architectures, programming languages, and real-world software development settings.
Huang, D. [96] | 2025 | |
Mündler, N. [97] | 2025 | |
Taherkhani, H. [98] | 2024 | Techniques such as chain-of-thought prompting, input repair, and guided mutation enhance test executability and coverage in zero-shot or low-supervision settings. | Syntactic inconsistencies, ambiguous or incomplete documentation, and unstructured datasets continue to result in invalid test inputs, ultimately undermining test coverage and the effectiveness of automated testing pipelines.
Lops, A. [99] | 2025 | |
Yang, R. [100] | 2024 | |
Xu, J. [101] | 2024 | Incorporating symbolic execution, context awareness across multiple files, and specialized domain knowledge enables LLMs to generate high-quality unit tests in complex ecosystems such as formal verification tools or statically typed languages. | Limitations stem from scalability, path explosion, and difficulty modeling sophisticated code features or environmental interactions accurately.
Zhang, Y. [102] | 2025 | |
Ouédraogo, W. [103] | 2025 | Mining concrete inputs from bug reports before seeding EvoSuite or Randoop boosts both effectiveness and efficiency: BRMiner exposes 58 extra bugs across 13 Defects4J projects (24 beyond default EvoSuite) and raises branch/instruction coverage by ≈13 pp/12 pp while producing about 21% fewer test cases. | Effectiveness relies on high-quality, version-aligned bug reports; stale or noisy reports can seed irrelevant inputs, and the GPT-3.5-turbo filtering step lacks semantic validation, occasionally admitting plausible yet ineffective test data.
