Article

Automated Test Generation Using Large Language Models

1 GenerativeAI Academic Research Team (GART), Capgemini Insights & Data, 54-202 Wroclaw, Poland
2 Faculty of Applied Mathematics, Silesian University of Technology, 44-100 Gliwice, Poland
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Data 2025, 10(10), 156; https://doi.org/10.3390/data10100156
Submission received: 7 April 2025 / Revised: 15 July 2025 / Accepted: 11 September 2025 / Published: 30 September 2025

Abstract

This study explores the potential of generative AI, specifically Large Language Models (LLMs), in automating unit test generation in Python 3.13. We analyze tests, both those created by programmers and those generated by LLM models, for fifty source code cases. Our main focus is on how the choice of model, the difficulty of the source code, and the prompting strategy influence the quality of the generated tests. The results show that AI models can help automate test creation for simple code, but their effectiveness decreases for more complex tasks. We introduce an embedding-based similarity analysis to assess how closely AI-generated tests resemble human-written ones, revealing that AI outputs often lack semantic diversity. The study also highlights the potential of AI models for rapid test prototyping, which can significantly speed up the software development cycle. However, further customization and training of the models on specific use cases is needed to achieve greater precision. Our findings provide practical insights into integrating LLMs into software testing workflows and emphasize the importance of prompt design and model selection.

1. Introduction

In an era of rapid development of artificial intelligence (AI), one of the most promising tools is Large Language Models (LLMs) [1]—advanced generative AI models with billions of parameters, pre-trained on vast amounts of text data. They have recently appeared as a breakthrough technology in natural language processing (NLP), and have outperformed traditional approaches in various applications, like text generation, information extraction and document summarization.
Meanwhile, code testing is a key part of the programming process, aimed at ensuring the correctness and reliability of software [2]. However, despite its immense value, many programmers are reluctant to undertake this task. Testing provides concrete, measurable results that can be analyzed and evaluated. Unlike code generation, which can be difficult to understand and evaluate without very thorough testing, test generation offers objective results that can be interpreted directly. One of the most basic and common forms of code test is a unit test, which focuses on verifying the accuracy of a small block of code, like a function or class method, in isolation from the rest of the application [3]. Writing these tests is one of the main responsibilities of programmers, but it is a repetitive and time-consuming process. Therefore, developers have long been working on automating them, including using classic state-of-the-art approaches like the search-based technique [4].
In recent years, LLMs have undergone remarkable progress in many text generation applications, such as summarization, open-ended text generation, and chatbot agents [5]. Most importantly for us, these models have the capacity to generate code, including unit tests [6]. Despite the crucial role of testing, relatively few projects have had the time and resources to conduct a proper and extensive empirical study of real-world uses of LLMs for automatic test generation. It is often seen as a tedious and time-consuming task that does not directly produce visible results in the form of new functionality. According to Wang et al.’s study [7], topics regarding generating test inputs or code repairing are more popular in terms of the number of articles released. Regardless of this current trend, we believe that unit test generation could be one of the most significant aids for programmers. Introducing automation into the test-writing process using AI could significantly relieve developers of this tedious duty, allowing them to focus on more creative aspects of programming.
Recent research in this area has mainly been focused on developing various sophisticated frameworks, like a combination of a code coverage analyzer and an LLM [8], or processing an LLM model’s instructions with additional context analysis [9]. It may be challenging to implement these in practice due to their complexity. Thus, in the case of real-life scenarios, the simplest solution might be to generate tests exclusively through using an LLM with proper instructions.
What remains unclear in the current literature is how the specific combination of model type, task complexity and prompt strategy affects the success of automated test generation. Our study addresses this by providing a systematic, quantitative evaluation of test generation across different Python source code cases, categorized by difficulty. We examine both their functional correctness (via execution and code coverage) and their semantic similarity to human-written tests (via embedding metrics).
In doing so, we aim to clarify not only whether LLMs can generate unit tests, but under what conditions they perform well—and where they fall short. Moreover, our study opens up directions for incorporating human feedback into the LLM-based test generation process. As recent work has shown, crowd-sourced annotations can significantly improve the alignment of LLM outputs with human expectations and coding standards, either through reinforcement learning with human feedback (RLHF) frameworks [10] or by using generated tests to evaluate and rank multiple candidate solutions [11]. These strategies can further increase the reliability and correctness of automatically generated test code, especially for complex codebases. Our findings are intended to guide practical integration of LLMs into developer workflows, especially in the early stages of test prototyping.

2. Materials and Methods

2.1. Dataset

This study focused on taking advantage of the advanced capabilities of LLMs, which perform well in generating and processing code, especially when the code is well-structured and not overly complex. While these models show high efficiency in handling more accessible tasks, they tend to encounter challenges with more complex or multi-layered code [12]. By focusing on codes which align with the strengths of currently available LLMs, this approach maximizes their potential in test code generation.
To start exploring the possibilities of LLMs in writing code tests, it was necessary to first create a validation collection containing code fragments with corresponding tests. To achieve this, repository sites (mostly GitHub) were searched for projects written using the Test-Driven-Development (TDD) methodology [13] in Python. Unfortunately, many projects, even very well-known ones, had inadequate tests; hence, it became necessary to create the dataset from scratch. The source and test codes from the repositories were first selected based on usability (relatively simpler codes were selected, and codes which required extra files, such as databases or images, were not included). Next, the selected codes were qualitatively analyzed by our team members for functionality and quality, and finally adjusted or re-written as necessary to ensure they were of high quality. For example, if a test was written using plain assert statements and did not contain an additional function ensuring the test would be executed, such a function had to be added manually to the test code script.
It is suspected that the source and test codes found on the repository sites were not of the best quality due to the authors being inexperienced in programming (most of the projects found were created for learning purposes rather than commercial or research purposes), and other projects were not finished by their authors. This does not reflect true human abilities in writing test codes, with experienced programmers designing test codes which are both executable and of high quality; for this reason we decided to manually adjust the source and test codes, which resulted in our final dataset of 50 pairs of source and test codes. The database collection scheme is visible in Figure 1.
To maximize the potential of LLMs in test code generation, this study initially focused on well-structured and straightforward source codes. However, simpler codes do not always reflect the complexity of code that programmers encounter in real-world scenarios. To gain a more accurate understanding of LLM performance in simpler versus more challenging codebases, the source codes were categorized based on the difficulty of generating unit tests.
Having extensively searched through various repositories, it was only possible to extract 46 initial code cases. After revision, 2 outliers were identified for which test-writing was not feasible, and the 44 remaining cases were analyzed for difficulty. All of them proved to be rather simple, and to ensure a more comprehensive dataset, our team decided to collect harder codes.
Not having found any harder codes in repositories, our research team, consisting of data scientists rather than professional programmers or testers, utilized GPT4 in conjunction with human analytical expertise to manually obtain more complex codes, specifically 5 hard codes and 1 medium code.
The source codes were then divided into four categories based on the difficulty of writing corresponding unit tests:
  • Hard: These codes require a deep understanding of complex mathematical concepts, such as the ability to design and test fractal dimensionality, along with advanced implementation skills in using Python libraries tailored for complex computations and visualizations.
  • Medium: These codes demand domain-specific knowledge to create appropriate test cases. For example, testing the logic of a chess game requires understanding the game’s rules. Additionally, these codes often involve mocking certain parts of the logic, like I/O operations, to ensure unit tests run in isolation, following best practices [14].
  • Easy and Very Easy: The remaining codes were classified based on their use of programming concepts like abstraction, inheritance or design patterns.
After this categorization, the final dataset included 20 very easy, 16 easy, 9 medium and 5 hard codes. This broader range of difficulties provided a better understanding of how LLMs perform across different levels of test generation challenges.
The dataset contains pairs of source codes and human-written tests. Codes are the basis for test generation—the source codes are mainly codes of simple functions with simple logic, e.g., Fibonacci series code. However, we also included some more complex code written in object-oriented programming (OOP), such as chess logic or client–server architecture codes. Each source code was manually tested and corrected as needed to ensure the greatest possible chance for LLMs to generate executable tests based on the source code.
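For illustration, a typical "very easy" pair resembles the following sketch (the function and test names are hypothetical, not taken from the dataset, and both halves are shown in one file for brevity; in the dataset, the tests live in a separate file and import the code under test as the 'source_code' module). The exception test uses the plain-assert style, so no testing library is required:

```python
# Hypothetical "very easy" dataset pair: source function plus paired tests.

def fibonacci(n: int) -> int:
    """Return the n-th Fibonacci number (0-indexed)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# Paired human-written tests in the plain-assert style.
def test_base_cases():
    assert fibonacci(0) == 0
    assert fibonacci(1) == 1

def test_known_value():
    assert fibonacci(10) == 55

def test_negative_input_raises():
    try:
        fibonacci(-1)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for negative input")

if __name__ == "__main__":
    # Script execution clause so the tests run without a test runner.
    test_base_cases()
    test_known_value()
    test_negative_input_raises()
```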
The source codes are structured in the following format:
  • Imports of libraries which are used in the program.
  • Functions or classes: the core logic of the program. Each is defined with the clear purpose of performing a specific task.
The dataset includes human-made test cases written with two different testing methods to reflect real-world diversity in software development. The two main approaches are pytest and unittest. The test codes are structured in the following format:
  • Imports of
    • The libraries (such as unittest or pytest) required for testing;
    • The libraries which are used in the test;
    • The source code.
  • Test cases:
    • In the case of using the unittest library:
      Style: Formal and object-oriented.
      Structure: Tests are organized into classes that inherit from
      unittest.TestCase. Each method within the class corresponds to a specific use case.
      Assertions: Use of assertion methods such as
      self.assertEqual, self.assertTrue, etc.
      Advantages: Well-organized and grouped tests, suitable for complex test suites, with built-in support for test discovery and reporting.
      Disadvantages: More verbose and requires additional boilerplate.
    • In the case of using the pytest library or plain assert statements:
      Style: Concise and declarative.
      Structure: Tests are written as standalone functions, each beginning with the prefix test_.
      Assertions: Use of Python’s built-in assert statement for validating expected outcomes.
      Advantages: Minimal boilerplate, easy to read and write.
      Disadvantages: Less structured and lacks the formality required in large-scale projects.
  • Script execution clause.
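To make the two styles concrete, the sketch below tests the same hypothetical add() function both ways; only the standard library's unittest module is needed to run it:

```python
import unittest

def add(a, b):
    """Hypothetical stand-in for a dataset source code."""
    return a + b

# unittest style: formal, class-based, with dedicated assertion methods.
class TestAdd(unittest.TestCase):
    def test_add_positive(self):
        self.assertEqual(add(2, 3), 5)

    def test_add_zero(self):
        self.assertTrue(add(-1, 1) == 0)

# pytest / plain-assert style: standalone test_ functions with bare asserts.
def test_add_positive():
    assert add(2, 3) == 5

if __name__ == "__main__":
    unittest.main(exit=False)  # script execution clause
```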

2.2. Models

In recent years we have observed dynamic growth in the development of Large Language Models [1]. Such a rapid release of new models has resulted in a huge variety of them. Our research team had the possibility of choosing between open- or closed-source, foundation or fine-tuned models for the research tasks. Due to time constraints and the relatively small number of members of the team, we decided to use models hosted by an external provider with API access.
This decision narrowed down the number of potential models, excluding practically all open-source models fine-tuned for code generation, like CodeLlama or StarCoder. Our team chose the Amazon Bedrock service among the available cloud services because it offers a significant number of state-of-the-art foundation models. Of the available LLMs, we ultimately used the following:
  • Llama3 70B: According to the authors [15], this LLM achieves state-of-the-art results on commonly used benchmarks, including code generation. It has the advantage of being open-source, thus providing transparency in terms of how it works, and it can potentially be hosted on premise.
  • Mistral Large: Mistral models, similarly to Llama models, are open-source, and their license allows for commercial usage. The authors state that these models can achieve state-of-the-art results, including the smallest available model, whose performance is comparable to that of larger models [16]. In practical terms, the smallest model has only 7 billion parameters, which may be more cost-effective in large-scale projects. According to external research [17], Mistral 7B is also almost two-thirds less costly on Bedrock than the competing Llama 8B, allowing for greater accessibility.
  • Claude 3 Sonnet: According to the authors [18], this LLM is capable of achieving state-of-the-art results; in the benchmarks of the authors, it is stated that it outperforms one of the most famous LLMs—GPT4.
Our team decided only to use the largest available versions of the models to establish a baseline. The rationale for this is that if the most advanced and complex models demonstrate limitations in generating automatic unit tests, it is likely that smaller and less sophisticated models would encounter similar or greater performance challenges. Moreover, the primary goal of this study is to assess the capacity for test automation, not to find the most effective model; however, from a practical standpoint, it is worth considering only models from families which can potentially be used in production environments. Lastly, due to time constraints and limited resources, it was decided to postpone testing GitHub Copilot, one of the most popular GPT4-powered tools for generating unit tests.
The time constraint also limited our team to using the models in zero-shot scenarios, i.e., without any fine-tuning on the prepared dataset, so the models used for the study only use their general knowledge. Nonetheless, it was possible to influence their final outputs through preparing relevant instructions for the models (prompts).

2.3. Experiments

2.3.1. Code Evaluation

This section details the methodology employed to score the performance of GenAI models in generating test code alongside human-written test code using two metrics: code coverage and execution failure.
  • Code coverage—This is a quantitative measure of how much of the source code was executed when the test code was run. It provides insight into the effectiveness of the test code in executing different parts of the code. It is a widely used metric in software engineering research [19] as well as in research regarding the effectiveness of LLMs in generating unit tests [7]. In this experiment, both the GenAI-generated and the human-written test codes were instrumented using Coverage.py in conjunction with pytest in Python. The generated and written tests were checked against the same source code, and the coverage tool generated reports detailing the percentage of source code lines that were covered. The coverage percentages were then further analyzed with statistics (mean, median, and standard deviation) for each prompting method as well as each GenAI model/human author and difficulty.
  • Execution Failure—In addition to code coverage, the number of failures was counted. This metric reflects the number of test cases that fail to execute due to errors in the test code itself. It has been used in research to evaluate the performance of LLMs [20]. Execution failure was measured using the same process as the coverage measure described above. If the coverage measure did not return a result for any test case, or if the code coverage was 0, this was due to errors in the test code, and the test case for the specific model was labeled as having exhibited execution failure.
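The labeling rule and summary statistics described above can be sketched as follows. The function name and input format (a mapping from test case to the coverage percentage reported by Coverage.py, or None when no report was produced) are our own assumptions for illustration, not the study's actual tooling:

```python
from statistics import mean, median, stdev

def score_results(coverage_by_case):
    """Label execution failures and summarize coverage percentages.

    A case counts as an execution failure when the coverage run returned
    no result (None) or 0% coverage, mirroring the rule described above.
    Statistics are computed over the cases that executed successfully.
    """
    failures = [case for case, cov in coverage_by_case.items()
                if cov is None or cov == 0]
    executed = [cov for cov in coverage_by_case.values()
                if cov is not None and cov > 0]
    return {
        "failure_rate": len(failures) / len(coverage_by_case),
        "mean": mean(executed),
        "median": median(executed),
        "stdev": stdev(executed) if len(executed) > 1 else 0.0,
        "failed_cases": failures,
    }
```

For example, scoring {"c1": 100, "c2": 80, "c3": 0, "c4": None} yields a 50% failure rate, with mean and median coverage of 90% over the two executable cases.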

2.3.2. Embedding Similarity

To measure the similarities between the models’ and human authors’ test codes, the codes were first transformed using the Claude 3 model, which was instructed to rewrite each test code line-by-line into plain text, renaming specific elements of the code with specific semantic categories of words, as shown in Figure 2. For example, import statements were to be renamed as fruits, variables as items of clothing, etc. The aim of this transformation was to help generalize the test codes so that they would be more easily comparable, while still retaining important information about the structure of the code. These transformed codes were then converted into embeddings using Amazon Titan Embeddings. The conversion of language text into numerical representations, known as embeddings, is a practice widely used for tasks within language processing and is often applied to pieces of code [21]. Finally, the embeddings could be compared using three measures: Euclidean distance, cosine similarity and dot product.
  • Euclidean distance—This measures the straight-line difference between two points in the multi-dimensional space of embeddings. It captures how far the embeddings are from each other—the further away they are, the more different they are. Euclidean distance is used as a basic metric for measuring text embedding similarity, among other uses [22].
  • Cosine similarity—To support the similarity measure, cosine similarity was also calculated between the embeddings. It measures the cosine of the angle between two embedding vectors and is one of the most commonly used metrics to compare text similarity based on semantic embeddings [23]. It provides insight into how similar the directions of the embeddings are. Cosine similarity ranges from −1 to 1, where 1 means identical directions and −1 opposite directions. As such, higher cosine similarity values suggest the compared embeddings are more similar.
  • Dot product—In addition to cosine similarity, a similar measure is used—the dot product. This measures the magnitude of embedding overlap. It is similar to cosine similarity, but also considers the length of vectors [24]. A higher dot product suggests the two compared embeddings are more similar.
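All three measures can be computed directly from the embedding vectors; the pure-Python sketch below illustrates their definitions (in the study, the embeddings come from Amazon Titan and have far more dimensions than these toy vectors):

```python
import math

def dot_product(u, v):
    """Unnormalized overlap of two embeddings; higher = more similar."""
    return sum(a * b for a, b in zip(u, v))

def euclidean_distance(u, v):
    """Straight-line distance between embeddings; larger = more different."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between the vectors, in [-1, 1]."""
    norm_u = math.sqrt(dot_product(u, u))
    norm_v = math.sqrt(dot_product(v, v))
    return dot_product(u, v) / (norm_u * norm_v)
```

Orthogonal vectors such as (1, 0) and (0, 1) give cosine similarity 0, while parallel vectors give cosine similarity 1 regardless of their length; the dot product, by contrast, also reflects vector magnitude.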

2.4. Prompts

For the LLMs chosen, the task of generating unit tests was described in four different instruction types (prompts), each corresponding to a different prompting approach commonly found in the literature [25]. The first attempt, shown in Listing 1, is “baseline”, because it includes the simplest required part: instructions to generate test code for a given source code.
Listing 1. The baseline prompt.
You are a programming assistant. Your task is to provide a unit test code snippet to the source code below. Please import the source code as ’source_code’ module in the code snippet.
{source code}
The baseline prompt also uses a role assignment technique. According to current research [26], describing the role of models within the prompt helps to achieve the desired style and tones, or more importantly, direct them towards the given task. The team decided to assign the models the role of programming assistant, anticipating that they would be more effective in code generation. It was observed in previous experiments that the generated tests experienced issues with importing the source code. Models were unable to infer the proper name of the source module: source code imports were not included or placeholder imports were added. An example of this misbehavior is shown in Listing 2.
Listing 2. Example of issue with importing the source code.
from your_module import x, y, z # Replace with the actual module name
To solve this problem, the models were supported with direct instructions on how to properly import the source code (“Please import the source code as ’source_code’ module in the code snippet.”). After adding this clause, all models were observed to correctly include the source code imports. This fix’s drawback is that the source code must be placed in a module with the same name.
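Assembling such a prompt is a simple templating step. The helper below is our own illustrative sketch (the wording is taken from Listing 1; the constant and function names are hypothetical), not code from the study:

```python
# Baseline prompt template, including the import-fix clause from Listing 1.
BASELINE_TEMPLATE = (
    "You are a programming assistant. Your task is to provide a unit test "
    "code snippet to the source code below. Please import the source code "
    "as 'source_code' module in the code snippet.\n{source_code}"
)

def build_baseline_prompt(source_code: str) -> str:
    """Fill the template with the source code under test (hypothetical helper)."""
    return BASELINE_TEMPLATE.format(source_code=source_code)
```

Note that the import fix only works if the file containing the code under test is actually named source_code.py.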
The next attempt (Listing 3) was focused on providing the model with more specific instructions. In addition to the baseline approach, we added some tips on how to write good-quality unit tests based on best practices.
Listing 3. The prompt with rich instructions.
Act as a Python software test engineer. Your task is to provide a unit test code snippet to the source code below. Please import the source code as ’source_code’ module in the code snippet. Each test must have the following parts:
(1) Arrange: create and set necessary objects / data for the test
(2) Act: call the tested method from source code and get the actual value
(3) Assert: check the expected and actual values.
Don’t forget to test edge cases and all possible exceptions which can be raised.
<source_code>
{source_code}
</source_code>
The main improvements were the addition of the instruction to use an ’arrange, act, assert’ pattern and the reminder about edge cases and exception testing. The model’s role was changed to a Python software test engineer, which is more appropriate in a situation where we are only testing Python code.
The next experiment included rich instructions and an example of a unit test. This approach is called few-shot prompting [27]. The prompt template is shown in Listing 4; the code is replaced with an example mark. The example contains source code with a simple function which returns the result of dividing the first argument by the second. Then, the reference unit tests are placed, containing the “happy path” (the correct arguments) and the edge cases: the second argument is 0, and the arguments have an incorrect type. The example was anticipated to help the models improve their testing of edge cases by showing how to select the problematic ones.
Listing 4. The few-shot prompt.
Act as a Python software test engineer. Your task is to provide a unit test code snippet to the source code below. You also get an example of a unit test. Please import the source code as ’source_code’ module in the code snippet. Each test must have the following parts:
(1) Arrange: create and set necessary objects / data for the test.
(2) Act: call the tested method from source code and get the actual value.
(3) Assert: check expected and actual values. Don’t forget to test edge cases and all possible exceptions which can be raised.
<example>
{example}
</example>
<source_code>
{source_code}
</source_code>
Listing 5 shows the last tested prompting approach, called chain-of-thought [28]; the code snippets were replaced with { } symbols (see the full prompt in the repository). The primary objective was to help with the complex reasoning required of the model by supporting it with intermediate reasoning steps. The instruction begins with assigning a role to the model and providing an identical source code example to that in the few-shot prompting experiment. Then, the model is taught how to correctly include the source code and write the unit test while following best practices. The model is also given advice on how to recognize edge cases. Finally, the instructions are supported with an example of a unit test written in line with the given guidelines. As in the previous experiment, the goal was to improve the quality of the generated unit tests by choosing the correct edge cases to test for.
Listing 5. The chain-of-thought prompt.
Act as a Python software test engineer. Your task is to provide the unit test code snippet to the source code below:
<source_code>
{example of the source code}
</source_code>
The source code is located in the separate file, so you must remember to import the source code as ’source_code’ module in the code snippet.
First you must check if the tested function works correctly in the following way:
(1) Arrange: create and set necessary objects / data for the test.
(2) Act: call the tested method from source code and get the actual value.
(3) Assert: check expected and actual values.
Then you must also check the edge cases and all possible exceptions which can be raised. Edge cases depend on the tested source code logic. In this case, it’s wise to check the wrong type of the argument and if the written exception is raised correctly. Other cases worth checking are when the input values are outside the range, a null value or an empty list/dict.
To sum up, the final unit test code snippet should look like this:
<example>
{example of the unit tests}
</example>
Your task is to provide a unit test code snippet to the source code below:
<source_code>
{source code}
</source_code>

3. Results

3.1. Generative AI and Human-Written Code Evaluation

This section presents the findings from the evaluation of the effectiveness of Generative AI models in generating test codes. The chosen GenAI models (Claude, Llama and Mistral) were evaluated based on the number of execution failures and on their code coverage. Two outlier test cases, for which humans and LLMs alike provided test code that was not executable or was executable with minimal code coverage, were removed from the dataset. The final dataset consisted of fifty test case samples. The percentage of execution failures, as well as the mean, median and standard deviation of source code covered by the test code, are reported for the baseline, rich-instruction, example and chain-of-thought prompting methods. The results can be found in Table 1, Table 2, Table 3 and Table 4.
Table 1 summarizes the results of the execution failure and code coverage metrics for the human-written and baseline-prompted LLM models’ test codes. Neither the human nor the Claude test codes experienced execution failures, while Llama’s and Mistral’s test codes did, initially suggesting that the Claude model with baseline prompting is a high-performing model. In terms of code coverage, with baseline prompting, the results show evidence that the Claude and Llama models performed better than the Mistral model. Nonetheless, none of the models seem to have generated test codes which would perform better on the source code than human-written codes, for any of the code categories, regardless of difficulty.
Table 2 compares the execution failure and code coverage results of the human-written and rich-instruction-prompted LLM models’ generated test codes. With this type of prompting, in terms of execution failure, both the Claude and Llama models performed as well as human programmers, with no execution failures, while some of the Mistral-generated tests did not execute. Rich-instruction prompting resulted in the Llama model performing best overall among the three models in terms of code coverage percentage, followed by Mistral and Claude. However, none of them performed better than programmers. As demonstrated by our results, while Claude appeared to have performed best with baseline prompting, it performed much more poorly with rich-instruction prompting. On the other hand, the Mistral model seems to have benefited from having been given rich instructions in its prompt, at least in terms of code coverage.
Table 3 reports the results of execution failure and code coverage for human and example-prompted LLM models’ test codes. Among the models, Mistral achieved the lowest execution failure, equal to that of humans, of 0%, while Llama and Claude experienced some failures. While Claude performed excellently in its generation of executable tests with baseline and rich instructions, when additionally given an example in the prompt, some of the tests it generated were not executable. When it comes to code coverage, it appears that the Claude and Llama models performed better than Mistral; however, yet again, they did not perform better than human code.
Table 4 compares the execution failure and code coverage rates of the human-written and chain-of-thought-prompted LLM models’ generated test codes. It appears that all LLM models experienced difficulties with generating executable code with this type of prompting, with Claude, Mistral and Llama having execution failures for medium-complexity code. Notwithstanding its high execution failure rate, Claude was able to generate high-coverage test codes in the very easy- and easy-complexity categories. The two other models, although also having high execution failure rates for medium code, did not perform as well in the easy or hard codes. As with the previous prompting methods, the models with chain-of-thought prompting did not perform better than programmers either.
As reported, the proportion of execution failures across AI models, prompting methods and difficulties ranged from 0% to 44% of test cases. These failures may have resulted from incorrect imports, or other errors within the generated code that would have prevented the code from being executed on the source code. The general assumption of test codes is that they will execute successfully; therefore, any values above 0 are considered to be high. As such, the results of this analysis indicate that the tests designed by the LLMs were relatively ineffective.
Good-quality tests should test 100% of the source code, to account for all possible test cases, including the extreme ones. Although neither the human nor the LLM code coverage reached 100% for all difficulty levels, it is evident that human-written tests were much closer to being considered good-quality tests than those written by LLMs. The median of model code coverage is lower than the human median for nearly all combinations of prompting method and difficulty level. One notable point is the discrepancy between the mean and median values—this was largely caused by a few individual source codes being covered by only a small percentage of either the human or AI code. Given the relatively small sample size, these cases significantly lowered the mean values across models’ and humans’ code coverage and also influenced the standard deviation. Code coverage values ranging from very low (as in the mentioned test cases) to very high (100% coverage) led the standard deviation of both the models’ and humans’ test code coverage to range between 4.78 and 33.07, which is notably high.
These results do not provide a clear answer as to whether a certain prompting method or model was the most successful in generating test code. Models differ in how they performed on the assigned task. Overall, among the models, it would seem that Claude with baseline prompting has the highest values of code coverage across difficulties and 0 execution failures. Llama exhibits a similar pattern, with baseline prompting generating codes with the highest code coverage. Contrary to Claude and Llama, in terms of median, Mistral performed best with rich-instruction prompting. The initial speculation was that an increase in details given in the prompt would result in better tests being generated by the models. This is not reflected in the results, as the simplest prompt methods, baseline and rich instruction, resulted in LLMs producing tests with lower execution failure rates and higher code coverage than in the more detailed example and in chain-of-thought prompting conditions. Nonetheless, these models, regardless of the prompting method, did not perform as well as the human-written code.
The average test coverage of the GenAI models exhibits a decreasing pattern with increasing difficulty. That is, both humans and LLMs were highly successful in writing very easy and easy tests, and their performance decreased slightly for medium-difficulty tests and decreased significantly for hard-difficulty tests. To better illustrate these findings, they are visualized in Figure 3a–d.
The charts visualize the performance of both human and model tests for each of the prompting methods. The main finding is that LLMs perform nearly as well as humans on very easy and easy code (in some cases slightly better), but significantly worse on medium and hard code. This performance drop is already evident for the relatively simple cases used in this study, which require only some domain and mathematical knowledge, mocking and implementation. Performance can be expected to drop further as the difficulty of the test-writing increases, which would likely be the case in the majority of practical applications. Our results suggest that while LLMs are proving to be effective tools for automating test generation for straightforward tasks, they still struggle with more demanding coding scenarios that humans navigate more effectively, highlighting that a gap remains between AI and human expertise in software development.

3.2. Similarity of GenAI and Human-Written Test Code

As an extension to evaluating how well the chosen GenAI models perform alone, we also examined how similar GenAI models’ generated test codes were to the human-written ones. It can be assumed that the human-written test codes act as a ‘gold standard’ for comparison, and, as seen in the previous tables, the number of execution failures was 0 and the median code coverage was high for the human-written code, which suggests a high quality of code. Moreover, it would be insightful to examine the similarities between the generated codes themselves.
To establish whether generative AI produces similar results overall, regardless of the model, or if there are significant differences between the models’ code generation, the GenAI models’ test codes and human-written test codes were compared by first transforming the test codes into semantic representations, then converting them into embeddings, and finally, calculating the embedding similarity metrics of Euclidean distance, cosine similarity and dot product. These were calculated by measuring individual code similarities for model- and human-written test codes for each source code case and for each prompting method separately. Then, these values were averaged across source code cases, per prompting method, to produce the final average embedding similarity measures. The results of this comparison can be seen in Table 5, Table 6, Table 7 and Table 8.
To interpret these results, it is necessary to consider that embedding similarity rises with a decrease in Euclidean distance (a higher distance implies lower similarity) and rises with an increase in cosine similarity and dot product (higher metric values imply higher similarity).
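For reference, the three metrics can be sketched in pure Python; the toy three-dimensional vectors below stand in for the much higher-dimensional embeddings actually used in the study:

```python
from math import sqrt

def dot(a, b):
    """Dot product: higher values imply higher similarity."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """Euclidean distance: higher values imply lower similarity."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine similarity in [-1, 1]: higher values imply higher similarity."""
    return dot(a, b) / (sqrt(dot(a, a)) * sqrt(dot(b, b)))

# Toy "embeddings" of two test codes; v is a scaled copy of u, so the
# cosine similarity is ~1 even though the Euclidean distance is non-zero.
u, v = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(euclidean(u, v), cosine(u, v), dot(u, v))
```

This also makes the interpretation above concrete: scaling a vector changes the Euclidean distance and dot product but leaves the cosine similarity unchanged.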
Table 5 presents the embedding similarity results comparing the baseline-prompted models' generated tests with each other and with human-written ones. Inspecting the Euclidean distance, cosine similarity and dot product metrics, it is evident that the Claude, Llama and Mistral models produced tests that were similar to each other, especially Llama and Mistral, with the lowest distance of 8.19. The Claude–Llama and Claude–Mistral similarities were high and nearly equal; comparing cosine similarity, for example, the values were 0.80 and 0.79. On the other hand, the models' tests were all roughly equally distant from the human-written tests, with Claude's tests the closest to the human ones and Llama's the furthest. As such, we can say that with baseline prompting the models generated tests in a highly similar way to each other, but distinctly differently from human tests.
Table 6 summarizes the embedding similarity results across the rich-instruction-prompted LLM models' and the programmers' tests. Similarly to the baseline prompting results, the Claude and Mistral models' tests are more similar to those of the other LLM models than to the human-written ones. Looking at the Euclidean distance, the Claude–Human and Mistral–Human distances are comparatively high (implying lower similarity), while the distances among the models are generally lower. Nonetheless, a difference can be observed for Llama: with rich-instruction prompting, it generated tests that were highly similar to the human tests (a Llama–Human distance of 8.32), and not especially different from the other models' tests either.
Table 7 compares the similarities between the programmers' tests and those of the example-prompted models. With this prompting method, the generated tests were less similar both to each other and to the human-written tests: the Euclidean distances range from 8.99 to 10.34, notably higher than the corresponding ranges for the baseline (8.19 to 9.33) and rich-instruction (8.30 to 9.00) methods. Contrary to the expectation that the LLMs, having been given a human-written example test, would produce code similar to the human tests, this was not the case for any of the three models.
Table 8 reports the embedding similarities for the human-written and chain-of-thought-prompted LLM-generated test codes. With this prompting method, the model–human distances are the highest observed across all methods (9.30 to 9.48), meaning that chain-of-thought prompting produced the tests least similar to the human-written ones, while some model pairs, notably Claude–Llama (7.75), remained close to each other. Moreover, as seen in the previous comparisons of execution failure and code coverage, none of the models were able to generate code as good as that of humans. Similarity to human-written code alone therefore does not guarantee better quality or functionality.
Firstly, in the analysis of the similarities between generated and human-written test codes, the Euclidean distance across models and prompting methods varied between 8.32 and 9.48. The rich-instruction prompting method stands out here, as the model–human distances for all models were relatively low, indicating high similarity. For the Llama and Mistral models, the values under this prompting method were the lowest, suggesting that these models align better with human tests when given rich instructions than with example, chain-of-thought or baseline prompting. Claude generated test code most like human code under the baseline instructions. The other two metrics, cosine similarity and dot product, offer similar conclusions: the highest cosine similarity was measured for the Llama and Mistral models with rich-instruction prompting, while Claude performed roughly equally well in the rich-instruction and baseline methods. Cosine similarity can range from −1 to 1; therefore, all of the measured similarities can be considered relatively high, indicating that the tests produced by humans and models are rather similar. However, it needs to be considered that the LLM-generated and human-written tests were first transformed into code representations, as detailed in the Similarity of GenAI- and Human-Written Test Code subsection of the Experiments section, which likely caused the tests to grow in similarity.
Secondly, in the analysis of the similarities among the generated test codes themselves, the generated test codes were overall more similar to each other than to the human-written ones. For example, for the chain-of-thought prompting method, Table 8 shows that while the model–human Euclidean distances were 9.48, 9.30 and 9.44, the model–model distances were lower, at 7.75, 8.81 and 9.14, indicating that the models' tests were more similar to each other than to the human tests. The type of prompting also influenced how similar the generated code was to the human-written code: the generated code was most similar to human code with rich-instruction prompting, followed by baseline, example and chain-of-thought prompting. This indicates that to generate tests which are semantically similar to human-written code, one should employ rich-instruction over, e.g., chain-of-thought prompting, which performed most poorly in this respect.
It is also necessary to consider that the tests generated by the AI models were always written with unittest, while the human tests used unittest, pytest or plain assert statements. These different frameworks shape the structure of the code, which may have contributed to the generated codes being more similar to each other than to the human codes. Moreover, while the generative AI models preferred an explicit step-wise pattern (separating value assignment, execution of the code under test, and assertion of the expected result across multiple lines), human code was often structured more compactly, performing all three actions in a single statement on one line. These two points give some insight into possible reasons why the generated codes were more similar to each other than to human code; however, they do not offer any clear insight into their quality.
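To illustrate the structural difference described above, the two styles might look as follows for a hypothetical `add` function (this example is ours, not taken from the study's dataset):

```python
import unittest

def add(a, b):
    """Hypothetical function under test (not from the study's dataset)."""
    return a + b

# LLM-style: explicit arrange/act/assert spread over several lines of a
# unittest.TestCase, the pattern the models consistently produced.
class TestAdd(unittest.TestCase):
    def test_add(self):
        a = 2                          # arrange
        b = 3
        result = add(a, b)             # act
        self.assertEqual(result, 5)    # assert

# Human-style: the same check collapsed into one pytest-style statement.
def test_add_compact():
    assert add(2, 3) == 5
```

Both tests check the same behaviour, but their token sequences, and hence their embeddings, differ considerably, which supports the framework-effect explanation above.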
In order to evaluate the statistical significance of differences in Euclidean distances between the model-generated and human-written outputs, we conducted a Wilcoxon signed-rank test across prompting strategies (Table 9). We compared pairs of models (among Claude, Llama and Mistral) against human references under the four prompting methods. The results revealed a significant difference between most of the pairs under all of the prompting instructions, indicating that the choice of Claude, Llama, Mistral or a human agent matters when generating test codes. However, no significant differences were found between the Claude–Human and Llama–Human pairs under the baseline (p = 0.77) and chain-of-thought (p = 0.09) instructions, signifying that under these instructions it does not make a difference whether Claude or Llama is chosen over a human agent. One possible explanation is that both Claude and Llama are similarly aligned with human-like reasoning under minimal or structured-reasoning prompts. Likewise, for the Claude–Human and Mistral–Human pairs, no statistical difference was found under the rich-instruction (p = 0.16) and example (p = 0.37) prompting methods, suggesting that with these prompting styles the difference between using Claude or Mistral over a human agent is not impactful. The reason may be that richer or example-based prompts help standardize the output structure across models, reducing variability. Finally, comparing the Claude–Llama and Claude–Mistral pairs, the chain-of-thought prompt was the only one under which no significant difference was found (p = 0.25), meaning that only with this prompt type does the choice between Llama and Mistral relative to Claude not matter, possibly because the structured reasoning this prompt imposes brings the two models' outputs closer together.
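For illustration, a minimal pure-Python version of the two-sided Wilcoxon signed-rank test could look as follows. This sketch uses the large-sample normal approximation; the study presumably used a statistics library such as SciPy, which also offers exact p-values for small samples.

```python
from math import erfc, sqrt

def wilcoxon_p(x, y):
    """Two-sided Wilcoxon signed-rank p-value via the normal
    approximation; zero differences are dropped and tied absolute
    differences receive average ranks. Sketch only."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):            # average rank over the tie run
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mu = n * (n + 1) / 4
    sigma = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return erfc(abs(w_plus - mu) / sigma / sqrt(2))

# Perfectly symmetric paired differences give no evidence of a shift.
print(wilcoxon_p([1, 2, 3, 4], [2, 1, 4, 3]))  # 1.0
```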
Overall, these findings underscore that the choice of agent (whether Claude, Llama, Mistral or a human) plays a primary role in determining the quality and alignment of test code generation, while prompt style serves as a secondary factor that can modulate the inherent differences between agents.
Table 10 presents the results of the Kruskal–Wallis H-test performed on the Euclidean distances between embeddings obtained across the different prompting methods. The test was conducted to determine whether the distributions of distances between embeddings vary significantly depending on the combination of agents involved, including the models (Claude, Llama, Mistral) and humans. For each prompting method (baseline, rich instructions, example and chain-of-thought), the embedding distances were grouped by agent pairing (e.g., Claude–Llama, Claude–Human), and a non-parametric H-test was conducted to assess statistical differences between these groups. For baseline, rich instructions and chain-of-thought, multiple comparisons yielded p-values below the 0.05 threshold, indicating statistically significant variation in embedding distances across the agent combinations. In contrast, the example prompt stood out as the only condition where no significant differences were found, suggesting that this prompt type leads to more consistent embedding representations across models. This similarity can be explained by a characteristic of the prompting technique used: with the example prompt, all models follow the arrange–act–assert pattern. The largest discrepancies were observed in comparisons involving human responses, which consistently produced higher H-values, suggesting that model-generated embeddings differ more noticeably from human-produced embeddings than from each other.
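As a sketch of the procedure, the Kruskal–Wallis H statistic for three groups of distances can be computed as follows. The pooled values are assumed distinct (so no tie correction is applied), and for exactly three groups the chi-square survival function with two degrees of freedom simplifies to exp(-H/2); the study presumably used a statistics library rather than this hand-rolled version.

```python
from math import exp

def kruskal_wallis_3(groups):
    """Kruskal-Wallis H statistic for exactly three groups of distinct
    values, with the df = 2 chi-square p-value exp(-H / 2). Sketch only:
    no tie correction."""
    pooled = sorted(v for g in groups for v in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # 1-based pooled ranks
    n = len(pooled)
    h = 12.0 / (n * (n + 1)) * sum(
        sum(rank[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)
    return h, exp(-h / 2)

# Three well-separated groups of pairwise distances (illustrative values).
h, p = kruskal_wallis_3([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(h, 2), p < 0.05)  # 7.2 True
```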

4. Discussion

This study explored the capabilities of LLMs in generating test codes and compared them to human performance. The main finding is a clear decrease in LLM effectiveness as task difficulty increases. For simple generation tasks requiring little domain expertise, LLMs performed comparably to humans. However, their performance dropped significantly for more complex problems, highlighting a performance gap between GenAI and human testers.
It is important to recognize that this study did not simulate real-world industrial scenarios. The source code used was largely adapted from educational examples or non-professional contributions, and sometimes even AI-assisted. In practice, production code tends to be significantly more intricate, and LLMs may not generalize well to such complexity. Future work should extend these findings by assessing GenAI’s performance on real-world codebases.
One of the most promising directions is the integration of human feedback into both evaluation and test generation. Developers can assess AI-generated tests for correctness, coverage and fault detection—especially edge cases—and guide models iteratively to improve outputs. This feedback-based approach aligns with emerging software co-development practices, where code generation and testing form a tight feedback loop [29,30].
Although the models used in this study were not fine-tuned for the specific task, they followed coding conventions, likely due to pre-training on large-scale repositories. While this enables them to mimic common patterns, it limits their effectiveness for domain-specific or uncommon logic.
In practice, GenAI test generation could be embedded into developer environments using tools like GitHub Copilot, providing not just code suggestions but also real-time test generation. This could make the testing process more fluid and integrated.
Ultimately, more research is needed on incorporating GenAI into real-world programming workflows, especially for complex or critical systems.

5. Conclusions

This study assessed the potential of generative AI, specifically LLMs, in automating unit test generation in Python. Python was chosen due to its popularity and extensive use in LLM training corpora, which means code generation in this language should be well within the models’ capabilities. Nevertheless, our results reveal clear limitations.
We used state-of-the-art foundation models via Amazon Bedrock and explored multiple prompt-engineering approaches. The quality and similarity of AI-generated test cases were analyzed using embeddings.
Our results show that LLMs are capable of producing effective tests only for very easy and easy code cases. Their performance drops considerably when facing medium or hard problems, both in terms of execution success and code coverage. Increasing prompt complexity does not consistently improve test quality, suggesting that more context does not guarantee better results.
A notable insight from our embedding-based similarity analysis is the lack of diversity in AI-generated test outputs. While structurally similar to each other, these tests are semantically different from human-written ones, limiting their adaptability in diverse testing scenarios.
Thus, to increase the practical utility of LLMs in software testing, future improvements are necessary in three main areas: fine-tuning models for testing tasks, designing more targeted prompts, and embedding human feedback loops. Until then, LLMs should be considered a supplementary tool for basic prototyping rather than a standalone solution for quality assurance in complex systems.

Author Contributions

Conceptualization, M.K. and J.S.; methodology, J.S.; software, M.A. and N.D.; validation, M.A. and N.D.; formal analysis, N.D.; investigation, M.A.; resources, M.K.; data curation, N.D. and J.P.; writing—original draft preparation, M.A. and N.D.; writing—review and editing, M.A. and N.D.; visualization, M.A. and N.D.; supervision, J.S.; project administration, J.S.; funding acquisition, M.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Capgemini, grant number A01. The APC was funded by Capgemini.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon reasonable request from the corresponding author. Due to legal and contractual restrictions, the code cannot be made publicly available. Specifically, the source and test code pairs were developed/modified during working hours as part of employment at Capgemini, and are therefore subject to the company’s intellectual property rights and confidentiality agreements.

Acknowledgments

The authors would like to thank Mateusz Sztukowski for providing financial and organizational support, which enabled the development of this work. We also thank Bartosz Chowański for his support and for facilitating the organizational framework. We further acknowledge Łukasz Polakiewicz for his valuable contributions during the initial phase of the article.

Conflicts of Interest

Authors Marcin Andrzejewski, Nina Dubicka, Jędrzej Podolak, Marek Kowal and Jakub Siłka are employees at Capgemini, which is the funder of this research. The funder had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Minaee, S.; Mikolov, T.; Nikzad, N.; Chenaghlu, M.; Socher, R.; Amatriain, X.; Gao, J. Large Language Models: A Survey. arXiv 2024, arXiv:2402.06196. [Google Scholar]
  2. Runeson, P. A survey of unit testing practices. IEEE Softw. 2006, 23, 22–29. [Google Scholar] [CrossRef]
  3. Olan, M. Unit testing: Test early, test often. J. Comput. Sci. Coll. 2003, 19, 319–328. [Google Scholar]
  4. McMinn, P. Search-Based Software Testing: Past, Present and Future. In Proceedings of the 2011 IEEE Fourth International Conference on Software Testing, Verification and Validation Workshops, Berlin, Germany, 21–25 March 2011; pp. 153–163. [Google Scholar] [CrossRef]
  5. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
  6. Jiang, J.; Wang, F.; Shen, J.; Kim, S.; Kim, S. A Survey on Large Language Models for Code Generation. arXiv 2024, arXiv:2406.00515. [Google Scholar]
  7. Wang, J.; Huang, Y.; Chen, C.; Liu, Z.; Wang, S.; Wang, Q. Software Testing with Large Language Models: Survey, Landscape, and Vision. arXiv 2024, arXiv:2307.07221. [Google Scholar]
  8. Pizzorno, J.A.; Berger, E.D. CoverUp: Coverage-Guided LLM-Based Test Generation. arXiv 2024, arXiv:2403.16218. [Google Scholar]
  9. Ryan, G.; Jain, S.; Shang, M.; Wang, S.; Ma, X.; Ramanathan, M.K.; Ray, B. Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM. arXiv 2024, arXiv:2402.00097. [Google Scholar]
  10. Wong, M.F.; Tan, C.W. Aligning Crowd-Sourced Human Feedback for Reinforcement Learning on Code Generation by Large Language Models. arXiv 2025, arXiv:2503.15129. [Google Scholar]
  11. Chen, B.; Zhang, F.; Nguyen, A.; Zan, D.; Lin, Z.; Lou, J.G.; Chen, W. CodeT: Code Generation with Generated Tests. arXiv 2022, arXiv:2207.10397. [Google Scholar]
  12. Bi, Z.; Zhang, N.; Jiang, Y.; Deng, S.; Zheng, G.; Chen, H. When Do Program-of-Thoughts Work for Reasoning? arXiv 2023, arXiv:2308.15452. [Google Scholar]
  13. Astels, D. Test Driven Development: A Practical Guide; Prentice Hall Professional Technical Reference: Upper Saddle River, NJ, USA, 2003. [Google Scholar]
  14. Reese, J. Unit Testing Best Practices for .NET. Microsoft Learn. 2025. Available online: https://learn.microsoft.com/en-us/dotnet/core/testing/unit-testing-best-practices (accessed on 12 April 2025).
  15. Introducing Meta Llama 3: The Most Capable Openly Available LLM to Date. 2024. Available online: https://ai.meta.com/blog/meta-llama-3/ (accessed on 4 July 2024).
  16. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar]
  17. Team, V. Llama 3 8B vs. Mistral 7B: Small LLM Pricing Considerations. 2024. Available online: https://www.vantage.sh/blog/best-small-llm-llama-3-8b-vs-mistral-7b-cost (accessed on 4 July 2024).
  18. Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku. Claude-3 Model Card. 2024. Available online: https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf (accessed on 6 April 2025).
  19. Ivanković, M.; Petrović, G.; Just, R.; Fraser, G. Code coverage at Google. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Tallinn, Estonia, 26–30 August 2019; pp. 955–963. [Google Scholar]
  20. Tang, Y.; Liu, Z.; Zhou, Z.; Luo, X. Chatgpt vs sbst: A comparative assessment of unit test suite generation. IEEE Trans. Softw. Eng. 2024, 50, 1340–1359. [Google Scholar] [CrossRef]
  21. Chen, Z.; Monperrus, M. A literature study of embeddings on source code. arXiv 2019, arXiv:1904.03061. [Google Scholar]
  22. Kenter, T.; De Rijke, M. Short text similarity with word embeddings. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, Melbourne, Australia, 18–23 October 2015; pp. 1411–1420. [Google Scholar]
  23. Pouly, M. Estimating Text Similarity based on Semantic Concept Embeddings. arXiv 2024, arXiv:2401.04422. [Google Scholar]
  24. Steck, H.; Ekanadham, C.; Kallus, N. Is cosine-similarity of embeddings really about similarity? In Proceedings of the Companion Proceedings of the ACM on Web Conference 2024, Singapore, 13–17 May 2024; pp. 887–890. [Google Scholar]
  25. Chen, B.; Zhang, Z.; Langrené, N.; Zhu, S. Unleashing the potential of prompt engineering in Large Language Models: A comprehensive review. arXiv 2023, arXiv:2310.14735. [Google Scholar]
  26. Shanahan, M.; McDonell, K.; Reynolds, L. Role play with large language models. Nature 2023, 623, 493–498. [Google Scholar] [CrossRef]
  27. Logan IV, R.L.; Balažević, I.; Wallace, E.; Petroni, F.; Singh, S.; Riedel, S. Cutting down on prompts and parameters: Simple few-shot learning with language models. arXiv 2021, arXiv:2106.13353. [Google Scholar]
  28. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
  29. Beer, R.; Feix, A.; Guttzeit, T.; Muras, T.; Müller, V.; Rauscher, M.; Schäffler, F.; Löwe, W. Examination of Code generated by Large Language Models. arXiv 2024, arXiv:2408.16601. [Google Scholar]
  30. Wang, Y.; Guo, S.; Tan, C.W. From code generation to software testing: AI Copilot with context-based RAG. IEEE Softw. 2025, 42, 34–42. [Google Scholar] [CrossRef]
Figure 1. The process of creating the database. First, a set of code cases was obtained, which was validated by the team. Then, the tests were created. Based on the same code cases, tests were also generated using models. Finally, all the data was transformed into a separate database, which was subjected to further testing.
Figure 2. Method of obtaining the test similarity values. The tests initially generated by both the human and the models, shown above in yellow, were sent to another model (orange), which modified their syntax so that code fragments performing similar tasks were replaced by semantically similar words, after which centroids were calculated, and then the final distances from them were measured.
Figure 3. Human and AI test performance (average code coverage) across test difficulty categories for different prompting strategies. (a) Baseline prompt. (b) Rich-instruction prompt. (c) Example prompt. (d) Chain-of-thought prompt.
Table 1. Execution failure and code coverage statistics of models with baseline prompting compared to human performance.
| Model | Difficulty | Execution Failure [%] | Coverage Mean [%] | Coverage Median [%] | Coverage SD |
| --- | --- | --- | --- | --- | --- |
| Claude | Very Easy | 0.00 | 91.63 | 100.00 | 19.68 |
| Claude | Easy | 0.00 | 84.46 | 91.95 | 19.91 |
| Claude | Medium | 0.00 | 56.63 | 51.02 | 16.84 |
| Claude | Hard | 0.00 | 29.15 | 30.48 | 5.91 |
| Llama | Very Easy | 0.00 | 97.83 | 100.00 | 4.57 |
| Llama | Easy | 0.00 | 86.43 | 92.67 | 16.16 |
| Llama | Medium | 11.00 | 46.83 | 53.26 | 25.07 |
| Llama | Hard | 0.00 | 28.70 | 29.09 | 6.07 |
| Mistral | Very Easy | 5.00 | 87.42 | 100.00 | 28.05 |
| Mistral | Easy | 0.00 | 79.53 | 84.06 | 18.86 |
| Mistral | Medium | 0.00 | 66.34 | 60.00 | 20.41 |
| Mistral | Hard | 0.00 | 27.31 | 29.09 | 7.53 |
| Human | Very Easy | 0.00 | 96.04 | 100.00 | 9.61 |
| Human | Easy | 0.00 | 86.53 | 100.00 | 18.71 |
| Human | Medium | 0.00 | 81.97 | 79.59 | 11.96 |
| Human | Hard | 0.00 | 55.40 | 43.64 | 28.45 |
Table 2. Execution failure and code coverage statistics of models with rich-instruction prompting compared to human performance.
| Model | Difficulty | Execution Failure [%] | Coverage Mean [%] | Coverage Median [%] | Coverage SD |
| --- | --- | --- | --- | --- | --- |
| Claude | Very Easy | 0.00 | 79.63 | 100.00 | 32.04 |
| Claude | Easy | 0.00 | 79.94 | 82.13 | 18.32 |
| Claude | Medium | 0.00 | 57.50 | 51.02 | 26.53 |
| Claude | Hard | 0.00 | 30.05 | 30.88 | 5.10 |
| Llama | Very Easy | 0.00 | 92.11 | 100.00 | 12.84 |
| Llama | Easy | 0.00 | 87.54 | 91.95 | 14.06 |
| Llama | Medium | 0.00 | 56.59 | 52.54 | 16.93 |
| Llama | Hard | 0.00 | 26.98 | 29.09 | 8.66 |
| Mistral | Very Easy | 5.00 | 82.05 | 100.00 | 32.04 |
| Mistral | Easy | 0.00 | 83.93 | 86.05 | 14.10 |
| Mistral | Medium | 11.11 | 40.51 | 46.01 | 29.93 |
| Mistral | Hard | 0.00 | 31.36 | 30.48 | 4.78 |
| Human | Very Easy | 0.00 | 96.04 | 100.00 | 9.61 |
| Human | Easy | 0.00 | 86.53 | 100.00 | 18.71 |
| Human | Medium | 0.00 | 81.97 | 79.59 | 11.96 |
| Human | Hard | 0.00 | 55.40 | 43.64 | 28.45 |
Table 3. Execution failure and code coverage statistics of models with example prompting compared to human performance.
| Model | Difficulty | Execution Failure [%] | Coverage Mean [%] | Coverage Median [%] | Coverage SD |
| --- | --- | --- | --- | --- | --- |
| Claude | Very Easy | 10.00 | 81.08 | 100.00 | 33.07 |
| Claude | Easy | 0.00 | 84.41 | 90.98 | 16.28 |
| Claude | Medium | 11.11 | 46.04 | 52.94 | 25.36 |
| Claude | Hard | 0.00 | 28.70 | 29.09 | 6.07 |
| Llama | Very Easy | 5.00 | 87.14 | 100.00 | 25.53 |
| Llama | Easy | 0.00 | 80.04 | 81.25 | 15.01 |
| Llama | Medium | 0.00 | 65.49 | 56.78 | 23.36 |
| Llama | Hard | 0.00 | 27.56 | 25.00 | 6.19 |
| Mistral | Very Easy | 0.00 | 84.93 | 100.00 | 23.78 |
| Mistral | Easy | 0.00 | 80.04 | 81.25 | 15.01 |
| Mistral | Medium | 0.00 | 53.33 | 50.85 | 22.83 |
| Mistral | Hard | 0.00 | 27.31 | 29.09 | 7.53 |
| Human | Very Easy | 0.00 | 96.04 | 100.00 | 9.61 |
| Human | Easy | 0.00 | 86.53 | 100.00 | 18.71 |
| Human | Medium | 0.00 | 81.97 | 79.59 | 11.96 |
| Human | Hard | 0.00 | 55.40 | 43.64 | 28.45 |
Table 4. Execution failure and code coverage statistics of models with chain-of-thought prompting compared to human performance.
| Model | Difficulty | Execution Failure [%] | Coverage Mean [%] | Coverage Median [%] | Coverage SD |
| --- | --- | --- | --- | --- | --- |
| Claude | Very Easy | 0.00 | 88.16 | 100.00 | 22.78 |
| Claude | Easy | 0.00 | 80.53 | 90.44 | 21.78 |
| Claude | Medium | 44.44 | 37.58 | 38.77 | 40.28 |
| Claude | Hard | 0.00 | 28.79 | 30.48 | 7.41 |
| Llama | Very Easy | 0.00 | 88.57 | 100.00 | 22.07 |
| Llama | Easy | 0.00 | 73.71 | 75.26 | 18.03 |
| Llama | Medium | 11.11 | 45.58 | 48.85 | 20.79 |
| Llama | Hard | 0.00 | 27.03 | 29.09 | 7.14 |
| Mistral | Very Easy | 0.00 | 89.35 | 100.00 | 18.55 |
| Mistral | Easy | 0.00 | 80.04 | 85.65 | 18.77 |
| Mistral | Medium | 22.22 | 39.22 | 38.78 | 30.17 |
| Mistral | Hard | 0.00 | 29.93 | 29.09 | 5.47 |
| Human | Very Easy | 0.00 | 96.04 | 100.00 | 9.61 |
| Human | Easy | 0.00 | 86.53 | 100.00 | 18.71 |
| Human | Medium | 0.00 | 81.97 | 79.59 | 11.96 |
| Human | Hard | 0.00 | 55.40 | 43.64 | 28.45 |
Table 5. Average embedding similarity measures of test code written by models with baseline prompting and humans: Euclidean distance, cosine similarity and dot product.
|  | Claude | Llama | Mistral | Human |
| --- | --- | --- | --- | --- |
| Claude |  | 8.40, 0.80, 150.80 | 8.50, 0.79, 148.64 | 8.80, 0.77, 136.41 |
| Llama |  |  | 8.19, 0.82, 156.49 | 9.33, 0.75, 139.83 |
| Mistral |  |  |  | 9.07, 0.76, 138.47 |
| Human |  |  |  |  |
Table 6. Average embedding similarity measures of test code written by models with rich-instruction prompting and humans: Euclidean distance, cosine similarity and dot product.
|  | Claude | Llama | Mistral | Human |
| --- | --- | --- | --- | --- |
| Claude |  | 8.57, 0.78, 141.60 | 8.99, 0.76, 139.15 | 9.00, 0.76, 133.62 |
| Llama |  |  | 8.30, 0.80, 147.58 | 8.32, 0.80, 143.36 |
| Mistral |  |  |  | 8.65, 0.78, 142.82 |
| Human |  |  |  |  |
Table 7. Average embedding similarity measures of test code written by models with example prompting and humans: Euclidean distance, cosine similarity and dot product.
ClaudeLlamaMistralHuman
Claude 9.48, 0.73, 129.8810.34, 0.70,
126.98
9.30, 0.75, 131.36
Llama 8.99, 0.77, 145.909.23, 0.76, 138.43
Mistral 9.44, 0.75, 139.26
Human
Table 8. Average embedding similarity measures of test code written by models with chain-of-thought prompting and humans: Euclidean distance, cosine similarity and dot product.

| | Claude | Llama | Mistral | Human |
|---|---|---|---|---|
| Claude | | 7.75, 0.81, 146.01 | 8.81, 0.77, 141.69 | 9.48, 0.74, 133.83 |
| Llama | | | 9.14, 0.76, 138.38 | 9.30, 0.75, 134.56 |
| Mistral | | | | 9.44, 0.75, 138.40 |
| Human | | | | |
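The three similarity measures reported in Tables 5–8 are standard operations on embedding vectors. A minimal pure-Python sketch (the function names are our own; the embedding model used in the study is not reproduced here):

```python
import math

def dot(u, v):
    # Dot product: unnormalised similarity of two embedding vectors.
    return sum(a * b for a, b in zip(u, v))

def euclidean(u, v):
    # Euclidean (L2) distance: lower values mean more similar embeddings.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    # Cosine similarity: dot product normalised by magnitudes, in [-1, 1].
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Two toy vectors pointing in the same direction: cosine similarity is 1,
# yet Euclidean distance is non-zero, which is why the tables report both.
u, v = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(euclidean(u, v), cosine(u, v), dot(u, v))
```

In the tables, each cell averages these measures over all pairs of test files produced by the two parties named in its row and column.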
Table 9. The p-values of the Wilcoxon signed-rank test for Euclidean distance between pairs of model- and human-written code.

| Prompt | First Pair | Second Pair | p-Value |
|---|---|---|---|
| Baseline | Claude–Human | Llama–Human | 0.77 |
| Baseline | Claude–Human | Mistral–Human | <0.001 |
| Baseline | Llama–Human | Mistral–Human | <0.001 |
| Baseline | Claude–Llama | Claude–Mistral | <0.001 |
| Rich Instructions | Claude–Human | Llama–Human | <0.001 |
| Rich Instructions | Claude–Human | Mistral–Human | 0.16 |
| Rich Instructions | Llama–Human | Mistral–Human | <0.001 |
| Rich Instructions | Claude–Llama | Claude–Mistral | <0.001 |
| Example | Claude–Human | Llama–Human | 0.02 |
| Example | Claude–Human | Mistral–Human | 0.37 |
| Example | Llama–Human | Mistral–Human | 0.03 |
| Example | Claude–Llama | Claude–Mistral | 0.01 |
| Chain-of-Thought | Claude–Human | Llama–Human | 0.09 |
| Chain-of-Thought | Claude–Human | Mistral–Human | 0.01 |
| Chain-of-Thought | Llama–Human | Mistral–Human | <0.001 |
| Chain-of-Thought | Claude–Llama | Claude–Mistral | 0.25 |
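The comparisons in Table 9 pair two distance samples case by case, which is what the Wilcoxon signed-rank test expects. A sketch using `scipy.stats.wilcoxon` on hypothetical per-case distances (the values are illustrative, not the study's data):

```python
from scipy.stats import wilcoxon

# Hypothetical per-case Euclidean distances, paired by source-code case.
d_claude_human = [8.10, 9.20, 8.70, 7.90, 8.40, 9.00, 8.80, 7.60]
d_llama_human  = [8.31, 9.12, 8.93, 8.04, 8.25, 9.36, 9.07, 7.82]

# Tests whether the paired differences are symmetrically centred on zero.
stat, p = wilcoxon(d_claude_human, d_llama_human)
print(f"W={stat:.2f}, p={p:.3f}")
```

A p-value above the chosen significance level (e.g. 0.05, as for the 0.77 in the first row of Table 9) means the two model–human distance distributions cannot be distinguished.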
Table 10. The results of the Kruskal–Wallis H-test for samples regarding distance between embeddings.

| Prompt | Samples From | H Statistic | p-Value |
|---|---|---|---|
| Baseline | Claude–Llama, Claude–Mistral, Llama–Mistral | 62.72 | 0.02 |
| Baseline | Claude–Human, Llama–Human, Mistral–Human | 76.15 | <0.001 |
| Baseline | All of the above | 84.11 | <0.001 |
| Rich Instructions | Claude–Llama, Claude–Mistral, Llama–Mistral | 68.86 | <0.001 |
| Rich Instructions | Claude–Human, Llama–Human, Mistral–Human | 52.04 | 0.22 |
| Rich Instructions | All of the above | 57.95 | 0.04 |
| Example | Claude–Llama, Claude–Mistral, Llama–Mistral | 33.34 | 0.12 |
| Example | Claude–Human, Llama–Human, Mistral–Human | 50.87 | 0.25 |
| Example | All of the above | 29.50 | 0.24 |
| Chain-of-Thought | Claude–Llama, Claude–Mistral, Llama–Mistral | 47.13 | 0.01 |
| Chain-of-Thought | Claude–Human, Llama–Human, Mistral–Human | 77.21 | <0.001 |
| Chain-of-Thought | All of the above | 44.39 | 0.03 |
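Unlike the pairwise Wilcoxon test, the Kruskal–Wallis H-test in Table 10 compares three or more independent distance samples at once. A sketch using `scipy.stats.kruskal` on hypothetical samples (again illustrative, not the study's data):

```python
from scipy.stats import kruskal

# Hypothetical Euclidean-distance samples for three model-human pairings.
claude_human  = [8.8, 9.1, 8.5, 9.4, 8.9, 9.0]
llama_human   = [9.3, 9.6, 9.1, 9.8, 9.4, 9.2]
mistral_human = [9.0, 9.2, 8.8, 9.5, 9.1, 9.3]

# Tests whether at least one sample's median differs from the others.
h, p = kruskal(claude_human, llama_human, mistral_human)
print(f"H={h:.2f}, p={p:.3f}")
```

A significant result only indicates that some group differs; it does not identify which one, which is why Table 9's pairwise comparisons accompany it.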
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Andrzejewski, M.; Dubicka, N.; Podolak, J.; Kowal, M.; Siłka, J. Automated Test Generation Using Large Language Models. Data 2025, 10, 156. https://doi.org/10.3390/data10100156
