Automated Test Generation Using Large Language Models
Abstract
1. Introduction
2. Materials and Methods
2.1. Dataset
- Hard: These codes require a deep understanding of complex mathematical concepts, such as the ability to design and test fractal dimensionality calculations, along with advanced implementation skills in using Python libraries tailored for complex computations and visualizations.
- Medium: These codes demand domain-specific knowledge to create appropriate test cases. For example, testing the logic of a chess game requires understanding the game’s rules. Additionally, these codes often involve mocking certain parts of the logic, like I/O operations, to ensure unit tests run in isolation, following best practices [14] (a minimal mocking sketch follows this list).
- Easy and Very Easy: The remaining codes were classified based on their use of programming concepts such as abstraction, inheritance, or design patterns.
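As an illustration of the isolation practice mentioned in the Medium category above, the sketch below mocks a file-reading helper with Python's standard unittest.mock so that the test never touches the file system. The source_code functions load_scores and average_score are hypothetical names used only for this example, not functions from the dataset.

```python
from unittest.mock import patch

import source_code  # module under test, imported as required by the prompts


def test_average_score_with_mocked_io():
    # Arrange: replace the (hypothetical) file-reading helper so no real I/O happens
    with patch.object(source_code, "load_scores", return_value=[2, 4, 6]):
        # Act: call the tested logic, which internally calls load_scores
        result = source_code.average_score("scores.txt")
    # Assert: the computation is verified in isolation from the file system
    assert result == 4
```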
- Imports of libraries which are used in the program.
- Functions or classes: the core logic of the program. Each is defined with the clear purpose of performing a specific task.
- Imports of:
  - The libraries (such as unittest or pytest) required for testing;
  - The libraries which are used in the test;
  - The source code.
- Test cases (a short sketch of both styles follows this list):
  - In the case of using the unittest library:
    - Style: Formal and object-oriented.
    - Structure: Tests are organized into classes that inherit from unittest.TestCase. Each method within the class corresponds to a specific use case.
    - Assertions: Use of assertion methods such as self.assertEqual, self.assertTrue, etc.
    - Advantages: Well-organized and grouped tests, suitable for complex test suites, with built-in support for test discovery and reporting.
    - Disadvantages: More verbose and requires additional boilerplate.
  - In the case of using the pytest library or plain assert statements:
    - Style: Concise and declarative.
    - Structure: Tests are written as standalone functions, each beginning with the prefix test_.
    - Assertions: Use of Python’s built-in assert statement for validating expected outcomes.
    - Advantages: Minimal boilerplate, easy to read and write.
    - Disadvantages: Less structured and lacks the formality required in large-scale projects.
- Script execution clause.
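To make the two styles concrete, the following minimal sketch shows the same check written in both styles for a hypothetical add function imported from the source_code module; the function and its behaviour are illustrative and not taken from the dataset.

```python
import unittest

from source_code import add  # hypothetical function under test


# unittest style: class-based tests with assertion methods
class TestAdd(unittest.TestCase):
    def test_add_two_positive_numbers(self):
        self.assertEqual(add(2, 3), 5)

    def test_add_is_commutative(self):
        self.assertTrue(add(2, 3) == add(3, 2))


# pytest / plain-assert style: standalone functions prefixed with test_
def test_add_two_positive_numbers_plain():
    assert add(2, 3) == 5


# Script execution clause
if __name__ == "__main__":
    unittest.main()
```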
2.2. Models
- Llama 3 70B: According to the authors [15], this LLM achieves state-of-the-art results on commonly used benchmarks, including code generation. It has the advantage of being open-source, thus providing transparency into how it works, and it can potentially be hosted on premises.
- Mistral Large: Mistral models, similarly to Llama models, are open-source, and their license allows for commercial usage. The authors state that these models can achieve state-of-the-art results, with even the smallest available model performing comparably to larger models [16]. In practical terms, the smallest model has only 7 billion parameters, which may make it more cost-effective in large-scale projects. According to external research [17], Mistral 7B is also almost two-thirds less costly than the competing Llama 3 8B when used via Amazon Bedrock, allowing for greater accessibility.
- Claude 3 Sonnet: According to the authors [18], this LLM is capable of achieving state-of-the-art results; in the authors' benchmarks, it is reported to outperform one of the most widely known LLMs, GPT-4.
2.3. Experiments
2.3.1. Code Evaluation
- Code coverage—This is a quantitative measure of how much of the source code was executed when the test code was run. It provides insight into how effectively the test code exercises different parts of the source code. It is a widely used metric in software engineering research [19] as well as in research on the effectiveness of LLMs in generating unit tests [7]. In this experiment, both the GenAI-generated and the human-written test codes were instrumented using Coverage.py in conjunction with pytest in Python. The generated and written tests were checked against the same source code, and the coverage tool generated reports detailing the percentage of source code lines that were covered. The coverage percentages were then summarized with descriptive statistics (mean, median, and standard deviation) for each prompting method as well as for each GenAI model/human author and difficulty level.
- Execution Failure—In addition to code coverage, the number of execution failures was counted. This metric reflects the number of test cases that fail to execute due to errors in the test code itself. It has been used in research to evaluate the performance of LLMs [20]. Execution failure was measured using the same process as the coverage measure described above: if the coverage measurement did not return a result for a test case, or if the code coverage was 0, this was attributed to errors in the test code, and the test case for the specific model was labeled as exhibiting execution failure (a minimal sketch of this measurement step follows this list).
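A minimal sketch of this measurement step, assuming each test file is executed with pytest under the Coverage.py API and that the module under test is named source_code.py. The file names, and the use of the population standard deviation, are illustrative assumptions rather than details taken from the experimental setup.

```python
import statistics

import coverage
import pytest


def coverage_for(test_file: str) -> float | None:
    """Run one test file with pytest under Coverage.py and return the
    percentage of source_code.py lines it covers (file names illustrative)."""
    cov = coverage.Coverage(include=["source_code.py"])
    cov.start()
    pytest.main([test_file])   # execute the generated or human-written tests
    cov.stop()
    try:
        return cov.report()    # total covered percentage as a float
    except Exception:          # e.g. Coverage.py collected no data at all
        return None


def is_execution_failure(percent: float | None) -> bool:
    # Rule described above: no coverage result, or 0% coverage
    return percent is None or percent == 0.0


# Descriptive statistics per prompting method / model (or human) / difficulty
results = [coverage_for(f) for f in ["test_example.py"]]  # illustrative file list
covered = [p for p in results if not is_execution_failure(p)]
if covered:
    # Population SD shown here; the exact SD variant is not specified in the text
    print(statistics.mean(covered), statistics.median(covered), statistics.pstdev(covered))
```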
2.3.2. Embedding Similarity
- Euclidean distance—This measures the straight-line distance between two points in the multi-dimensional embedding space. It captures how far apart the embeddings are: the further apart they are, the more they differ. Euclidean distance is used as a basic metric for measuring text embedding similarity, among other uses [22].
- Cosine similarity—To complement the Euclidean distance, cosine similarity was also calculated between the embeddings. It measures the cosine of the angle between two embedding vectors and is one of the most commonly used metrics for comparing text similarity based on semantic embeddings [23]. It provides insight into how similar the directions of the embeddings are. Cosine similarity ranges from −1 to 1, where 1 means identical directions and −1 opposite directions. As such, higher cosine similarity values suggest the compared embeddings are more similar.
- Dot product—In addition to cosine similarity, a related measure was used: the dot product. It measures the magnitude of the overlap between embeddings. It is similar to cosine similarity but also takes the lengths of the vectors into account [24]. A higher dot product suggests the two compared embeddings are more similar (a minimal numpy sketch of all three metrics follows this list).
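A minimal numpy sketch of the three metrics applied to a pair of embedding vectors; the vectors here are illustrative placeholders for the test-code embeddings produced by the embedding model.

```python
import numpy as np

# Illustrative stand-ins for two test-code embedding vectors
a = np.array([0.8, 0.1, 0.3])
b = np.array([0.7, 0.2, 0.4])

euclidean = np.linalg.norm(a - b)   # straight-line distance: larger = more different
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # direction similarity in [-1, 1]
dot = np.dot(a, b)                  # overlap that also reflects vector magnitudes

print(euclidean, cosine, dot)
```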
2.4. Prompts
| Listing 1. The baseline prompt. |
| You are a programming assistant. Your task is to provide a unit test code snippet to the source code below. Please import the source code as 'source_code' module in the code snippet. {source code} |
| Listing 2. Example of issue with importing the source code. |
| from your_module import x, y, z # Replace with the actual module name |
| Listing 3. The prompt with rich instructions. |
| Act as a Python software test engineer. Your task is to provide a unit test code snippet to the source code below. Please import the source code as 'source_code' module in the code snippet. Each test must have the following parts: (1) Arrange: create and set necessary objects / data for the test (2) Act: call the tested method from source code and get the actual value (3) Assert: check the expected and actual values. Don’t forget to test edge cases and all possible exceptions which can be raised. <source_code> {source_code} </source_code> |
| Listing 4. The few-shot prompt. |
| Act as a Python software test engineer. Your task is to provide a unit test code snippet to the source code below. You also get a example of a unit test. Please import the source code as 'source_code' module in the code snippet. Each test must have the following parts: (1) Arrange: create and set necessary objects / data for the test. (2) Act: call the tested method from source code and get the actual value. (3) Assert: check expected and actual values. Don’t forget to test edge cases and all possible exceptions which can be raised. <example> {example} </example> <source_code> {source_code} </source_code> |
| Listing 5. The chain-of-thought prompt. |
| Act as a Python software test engineer. Your task is to provide the unit test code snippet to the source code below: <source_code> {example of the source code} </source_code> The source code is located in the separate file, so you must remember to import the source code as 'source_code' module in the code snippet. First you must check if the tested function works correctly in the following way: (1) Arrange: create and set necessary objects / data for the test. (2) Act: call the tested method from source code and get the actual value. (3) Assert: check expected and actual values. Then you must also check the edge cases and all possible exceptions which can be raised. Edge cases depend on the tested source code logic. In this case, it’s wise to check the wrong type of the argument and if the written exception is raised correctly. Other cases worth checking are when the input values are outside the range, a null value or an empty list/dict. To sum up, the final unit test code snippet should look like this: <example> {example of the unit tests} </example> Your task is to provide a unit test code snippet to the source code below: <source_code> {source code} </source_code> |
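The listings above are templates: at generation time, the placeholders are filled with the concrete source code (and, for the few-shot and chain-of-thought prompts, an example) before the text is sent to the model. A minimal sketch of that substitution step is shown below; the placeholder is normalized to {source_code} so that Python's str.format can fill it, and send_to_model is an illustrative stub standing in for whichever model API is used.

```python
from pathlib import Path

# Listing 1, with the placeholder normalized to a valid str.format field name
BASELINE_PROMPT = (
    "You are a programming assistant. Your task is to provide a unit test "
    "code snippet to the source code below. Please import the source code "
    "as 'source_code' module in the code snippet.\n{source_code}"
)


def build_prompt(template: str, **parts: str) -> str:
    """Fill the {source_code} / {example} placeholders of a prompt listing."""
    return template.format(**parts)


def send_to_model(prompt: str) -> str:
    """Illustrative stub: replace with the API call of the chosen LLM."""
    raise NotImplementedError


source = Path("source_code.py").read_text()                 # program under test
prompt = build_prompt(BASELINE_PROMPT, source_code=source)
# generated_test = send_to_model(prompt)                    # returns the unit test snippet
```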
3. Results
3.1. Generative AI and Human-Written Code Evaluation
3.2. Similarity of GenAI and Human-Written Test Code
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Minaee, S.; Mikolov, T.; Nikzad, N.; Chenaghlu, M.; Socher, R.; Amatriain, X.; Gao, J. Large Language Models: A Survey. arXiv 2024, arXiv:2402.06196. [Google Scholar]
- Runeson, P. A survey of unit testing practices. IEEE Softw. 2006, 23, 22–29. [Google Scholar] [CrossRef]
- Olan, M. Unit testing: Test early, test often. J. Comput. Sci. Coll. 2003, 19, 319–328. [Google Scholar]
- McMinn, P. Search-Based Software Testing: Past, Present and Future. In Proceedings of the 2011 IEEE Fourth International Conference on Software Testing, Verification and Validation Workshops, Berlin, Germany, 21–25 March 2011; pp. 153–163. [Google Scholar] [CrossRef]
- Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
- Jiang, J.; Wang, F.; Shen, J.; Kim, S.; Kim, S. A Survey on Large Language Models for Code Generation. arXiv 2024, arXiv:2406.00515. [Google Scholar]
- Wang, J.; Huang, Y.; Chen, C.; Liu, Z.; Wang, S.; Wang, Q. Software Testing with Large Language Models: Survey, Landscape, and Vision. arXiv 2024, arXiv:2307.07221. [Google Scholar]
- Pizzorno, J.A.; Berger, E.D. CoverUp: Coverage-Guided LLM-Based Test Generation. arXiv 2024, arXiv:2403.16218. [Google Scholar]
- Ryan, G.; Jain, S.; Shang, M.; Wang, S.; Ma, X.; Ramanathan, M.K.; Ray, B. Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM. arXiv 2024, arXiv:2402.00097. [Google Scholar]
- Wong, M.F.; Tan, C.W. Aligning Crowd-Sourced Human Feedback for Reinforcement Learning on Code Generation by Large Language Models. arXiv 2025, arXiv:2503.15129. [Google Scholar]
- Chen, B.; Zhang, F.; Nguyen, A.; Zan, D.; Lin, Z.; Lou, J.G.; Chen, W. CodeT: Code Generation with Generated Tests. arXiv 2022, arXiv:2207.10397. [Google Scholar]
- Bi, Z.; Zhang, N.; Jiang, Y.; Deng, S.; Zheng, G.; Chen, H. When Do Program-of-Thoughts Work for Reasoning? arXiv 2023, arXiv:2308.15452. [Google Scholar]
- Astels, D. Test Driven Development: A Practical Guide; Prentice Hall Professional Technical Reference: Upper Saddle River, NJ, USA, 2003. [Google Scholar]
- Reese, J. Unit Testing Best Practices for .NET. Microsoft Learn. 2025. Available online: https://learn.microsoft.com/en-us/dotnet/core/testing/unit-testing-best-practices (accessed on 12 April 2025).
- Meta. Introducing Meta Llama 3: The Most Capable Openly Available LLM to Date. 2024. Available online: https://ai.meta.com/blog/meta-llama-3/ (accessed on 4 July 2024).
- Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar]
- Vantage. Llama 3 8B vs. Mistral 7B: Small LLM Pricing Considerations. 2024. Available online: https://www.vantage.sh/blog/best-small-llm-llama-3-8b-vs-mistral-7b-cost (accessed on 4 July 2024).
- Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku. Claude-3 Model Card. 2024. Available online: https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf (accessed on 6 April 2025).
- Ivanković, M.; Petrović, G.; Just, R.; Fraser, G. Code coverage at Google. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Tallinn, Estonia, 26–30 August 2019; pp. 955–963. [Google Scholar]
- Tang, Y.; Liu, Z.; Zhou, Z.; Luo, X. ChatGPT vs. SBST: A comparative assessment of unit test suite generation. IEEE Trans. Softw. Eng. 2024, 50, 1340–1359. [Google Scholar] [CrossRef]
- Chen, Z.; Monperrus, M. A literature study of embeddings on source code. arXiv 2019, arXiv:1904.03061. [Google Scholar]
- Kenter, T.; De Rijke, M. Short text similarity with word embeddings. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, Melbourne, Australia, 18–23 October 2015; pp. 1411–1420. [Google Scholar]
- Pouly, M. Estimating Text Similarity based on Semantic Concept Embeddings. arXiv 2024, arXiv:2401.04422. [Google Scholar]
- Steck, H.; Ekanadham, C.; Kallus, N. Is cosine-similarity of embeddings really about similarity? In Proceedings of the Companion Proceedings of the ACM on Web Conference 2024, Singapore, 13–17 May 2024; pp. 887–890. [Google Scholar]
- Chen, B.; Zhang, Z.; Langrené, N.; Zhu, S. Unleashing the potential of prompt engineering in Large Language Models: A comprehensive review. arXiv 2023, arXiv:2310.14735. [Google Scholar]
- Shanahan, M.; McDonell, K.; Reynolds, L. Role play with large language models. Nature 2023, 623, 493–498. [Google Scholar] [CrossRef]
- Logan IV, R.L.; Balažević, I.; Wallace, E.; Petroni, F.; Singh, S.; Riedel, S. Cutting down on prompts and parameters: Simple few-shot learning with language models. arXiv 2021, arXiv:2106.13353. [Google Scholar]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
- Beer, R.; Feix, A.; Guttzeit, T.; Muras, T.; Müller, V.; Rauscher, M.; Schäffler, F.; Löwe, W. Examination of Code generated by Large Language Models. arXiv 2024, arXiv:2408.16601. [Google Scholar]
- Wang, Y.; Guo, S.; Tan, C.W. From code generation to software testing: AI Copilot with context-based RAG. IEEE Softw. 2025, 42, 34–42. [Google Scholar] [CrossRef]



| Model | Difficulty | Execution Failure [%] | Code Coverage Mean [%] | Code Coverage Median [%] | Code Coverage SD [%] |
|---|---|---|---|---|---|
| Claude | Very Easy | 0.00 | 91.63 | 100.00 | 19.68 |
| Claude | Easy | 0.00 | 84.46 | 91.95 | 19.91 |
| Claude | Medium | 0.00 | 56.63 | 51.02 | 16.84 |
| Claude | Hard | 0.00 | 29.15 | 30.48 | 5.91 |
| Llama | Very Easy | 0.00 | 97.83 | 100.00 | 4.57 |
| Llama | Easy | 0.00 | 86.43 | 92.67 | 16.16 |
| Llama | Medium | 11.00 | 46.83 | 53.26 | 25.07 |
| Llama | Hard | 0.00 | 28.70 | 29.09 | 6.07 |
| Mistral | Very Easy | 5.00 | 87.42 | 100.00 | 28.05 |
| Mistral | Easy | 0.00 | 79.53 | 84.06 | 18.86 |
| Mistral | Medium | 0.00 | 66.34 | 60.00 | 20.41 |
| Mistral | Hard | 0.00 | 27.31 | 29.09 | 7.53 |
| Human | Very Easy | 0.00 | 96.04 | 100.00 | 9.61 |
| Human | Easy | 0.00 | 86.53 | 100.00 | 18.71 |
| Human | Medium | 0.00 | 81.97 | 79.59 | 11.96 |
| Human | Hard | 0.00 | 55.40 | 43.64 | 28.45 |
| Model | Difficulty | Execution Failure [%] | Code Coverage Mean [%] | Code Coverage Median [%] | Code Coverage SD [%] |
|---|---|---|---|---|---|
| Claude | Very Easy | 0.00 | 79.63 | 100.00 | 32.04 |
| Claude | Easy | 0.00 | 79.94 | 82.13 | 18.32 |
| Claude | Medium | 0.00 | 57.50 | 51.02 | 26.53 |
| Claude | Hard | 0.00 | 30.05 | 30.88 | 5.10 |
| Llama | Very Easy | 0.00 | 92.11 | 100.00 | 12.84 |
| Llama | Easy | 0.00 | 87.54 | 91.95 | 14.06 |
| Llama | Medium | 0.00 | 56.59 | 52.54 | 16.93 |
| Llama | Hard | 0.00 | 26.98 | 29.09 | 8.66 |
| Mistral | Very Easy | 5.00 | 82.05 | 100.00 | 32.04 |
| Mistral | Easy | 0.00 | 83.93 | 86.05 | 14.10 |
| Mistral | Medium | 11.11 | 40.51 | 46.01 | 29.93 |
| Mistral | Hard | 0.00 | 31.36 | 30.48 | 4.78 |
| Human | Very Easy | 0.00 | 96.04 | 100.00 | 9.61 |
| Human | Easy | 0.00 | 86.53 | 100.00 | 18.71 |
| Human | Medium | 0.00 | 81.97 | 79.59 | 11.96 |
| Human | Hard | 0.00 | 55.40 | 43.64 | 28.45 |
| Model | Difficulty | Execution Failure [%] | Code Coverage Mean [%] | Code Coverage Median [%] | Code Coverage SD [%] |
|---|---|---|---|---|---|
| Claude | Very Easy | 10.00 | 81.08 | 100.00 | 33.07 |
| Claude | Easy | 0.00 | 84.41 | 90.98 | 16.28 |
| Claude | Medium | 11.11 | 46.04 | 52.94 | 25.36 |
| Claude | Hard | 0.00 | 28.70 | 29.09 | 6.07 |
| Llama | Very Easy | 5.00 | 87.14 | 100.00 | 25.53 |
| Llama | Easy | 0.00 | 80.04 | 81.25 | 15.01 |
| Llama | Medium | 0.00 | 65.49 | 56.78 | 23.36 |
| Llama | Hard | 0.00 | 27.56 | 25.00 | 6.19 |
| Mistral | Very Easy | 0.00 | 84.93 | 100.00 | 23.78 |
| Mistral | Easy | 0.00 | 80.04 | 81.25 | 15.01 |
| Mistral | Medium | 0.00 | 53.33 | 50.85 | 22.83 |
| Mistral | Hard | 0.00 | 27.31 | 29.09 | 7.53 |
| Human | Very Easy | 0.00 | 96.04 | 100.00 | 9.61 |
| Human | Easy | 0.00 | 86.53 | 100.00 | 18.71 |
| Human | Medium | 0.00 | 81.97 | 79.59 | 11.96 |
| Human | Hard | 0.00 | 55.40 | 43.64 | 28.45 |
| Model | Difficulty | Execution Failure [%] | Code Coverage Mean [%] | Code Coverage Median [%] | Code Coverage SD [%] |
|---|---|---|---|---|---|
| Claude | Very Easy | 0.00 | 88.16 | 100.00 | 22.78 |
| Claude | Easy | 0.00 | 80.53 | 90.44 | 21.78 |
| Claude | Medium | 44.44 | 37.58 | 38.77 | 40.28 |
| Claude | Hard | 0.00 | 28.79 | 30.48 | 7.41 |
| Llama | Very Easy | 0.00 | 88.57 | 100.00 | 22.07 |
| Llama | Easy | 0.00 | 73.71 | 75.26 | 18.03 |
| Llama | Medium | 11.11 | 45.58 | 48.85 | 20.79 |
| Llama | Hard | 0.00 | 27.03 | 29.09 | 7.14 |
| Mistral | Very Easy | 0.00 | 89.35 | 100.00 | 18.55 |
| Mistral | Easy | 0.00 | 80.04 | 85.65 | 18.77 |
| Mistral | Medium | 22.22 | 39.22 | 38.78 | 30.17 |
| Mistral | Hard | 0.00 | 29.93 | 29.09 | 5.47 |
| Human | Very Easy | 0.00 | 96.04 | 100.00 | 9.61 |
| Human | Easy | 0.00 | 86.53 | 100.00 | 18.71 |
| Human | Medium | 0.00 | 81.97 | 79.59 | 11.96 |
| Human | Hard | 0.00 | 55.40 | 43.64 | 28.45 |
| Euclidean Distance, Cosine Similarity, Dot Product | Claude | Llama | Mistral | Human |
|---|---|---|---|---|
| Claude | | 8.40, 0.80, 150.80 | 8.50, 0.79, 148.64 | 8.80, 0.77, 136.41 |
| Llama | | | 8.19, 0.82, 156.49 | 9.33, 0.75, 139.83 |
| Mistral | | | | 9.07, 0.76, 138.47 |
| Human | | | | |
| Euclidean Distance, Cosine Similarity, Dot Product | Claude | Llama | Mistral | Human |
|---|---|---|---|---|
| Claude | | 8.57, 0.78, 141.60 | 8.99, 0.76, 139.15 | 9.00, 0.76, 133.62 |
| Llama | | | 8.30, 0.80, 147.58 | 8.32, 0.80, 143.36 |
| Mistral | | | | 8.65, 0.78, 142.82 |
| Human | | | | |
| Euclidean Distance, Cosine Similarity, Dot Product | Claude | Llama | Mistral | Human |
|---|---|---|---|---|
| Claude | | 9.48, 0.73, 129.88 | 10.34, 0.70, 126.98 | 9.30, 0.75, 131.36 |
| Llama | | | 8.99, 0.77, 145.90 | 9.23, 0.76, 138.43 |
| Mistral | | | | 9.44, 0.75, 139.26 |
| Human | | | | |
| Euclidean Distance, Cosine Similarity, Dot Product | Claude | Llama | Mistral | Human |
|---|---|---|---|---|
| Claude | | 7.75, 0.81, 146.01 | 8.81, 0.77, 141.69 | 9.48, 0.74, 133.83 |
| Llama | | | 9.14, 0.76, 138.38 | 9.30, 0.75, 134.56 |
| Mistral | | | | 9.44, 0.75, 138.40 |
| Human | | | | |
| Prompt | First Pair | Second Pair | p-Value |
|---|---|---|---|
| Baseline | Claude–Human | Llama–Human | 0.77 |
| Baseline | Claude–Human | Mistral–Human | <0.001 |
| Baseline | Llama–Human | Mistral–Human | <0.001 |
| Baseline | Claude–Llama | Claude–Mistral | <0.001 |
| Rich Instructions | Claude–Human | Llama–Human | <0.001 |
| Rich Instructions | Claude–Human | Mistral–Human | 0.16 |
| Rich Instructions | Llama–Human | Mistral–Human | <0.001 |
| Rich Instructions | Claude–Llama | Claude–Mistral | <0.001 |
| Example | Claude–Human | Llama–Human | 0.02 |
| Example | Claude–Human | Mistral–Human | 0.37 |
| Example | Llama–Human | Mistral–Human | 0.03 |
| Example | Claude–Llama | Claude–Mistral | 0.01 |
| Chain-of-Thought | Claude–Human | Llama–Human | 0.09 |
| Chain-of-Thought | Claude–Human | Mistral–Human | 0.01 |
| Chain-of-Thought | Llama–Human | Mistral–Human | <0.001 |
| Chain-of-Thought | Claude–Llama | Claude–Mistral | 0.25 |
| Prompt | Samples from | Value | p-Value |
|---|---|---|---|
| Baseline | Claude–Llama, Claude–Mistral, Llama–Mistral | 62.72 | 0.02 |
| Baseline | Claude–Human, Llama–Human, Mistral–Human | 76.15 | <0.001 |
| Baseline | All of the above | 84.11 | <0.001 |
| Rich Instructions | Claude–Llama, Claude–Mistral, Llama–Mistral | 68.86 | <0.001 |
| Rich Instructions | Claude–Human, Llama–Human, Mistral–Human | 52.04 | 0.22 |
| Rich Instructions | All of the above | 57.95 | 0.04 |
| Example | Claude–Llama, Claude–Mistral, Llama–Mistral | 33.34 | 0.12 |
| Example | Claude–Human, Llama–Human, Mistral–Human | 50.87 | 0.25 |
| Example | All of the above | 29.50 | 0.24 |
| Chain-of-Thought | Claude–Llama, Claude–Mistral, Llama–Mistral | 47.13 | 0.01 |
| Chain-of-Thought | Claude–Human, Llama–Human, Mistral–Human | 77.21 | <0.001 |
| Chain-of-Thought | All of the above | 44.39 | 0.03 |