1. Introduction
In recent years, artificial intelligence (AI) has significantly impacted software development, particularly through advancements in machine learning and natural language processing. For example, Large Language Models (LLMs) such as Codex [
1], Qwen, Sonnet, and GPT-4 [
2] effectively support code generation, auto-completion, and error detection. Austin et al. [
3] demonstrated that fine-tuning an LLM significantly improves its program synthesis capabilities, achieving high performance in synthesizing Python programs from natural language descriptions. An important study by Chen et al. [
1] evaluated Codex on the HumanEval dataset, showing that repeated sampling from the model could effectively enhance the functional correctness of the generated Python programs, although it still struggles with more complex coding tasks. Dedicated benchmarks have been developed to support the evaluation of LLMs on the most relevant software development tasks: HumanEval [1] and MBPP [3] for automatic code generation, and TestEval [4] for test case generation.
However, the application of AI, particularly LLMs, to software verification, and to dynamic analysis in particular, remains underexplored. Software verification encompasses a range of activities, including testing, in which the software is executed with diverse inputs to assess its behavior and confirm its correctness. The common methodologies employed in this process include unit testing, integration testing, system testing, and formal verification techniques, particularly for critical systems. The goal is to identify and fix defects early in the development lifecycle to ensure reliability and quality.
LLMs offer functionalities that enable software verification engineers to streamline their tasks across all the stages of the verification process, including requirement analysis through natural language processing, documenting test plans, generating test scenarios, automating tests, and debugging [
5]. Although this possibility exists, few have investigated the use of LLMs for automated test case generation, one of the most time-consuming stages of software verification. In a recent study, Zilberman and Cheng [
6] illustrated that, when both programs and their corresponding test cases are generated by LLMs, the resulting test suites frequently exhibit unreliability, thereby highlighting the significant limitations in their quality. Wang et al. [
4] introduced TESTEVAL, a comprehensive benchmark containing 210 Python programs designed to assess LLMs’ capability to generate test cases, particularly measuring their effectiveness in achieving targeted line, branch, and path coverage. Importantly, all these existing studies exclusively address Python as their target language. Differentiating our approach from the previous work, we explicitly focus on evaluating the effectiveness of automatically generated tests for programs written in C, emphasizing not only the line coverage but also the semantic correctness and practical relevance of the generated tests, thus addressing a significant gap left by the current methodologies.
In this context, this study investigates the efficacy of LLMs in generating program-level test cases, which are unit tests designed to verify the correctness of full C programs, addressing gaps in the current methodologies, enhancing software validation processes, and improving reliability in low-level programming environments. More specifically, we evaluate the top-ranked LLMs on the HumanEval benchmark to generate correct and diverse test cases that offer high line coverage for the input C programs.
Our contributions are as follows: (i) we compare the leading LLMs as of January 2025 for the task of generating program-level test cases for C programs, (ii) we evaluate these models based on test case correctness and line coverage, and (iii) we analyze the impact of contextual information, such as problem statements and sample solutions, on test generation performance.
The remainder of this paper is structured as follows:
Section 2 reviews the related work on LLM-driven software testing, highlighting the current limitations and emphasizing the research gap concerning the C programming language.
Section 3 starts with an introduction of the classical methods for test case generation and continues by detailing the methodology, tools, metrics, and datasets used in this study.
Section 4 outlines the experimental setup, discussing variations in the prompts and evaluation criteria.
Section 5 presents the results and analyses the effectiveness and coverage achieved by the LLM-generated test cases. Finally,
Section 6 discusses the implications, suggests areas for future improvement, and concludes with recommendations for advancing AI-assisted software validation.
2. Related Work
LLMs have rapidly emerged as influential tools in software testing, offering automated capabilities for test case generation, verification, and debugging. Their ability to accelerate various phases of the software testing lifecycle, from requirement analysis to post-deployment validation, has positioned them as valuable assets across a range of development environments.
Recent research has extensively examined the application of LLMs in high-level programming languages, with a particular focus on Python, Java, and JavaScript. For example, a recent study [
5] indicates that 48% of software professionals have already integrated LLMs into their testing workflows, encompassing processes from early-stage requirement analysis to bug fixing. Nevertheless, this widespread adoption is accompanied by concerns, as issues such as hallucinated outputs, test smells [7] (i.e., sub-optimal test design choices, such as overly complex or unreliable tests), and false positives have been frequently observed. These challenges have led to calls for the development of structured methodologies and a cautious approach to integration.
Various enhancements and frameworks have been proposed to address these concerns. MuTAP [
8] represents a mutation-based test generation methodology that surpasses zero-shot and few-shot approaches by reducing syntax errors and improving fault detection in Python code. Similarly, LangSym [
9] enhances code coverage and scalability through symbolic execution guided by LLM-generated test inputs. In practice, these are full black-box test cases that include command-line invocations, input parameters, and any necessary external files, which together define executable testing scenarios. Tools such as CodaMosa [
10] integrate LLMs with search-based software testing (SBST) to optimize test coverage, while TestChain [
11] decouples input/output generation to improve accuracy. However, all of these solutions remain predominantly tailored to high-level languages.
The C programming language remains one of the most widely used languages in the world, consistently ranking in the top three according to sources such as the TIOBE Index (
https://www.tiobe.com/tiobe-index/, accessed on 21 May 2025), due to its continued relevance in system programming, embedded development, and legacy software maintenance. Its widespread adoption and persistent presence in critical software systems justify a focused evaluation of the test case generation techniques in C.
Despite recent advancements, the application of LLMs to system programming, particularly in the C programming language, remains insufficiently explored and presents significant challenges. As demonstrated in [
12], LLMs tend to produce C code of noticeably lower quality compared to other languages, characterized by reduced acceptance rates—the percentages of generated code submissions that compile successfully and pass all the predefined test cases—along with issues in functional correctness, and increased complexity and security concerns. These findings indicate that LLMs encounter difficulties with the distinctive requirements of C, which include explicit memory management, pointer arithmetic, and a more stringent type system, meaning that C enforces strict rules about variable types and operations, requiring precise type declarations and conversions. Potential contributing factors may encompass biases in the training data, the complexities of C semantics, and LLM architectures that are more aligned with higher-level language patterns.
Moreover, [
6] provided a critical evaluation of LLM-generated test suites for Python programs, identifying frequent test smells and a high incidence of false-positive rates, which further questions their reliability and maintainability. Although recent advancements, such as the web-based tool discussed in [
13], demonstrate potential in integrating LLM-driven automation into testing pipelines, issues such as hallucinations and logic errors continue to affect their practical utility, particularly in safety- and performance-critical domains such as embedded C development.
Consequently, a notable gap persists in the existing literature: although LLMs have shown promise in automating test generation for high-level programming languages, their efficacy within the context of C programming remains limited and insufficiently explored. One contributing factor to this gap is the fundamental difference in data type systems between C and high-level languages such as Python. While Python employs dynamic typing and automatic memory management, C requires strict type declarations and manual control over memory, making the generation of valid and semantically meaningful test cases significantly more complex. To address this gap, our study specifically targets the challenges associated with automated test generation in C. We place particular emphasis on semantic correctness, security compliance, and practical applicability, which are elements that are frequently neglected in the existing LLM-driven approaches. Through this focus, we aim to reconcile the disparity between theoretical potential and practical utility in the application of LLMs to low-level, high-stakes software systems.
3. Tools and Methods
3.1. Approaches to Test Case Generation
Manual test case generation relies heavily on human expertise to design tests based on software specifications, expected behavior, and critical edge conditions. This approach ensures alignment with user priorities and functional requirements. One of the key black-box testing techniques used in this context is
Equivalence Class Partitioning (ECP), which aims to reduce testing redundancy by dividing the input and output domains into equivalence classes. Each class groups values that are expected to evoke similar behavior from the system under test, thereby allowing a representative value to validate the entire class sufficiently [
14].
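To make ECP concrete, the following sketch (using a hypothetical classify_score function, not taken from our dataset) selects one representative value per equivalence class instead of testing every possible input:

```c
#include <assert.h>

/* Hypothetical function under test: classify an exam score (0-100)
 * into fail (0-49), pass (50-89), or distinction (90-100);
 * any value outside 0-100 is invalid and returns -1. */
static int classify_score(int score) {
    if (score < 0 || score > 100) return -1;   /* invalid class  */
    if (score < 50)  return 0;                 /* fail class     */
    if (score < 90)  return 1;                 /* pass class     */
    return 2;                                  /* distinction    */
}

int main(void) {
    /* One representative value per equivalence class is enough
     * to exercise each distinct behavior of the function. */
    assert(classify_score(-5)  == -1);  /* invalid: below range */
    assert(classify_score(25)  ==  0);  /* fail                 */
    assert(classify_score(70)  ==  1);  /* pass                 */
    assert(classify_score(95)  ==  2);  /* distinction          */
    assert(classify_score(150) == -1);  /* invalid: above range */
    return 0;
}
```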
Complementary to ECP is
Boundary Value Analysis (BVA), a technique that focuses on the values at the boundaries of these equivalence partitions. Given the increased likelihood of faults at input extremes, BVA plays a crucial role in detecting errors caused by off-by-one mistakes or improper range handling. It can be applied through several strategies, including robust testing, worst-case testing, or advanced methods such as the divide-and-rule approach, which isolates dependencies among variables [
15].
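Continuing the same hypothetical example, a BVA-oriented test concentrates on the values at and immediately around each partition boundary:

```c
#include <assert.h>

/* Same hypothetical function as in the previous sketch. */
static int classify_score(int s) {
    if (s < 0 || s > 100) return -1;
    if (s < 50) return 0;
    if (s < 90) return 1;
    return 2;
}

int main(void) {
    /* Values at and immediately around every partition boundary,
     * where off-by-one faults are most likely to hide. */
    assert(classify_score(-1)  == -1);  /* just below the valid range */
    assert(classify_score(0)   ==  0);  /* lower bound of "fail"      */
    assert(classify_score(49)  ==  0);  /* upper bound of "fail"      */
    assert(classify_score(50)  ==  1);  /* lower bound of "pass"      */
    assert(classify_score(89)  ==  1);  /* upper bound of "pass"      */
    assert(classify_score(90)  ==  2);  /* lower bound of "distinction" */
    assert(classify_score(100) ==  2);  /* upper bound of the valid range */
    assert(classify_score(101) == -1);  /* just above the valid range */
    return 0;
}
```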
For systems with complex input conditions and corresponding actions,
Decision Table Testing offers a structured tabular approach that maps combinations of input conditions to actions. This technique is particularly effective for identifying inconsistencies, missing rules, and redundant logic in systems with high logical complexity. It is particularly useful when testing business rules (formal logic embedded in software that reflects real-world policies, constraints, or decision-making criteria) and rule-based systems, providing a complete set of test cases derived from the exhaustive enumeration of all the meaningful condition combinations [
16].
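A minimal sketch of this idea, assuming a hypothetical two-condition shipping rule, encodes the decision table directly in the test and enumerates all the condition combinations:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical business rule: an order ships for free only if the
 * customer is a member AND the order total is at least 50; members
 * otherwise get discounted shipping, non-members pay full shipping.
 * Actions: 0 = full shipping, 1 = discounted, 2 = free. */
static int shipping_action(bool is_member, bool total_ge_50) {
    if (is_member && total_ge_50) return 2;
    if (is_member)                return 1;
    return 0;
}

int main(void) {
    /* Decision table: one row per rule, covering every combination
     * of the two conditions (member, total >= 50) -> expected action. */
    struct { bool member, big; int expected; } table[] = {
        { false, false, 0 },
        { false, true,  0 },
        { true,  false, 1 },
        { true,  true,  2 },
    };
    for (unsigned i = 0; i < sizeof table / sizeof table[0]; i++)
        assert(shipping_action(table[i].member, table[i].big)
               == table[i].expected);
    return 0;
}
```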
Furthermore, in systems characterized by distinct operational states and transitions,
State Transition Testing proved to be effective. This method evaluates how a system changes state in response to inputs or events, making it particularly suitable for reactive and event-driven systems. Tools such as the
STATETest framework facilitate automated test case generation from state machine models, ensuring comprehensive coverage of valid and invalid transitions while also offering measurable coverage metrics [
17].
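As a compact illustration, the following sketch models a hypothetical two-state turnstile and checks both the valid and the invalid transitions:

```c
#include <assert.h>

/* Hypothetical two-state turnstile: LOCKED -(coin)-> UNLOCKED,
 * UNLOCKED -(push)-> LOCKED; any other event leaves the state
 * unchanged (an invalid transition that the tests must also cover). */
typedef enum { LOCKED, UNLOCKED } state_t;
typedef enum { COIN, PUSH } event_t;

static state_t next_state(state_t s, event_t e) {
    if (s == LOCKED   && e == COIN) return UNLOCKED;
    if (s == UNLOCKED && e == PUSH) return LOCKED;
    return s; /* invalid transition: stay in the current state */
}

int main(void) {
    /* Valid transitions. */
    assert(next_state(LOCKED,   COIN) == UNLOCKED);
    assert(next_state(UNLOCKED, PUSH) == LOCKED);
    /* Invalid transitions must not change the state. */
    assert(next_state(LOCKED,   PUSH) == LOCKED);
    assert(next_state(UNLOCKED, COIN) == UNLOCKED);
    return 0;
}
```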
Together, these techniques form a robust foundation for functional test design by maximizing test coverage, improving defect detection, and reducing redundant effort through systematic abstraction and modeling.
Automated methods leverage tools and algorithms to efficiently generate test cases, covering a wide range of inputs with minimal human intervention. These techniques improve scalability, detect edge cases, and enhance reliability. The common approaches include
Symbolic Execution (e.g., KLEE in [
18]), which analyzes possible execution paths by treating program variables as symbolic values and generating input values to explore diverse scenarios.
Bounded Model Checking (e.g., CBMC in [
19]) systematically explores program states up to a certain depth, identifying logical inconsistencies, assertion violations, and potential errors.
Fuzz Testing (e.g., AFL, libFuzzer in [
20]) randomly mutates inputs and injects unexpected or malformed data into programs to uncover security vulnerabilities, crashes, and undefined behaviors.
While traditional approaches such as symbolic execution, model checking, and fuzzing are effective at achieving deep code coverage and uncovering low-level vulnerabilities, they often require access to the full source code, complex instrumentation, and significant computational resources. In contrast, LLM-based test case generation operates at a higher level of abstraction: it can generate both inputs and expected outputs directly from problem descriptions without relying on execution traces or instrumentation.
This makes LLMs particularly valuable in early-stage development, educational settings, or scenarios where the source code is incomplete or unavailable. However, unlike symbolic execution tools such as KLEE, LLMs do not guarantee path coverage, and their correctness depends heavily on the training data and prompt design. Although fuzzers such as AFL can discover edge-case crashes through randomization, LLMs rely on learned patterns and may overlook rare or adversarial inputs. As such, LLM-generated tests are complementary to the traditional methods, offering flexibility and semantic reasoning where instrumentation-heavy tools may fall short, but they lack the exhaustive precision that formal techniques can provide. These distinctions highlight the complementary nature of LLM-based and traditional test generation approaches. To provide a clearer overview,
Table 1 summarizes the main differences between these techniques across the key dimensions.
Mutation testing evaluates the effectiveness of existing test cases by introducing small controlled modifications (mutants) to the program code. If the test cases fail to detect these modifications, they indicate gaps in the test coverage. The key aspects include generating mutants by altering the operators, statements, or expressions in the code, executing test cases to check whether they distinguish the original code from its mutants, and identifying weaknesses in the test suite, improving the test case robustness.
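To make this concrete, a typical mutant changes a single operator; in the hypothetical example below, a test suite that never exercises the boundary value cannot kill the mutant, revealing a gap in the tests:

```c
#include <assert.h>

/* Original code. */
static int is_adult(int age)        { return age >= 18; }

/* Mutant: the relational operator >= is replaced by >. */
static int is_adult_mutant(int age) { return age > 18; }

int main(void) {
    /* A weak test suite that skips the boundary value 18 gets the same
     * result from both versions, so the mutant survives and exposes a
     * gap in the test coverage. */
    assert(is_adult(30) == is_adult_mutant(30));
    assert(is_adult(10) == is_adult_mutant(10));

    /* Adding the boundary case kills the mutant: the two versions
     * now disagree, proving that the test distinguishes them. */
    assert(is_adult(18) == 1);
    assert(is_adult_mutant(18) == 0);
    return 0;
}
```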
Moreover, the use of AI techniques in other high-risk domains further reinforces the feasibility of applying similar approaches to test generation for C. For instance, AI-driven frameworks have been successfully deployed to control complex industrial systems such as coal and gas outburst accidents [
21] using text mining and knowledge graphs to identify causal factors and formulate control strategies. Similar to embedded C applications, these systems require high levels of safety, reliability, and explainability.
In a broader safety engineering context, AI has also been studied in collaboration with human operators to enhance situational awareness and support decision-making in aviation logistics systems [
22]. These findings support the perspective that AI-based tools for C test case generation can meaningfully contribute to software reliability, particularly when integrated with human-in-the-loop validation strategies.
Test Case Generation in C
The C programming language plays a pivotal role in embedded systems, operating systems, and safety-critical applications, where even minor software faults can result in catastrophic failures. Unlike high-level environments such as web applications developed in Python 3 or JavaScript ES2022, where a runtime error may only impact user experience, bugs in C can lead to memory corruption, system crashes, or even hardware malfunction. This distinction is critical in domains such as automotive software, avionics, medical devices, and industrial control systems, where correctness and reliability are not negotiable.
However, C also presents unique challenges for automated test generation due to its low-level features, such as direct memory access, pointer arithmetic, lack of runtime safety, and undefined behaviors. These intricacies make C significantly more difficult to test in an automated fashion than in managed languages.
To address these challenges, specialized tools have been developed. Csmith [
23] generates random semantically valid C programs to test compilers and reveal hidden miscompilation bugs, while Ocelot [
24] uses search-based algorithms tailored for C to produce meaningful test inputs. These tools not only validate compiler behavior but also help developers to uncover memory safety violations and logic faults early in the development lifecycle.
The recent literature further emphasizes the trend toward combining symbolic analysis, evolutionary strategies, and AI to generate tests that target realistic failure modes in C programs [
25,
26,
27]. As such, advancing automated test generation techniques specifically for C constitutes a substantial contribution to both research and industry, offering a practical pathway toward safer and more reliable low-level software systems.
Generating effective test cases for C programs is essential to validate that an application performs as intended across various scenarios. By utilizing both manual and automated approaches to test case generation, developers can enhance program reliability and ensure that it meets user expectations. C programs require various types of test cases to ensure reliability and correctness:
- Unit test cases focus on testing individual functions or modules in isolation, using frameworks such as Google Test, CUnit, and Check. For example, a factorial function should be tested with inputs such as 0, 1, and a large number to validate correctness (see the sketch after this list).
- Boundary test cases check input limits by testing extreme values. For a function accepting integers between 1 and 100, inputs such as 0, 1, 100, and 101 should be tested to verify the proper rejection of out-of-range values.
- Positive and negative test cases differentiate between valid inputs, which confirm expected functionality, and invalid inputs, which ensure proper error handling, such as testing a file-opening function with both a valid and a nonexistent file path.
- Edge case testing examines unusual conditions, such as testing a sorting function with an empty array, a single-element array, and an already sorted array.
- Integration test cases ensure that multiple modules work together correctly, such as testing a database connectivity module along with a data-processing function.
- Performance test cases evaluate efficiency under high loads, such as analyzing the time complexity of a sorting algorithm with increasing array sizes.
- Security test cases check for vulnerabilities such as buffer overflows or command injection using tools such as Valgrind or AddressSanitizer.
- Regression testing helps to detect new bugs introduced by code modifications by running automated test suites after updates to maintain stability.
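As a minimal illustration of the first category, the factorial example mentioned above can be covered by a plain assert-based unit test; in practice, a framework such as CUnit or Check would add fixtures and reporting, but the sketch below keeps only the essential checks:

```c
#include <assert.h>

/* Hypothetical function under test: iterative factorial for small n. */
static unsigned long long factorial(unsigned int n) {
    unsigned long long result = 1ULL;
    for (unsigned int i = 2; i <= n; i++)
        result *= i;
    return result;
}

int main(void) {
    assert(factorial(0)  == 1ULL);                   /* base case        */
    assert(factorial(1)  == 1ULL);                   /* smallest nonzero */
    assert(factorial(5)  == 120ULL);                 /* typical value    */
    assert(factorial(20) == 2432902008176640000ULL); /* large input that
                                                        still fits in 64 bits */
    return 0;
}
```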
Nonetheless, the existing tools present notable limitations as they typically require access to the program’s source code and are often constrained by language-specific dependencies. They are generally incapable of generating test cases solely from natural language problem statements. Therefore, advancing software verification would benefit from approaches that enable the automatic generation of test cases for C programs directly from textual specifications. Such capabilities would facilitate early-stage validation, particularly in educational and assessment contexts, while also reducing reliance on language-specific infrastructure.
3.2. Selected LLMs for Comparison
LLMs, such as GPT, LLaMA, Sonnet, Nova, and Qwen, have the potential to substantially enhance test case generation for C programs by automating and optimizing various aspects of software testing. These models leverage their extensive training in code, software testing principles, and program analysis techniques to assist in the generation, optimization, and analysis of test cases. Their roles can be categorized into several key areas, such as automated generation of unit test cases [
28], enhancing fuzz [
29] testing, symbolic execution assistance [
30], mutation testing automation [
31], natural-language-based test case generation [
32], and code coverage analysis and optimization [
33].
For the selection of models used in our comparison, we adopted an approach grounded in both the specialized literature and the up-to-date rankings (as of late December 2024) provided by the
EvalPlus leaderboard (
https://evalplus.github.io/, accessed on 18 October 2024), which evaluates models on an enhanced version of the HumanEval benchmark. EvalPlus extends the original HumanEval dataset by incorporating additional test cases to improve the reliability of the
Pass@K metrics. In this context, we identified and selected the top five models ranked on the EvalPlus leaderboard, taking into account both the diversity of providers and the architectural differences among the models. This selection enables us to assess the performance of code generation algorithms in an objective and balanced manner without favoring a particular developer or architectural family. The selected models represent a cross-section of the most advanced LLMs available as of late 2024. A brief overview of each model is presented below:
GPT-4 Turbo (OpenAI) is a proprietary model from the GPT family, known for high-quality code generation, broad instruction tuning, and extensive exposure to programming languages during pretraining. It uses a decoder-only transformer architecture and supports long context windows, making it particularly effective for understanding and reasoning about complex prompts.
Claude 2.1 (Anthropic), evaluated via the Sonnet interface, is another proprietary model focused on safe and interpretable responses. It is trained with constitutional AI principles and performs competitively in code generation, particularly in producing logically structured and well-annotated outputs. Claude models also exhibit strong semantic alignment, which is beneficial for understanding specification-like inputs.
LLaMA 3 (Meta) is an open-weight decoder-only transformer model, trained on a multilingual and code-rich corpus. Although smaller in scale than GPT-4, it shows remarkable reasoning ability and competitive performance in code-related benchmarks. Its openness also allows community-driven fine-tuning and reproducibility.
Nova (Amazon) is a proprietary model accessible via the Amazon Bedrock platform. Although detailed architectural documentation is not publicly available, Nova is optimized for general-purpose use cases, including code generation, summarization, and question answering. Nova provides insights into the performance of LLMs from major cloud providers, which are not typically benchmarked in academic datasets.
Qwen-2 (Alibaba) is an open-weight multilingual model trained extensively on programming and natural language corpora. It performs well in reasoning-heavy tasks and code synthesis, particularly in structured domains such as data manipulation and problem solving.
The purpose of this analysis is to compare the accuracy of the models on two relevant benchmarks for code generation:
HumanEval (
https://github.com/openai/human-eval, accessed on 5 October 2024), which tests the models’ ability to solve programming problems in a way similar to human evaluation, and
MBPP (Mostly Basic Python Programming) (
https://github.com/google-research/google-research/tree/master/mbpp, accessed on 5 October 2024), a benchmark focused on tasks of varying difficulty in Python programming. This enables us to determine the extent to which the selected models can generalize across different problem types and whether their performance remains consistent across multiple datasets and programming languages.
Table 2 includes the five selected models along with their reported accuracy on the HumanEval and MBPP benchmarks. This table provides a clear comparative perspective on the models’ performance and serves as a reference point for further analysis of their capabilities. In the following sections, we test these models on our dataset with the goal of generating test cases, allowing us to assess their ability to produce relevant and diverse test scenarios for evaluating code correctness and robustness.
3.3. Comparative Analysis Framework
3.3.1. Proposed Datasets
The dataset used for evaluation in this study comprises 40 introductory-level programming problems, the majority of which have been specifically designed and developed by us to ensure originality and minimize overlap with the existing datasets used for training LLMs. This approach ensures an objective evaluation of the model’s ability to solve problems without external influence from pre-existing training datasets. Each problem is presented with a clearly formulated statement, a correct solution implemented in both C and Python, and a set of 8 to 12 relevant test cases designed to verify the correctness of the proposed solutions.
The problems are designed for an introductory C programming course and vary in difficulty within the scope of beginner-level concepts, covering a broad spectrum of fundamental programming topics. These include working with numerical data types, using arithmetic and bitwise operators, formulating and evaluating conditional expressions, and implementing repetitive structures. Additionally, the dataset includes exercises that involve handling one-dimensional and multi-dimensional arrays, managing character strings, and applying specific functions for their manipulation. Furthermore, it contains problems requiring the use of functions, including recursive ones, and the utilization of complex data collections such as structs, unions, and enumerations. The topic of pointers is also covered as it is an essential concept in low-level programming languages.
For each problem, the task involves reading input data and displaying results on the standard output stream, reinforcing knowledge of fundamental input and output mechanisms in programming. The programs from the dataset also include the procedures for reading and writing files, ensuring the practical relevance of the generated tests. As a result, this dataset serves not only as a suitable environment for testing and training artificial intelligence models but also as a valuable tool for learning and deepening the understanding of essential programming principles.
Table 3 provides a breakdown of the problems by topic, along with the average difficulty level assigned to each category. This illustrates not only the diversity regarding the concepts and techniques covered by the dataset but also ensures that model performance is evaluated across varying levels of complexity within the C programming language.
3.3.2. Evaluation Metrics
We selected Pass@1 and line coverage (via LCOV) as the primary evaluation metrics due to their interpretability and widespread use in code generation benchmarks. Pass@1 directly reflects the correctness of generated test cases against a reference implementation, while line coverage quantifies the breadth of code exercised during execution.
Although these metrics capture essential aspects of test case effectiveness, we acknowledge that additional metrics, such as branch coverage, mutation score, or path coverage, could offer deeper insights into structural completeness and fault detection capabilities. Incorporating such metrics in future work could provide a more comprehensive evaluation of test quality, particularly in safety-critical or security-sensitive C programs.
The Pass@K metric is widely used to evaluate the accuracy of code generation models by measuring the percentage of correct solutions within a set of K generated samples. It aims to analyze how many of the model’s proposed solutions are valid after multiple attempts. The formula for calculating Pass@K is
\[ \text{Pass@}K = \mathbb{E}_{\text{problems}}\left[ 1 - \frac{\binom{n-c}{K}}{\binom{n}{K}} \right], \]
where n is the total number of samples generated per problem, c is the number of correct samples, and the expectation is taken over all the problems in the benchmark.
Pass@1 indicates whether the first generated solution is correct. Pass@10 and Pass@100 are extended versions, useful in large-scale evaluations where models generate multiple code samples to increase the chances of finding a valid solution. This is essential in scenarios requiring diverse solutions [
34,
35]. In this article, the
Pass@K metric was used to evaluate the number of functionally correct test cases proposed by the LLMs. A test case is considered functionally correct if, when the reference C solution is executed on the input proposed by the model, its output matches the expected output generated by the model for that same input.
Line coverage is a code coverage metric that measures the percentage of lines executed in a program during a test run. This helps to assess the extent to which the test cases exercise the code. The formula for calculating line coverage is
\[ \text{Line Coverage} = \frac{\text{Number of executable lines executed by the tests}}{\text{Total number of executable lines}} \times 100\%. \]
A high line coverage percentage indicates that most of the code is executed by the test suite, reducing the chances of untested and potentially buggy code. LCOV is a graphical front-end for gcov, which is a code coverage tool used with GNU Compiler Collection (GCC). LCOV generates easy-to-read HTML reports for visualizing code coverage. We extracted the line coverage (%) from the report generated by LCOV.
4. Experimental Setup
The experimental setup follows the structured workflow illustrated in
Figure 1, where an LLM is used to generate unit tests for programming problems. The process begins with a system prompt that establishes the task and its context. A corresponding user prompt is then constructed using information from the Programming Problems Database, which includes problem statements and verified ground truth solutions.
The LLM generates candidate test cases, which are subsequently validated by executing the reference solution for each problem on the generated inputs. If the output matches the expected behavior defined by the problem specification, the test case is considered valid. Otherwise, the test case is discarded. Finally, we compute the evaluation metrics as defined in the previous section and generate summary reports.
System Prompt
To maintain consistency and enforce a structured output format for easy automation, we designed the system prompt shown in Listing 1:
Listing 1. System prompt example.
The system prompt was carefully designed to enforce a standardized format for input and output files, enabling automated validation of generated test cases. Specifically, it required the generation of exactly 10 test cases per problem, systematically covering typical scenarios, edge cases, and boundary conditions. Each test case was written to separate files—inputX.txt and outputX.txt—with a consistent structure: each file begins with a labeled line (Input X: …/Output X: …) and ends with a marker line (Final_Input X/Final_Output X). This formatting ensures compatibility with automated scripts and facilitates framework scalability while maintaining consistency and reproducibility during testing.
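To illustrate the validation step described above, the sketch below runs a reference binary on each generated input file and compares its output with the corresponding expected-output file. The ./solution binary name, the plain line-by-line comparison, and the omission of the Input X:/Final_Input X marker handling are simplifying assumptions for illustration, not the exact scripts used in our pipeline:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Run the reference solution on one generated input file and compare
 * its output, line by line, with the LLM-generated expected output. */
static int validate_test_case(int x) {
    char cmd[256], expected_path[64];
    snprintf(cmd, sizeof cmd, "./solution < input%d.txt", x);
    snprintf(expected_path, sizeof expected_path, "output%d.txt", x);

    FILE *actual = popen(cmd, "r");         /* POSIX popen */
    FILE *expected = fopen(expected_path, "r");
    if (!actual || !expected) {
        if (actual) pclose(actual);
        if (expected) fclose(expected);
        return -1;
    }

    char a[1024], e[1024];
    int valid = 1;
    while (fgets(e, sizeof e, expected)) {
        if (!fgets(a, sizeof a, actual) || strcmp(a, e) != 0) {
            valid = 0;                      /* mismatch: discard this test case */
            break;
        }
    }
    pclose(actual);
    fclose(expected);
    return valid;
}

int main(void) {
    for (int x = 1; x <= 10; x++)
        printf("test case %d: %s\n", x,
               validate_test_case(x) == 1 ? "valid" : "discarded");
    return 0;
}
```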
To ensure that the prompt design effectively guided the models toward generating valid and comprehensive test cases, we conducted a two-step verification process. First, we manually analyzed the outputs from a subset of problems for each prompt variant, checking whether the format was respected and whether the test cases exhibited coverage of normal, edge, and boundary conditions. Second, we compared prompt variants quantitatively across all 40 problems using the Pass@1 and line coverage metrics. This allowed us to assess whether additional context (e.g., solution code and examples) consistently improved test quality.
The observed improvements, particularly with the
stmt+C_sol+3tests configuration, suggest that well-structured and context-rich prompts significantly enhance the ability of LLMs to reason about expected behavior.
Appendix A provides a complete prompt example that can serve as a guideline for practitioners aiming to replicate or extend this method.
User Prompt Variations
Throughout the experiments, the user prompt varied to analyze how different types of input affected the LLM’s ability to generate meaningful test cases. These variations are listed in
Table 4. The test cases in the 3tests variants contained real values for the specific problem statement, while the stmt+py_sol variant included the Python solution, allowing C and Python performance to be compared, with evaluation using Python.
By systematically varying the user prompt, we assessed the impact of different levels of information on the quality and relevance of generated test cases. A full example of a system and user prompt used in the
stmt+C_sol+3tests experiment is provided in
Appendix A. The five best-performing models, as listed in
Table 2, were used in these experiments. To assess the effectiveness of the generated test cases, we employed two key evaluation metrics:
Pass@1—Measures the percentage of correct solutions that pass all generated test cases on the first attempt.
Line Coverage—Evaluates how well the generated test cases cover the solution’s lines of code.
5. Results and Discussion
Each experiment ID corresponds to a distinct series of tests in which the input provided to the LLM was systematically varied. The specific configurations that were tested are listed in
Table 4.
For each configuration, the generated test cases were executed to determine whether the program output exactly matched the expected output produced by the LLM for the given input. The performance was evaluated using the Pass@K metric, where K = 1, as only a single attempt per test case was permitted.
5.1. Experimental Results
When comparing the different experiment IDs, several key observations emerged (
Table 5). As anticipated, models such as
Llama 3.3 and
Claude-3.5 Sonnet demonstrated significantly improved performance when the solution was included (exp
stmt+C_sol) rather than when only the problem statement was provided (exp
stmt). This suggests that these models benefited from direct exposure to the C code solution rather than just the problem statement. Furthermore, when comparing the inclusion of solutions in Python versus C,
GPT-4o Preview showed noticeable improvement with the Python solutions, indicating that it may be better aligned with Python-based reasoning and code generation.
A more detailed comparison between the experiments where only the problem statement was given, versus where both the problem statement and solution in C were provided, highlights a strong advantage in providing the solution as well. GPT-4o Preview demonstrated a substantial improvement in accuracy when both elements were included. Similarly, when evaluating whether adding three example test cases alongside the solution in C made a difference, it was evident that the test case examples played a crucial role. The inclusion of three test cases led to significantly improved results for GPT-4o, Llama 3.3, and Claude-3.5 Sonnet. However, this trend was not observed for Amazon Nova and Qwen2.5, indicating that these models did not leverage example test cases as effectively.
For the three best-performing experiments (
stmt+C_sol,
C_sol+3tests, and
stmt+C_sol+3tests),
Table 6 presents the line coverage (%) results, which measure the percentage of code covered by the generated test cases.
Claude-3.5 Sonnet achieved the highest code coverage overall, obtaining a line coverage score of 99.2% for exp
stmt+C_sol. It maintained strong results, with 98.1% coverage for both exp
C_sol+3tests and
stmt+C_sol+3tests, highlighting its robustness and versatility in generating diverse and comprehensive test cases.
GPT-4o Preview also demonstrated notable performance, achieving 98.7% code coverage in the
stmt+C_sol experiment and maintaining similarly high line coverage values across exp
C_sol+3tests and
stmt+C_sol+3tests, indicating consistent and reliable test generation capabilities.
Llama 3.3 exhibited stable and consistent coverage across the experiments, with line coverage values ranging from 96.1% to 98.4%, reflecting dependable performance. Similarly,
Qwen2.5 showed a stable performance, achieving coverage scores consistently between 97% and 98%. Conversely,
Amazon Nova Pro had the lowest recorded line coverage for exp
stmt+C_sol at 95%, but it improved significantly to match the other top-performing models in exp
C_sol+3tests and
stmt+C_sol+3tests, with the coverage increasing to 98.1%. These detailed findings underscore
Claude-3.5 Sonnet’s superior capacity for generating thorough and diverse test cases while also highlighting the competitive strengths and areas for improvement among the other evaluated LLMs.
5.2. Strengths and Weaknesses of LLMs in C Test Case Generation
While LLMs such as GPT-4o Preview have demonstrated impressive capabilities in generating test cases for C programs—with a line coverage of 98.7% when provided with both the problem statement and its corresponding C implementation—certain inherent limitations persist. These models generally excel in problems involving simple control flows, standard data manipulation, or repetitive logic patterns. However, test case generation becomes significantly more error-prone in scenarios involving floating-point precision, edge-case mathematical computations, or structurally complex input constraints.
To better understand the specific areas where LLMs struggle, we manually analyzed the cases in which the generated test cases failed to match the expected behavior despite being syntactically valid.
Table 7 summarizes the most common problem types that lead to such failures, along with brief descriptions of the difficulties encountered.
These high-level categories are further illustrated in the following two case studies, which expose recurring failure patterns in the test case generation for numerical and logical problems:
Problem 5 requires computing a mathematical expression involving floating-point operations and displaying the result in scientific notation using double-precision arithmetic.
Appendix A provides a full description of Problem 5.
Even top-tier models such as GPT-4o Preview and Claude-3.5 Sonnet failed to consistently produce all ten correct test cases. The common issues included small numerical discrepancies, such as predicting 6.347340e+00 instead of the expected 6.347558e+00. These errors stem from the fact that LLMs rely on learned approximations of floating-point behavior rather than executing actual mathematical functions such as exp() and cos() from math.h.
An additional source of failure was the misunderstanding of angle units: some models assumed degrees instead of radians. Including clarifying comments in the prompt (e.g., “// x is in radians”) improved the results, increasing the test pass rates from 0 to 6 out of 10 for LLaMA 3.
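The underlying difficulty is that the expected output can only be obtained by actually evaluating the expression in double precision. The snippet below uses a made-up expression of the same flavor (exp() and cos() with x in radians, printed with %e), not the exact formula of Problem 5:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical expression in the style of Problem 5:
     * x is in radians, and the result must be printed in
     * scientific notation with double precision. */
    double x = 1.25;
    double result = exp(x) * cos(x) + sqrt(x);

    /* %e prints scientific notation; only actual evaluation of exp()
     * and cos() from math.h (link with -lm) yields every digit that a
     * strict comparison against the reference output requires. An LLM
     * that "predicts" the value typically gets only the leading digits
     * right, as in the 6.347340e+00 vs. 6.347558e+00 example above. */
    printf("%e\n", result);
    return 0;
}
```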
Problem 9 involves reading four integers,
nRows,
nCols,
row, and
col, and computing an index expression from them that includes a +1 offset.
Models frequently fail to correctly interpret this +1 offset, sometimes simplifying it or misapplying the logic. Even with the full problem description and correct solution, the absence of example test cases resulted in consistent underperformance across most of the models, suggesting difficulty with indexing logic and expression order in nested arithmetic.
These two case studies illustrate specific domains where LLMs are less effective:
Numeric computations involving floating-point behavior and math library functions.
Logical reasoning involving offsets, indexing, or nested arithmetic operations.
Although advanced models still offer strong baseline performance in terms of coverage and structure, these examples expose their brittleness in scenarios that require more than learned statistical patterns. In this context, LLMs can benefit from hybridization with symbolic or analytical tools.
6. Conclusions
This study investigated the effectiveness of LLMs in generating program-level test cases for C programs, specifically focusing on unit tests. Unlike the previous research, which has primarily concentrated on Python, our approach focused on C. We highlighted not only the line coverage but also the semantic accuracy and practical applicability of the generated tests. By evaluating the top-ranked LLMs from the HumanEval benchmark, we addressed significant gaps in the existing methodologies and contributed to the advancement of software validation in low-level programming environments.
As expected, the results indicate that LLMs perform significantly better when provided with both the problem statement and its solution in C rather than the problem statement alone. Including example test cases in the input further enhances the performance, although the degree of improvement varies across different models. While models such as GPT-4o Preview, Llama 3.3, and Claude-3.5 Sonnet demonstrated substantial gains when given additional test cases, Amazon Nova and Qwen2.5 did not show notable benefits from this added context.
Claude-3.5 Sonnet and GPT-4o Preview emerged as the strongest performers, achieving both high Pass@1 scores and line coverage, demonstrating their ability to generate precise and effective test cases. On the other hand, Qwen2.5 and Amazon Nova Pro consistently underperformed, suggesting that they are less sensitive to additional contexts in the form of solutions or example test cases.
Overall, the findings highlight the importance of providing structured input to LLMs for generating effective test cases. To avoid potential data contamination and ensure a fair evaluation, we used original problems and C code that was not publicly available, thereby reducing the likelihood that these examples were part of the models’ training data. Under these controlled conditions, we specifically analyzed whether starting from a correct solution enables an LLM to generate meaningful test cases that focus solely on verification. By comparing outputs under different input configurations (problem statement alone vs. statement plus code), our study contributes to a clearer understanding of how contextual information influences test generation quality. This approach offers practical insights into leveraging LLMs more effectively in real-world software validation workflows, particularly in low-level programming environments where correctness is critical.
In addition, we analyzed two problem types in depth that expose the specific limitations of LLMs in test case generation for C programs. The first involves floating-point computations using math library functions, where models often fail to produce precise outputs due to their reliance on approximations rather than actual numerical evaluation. The second concerns logical reasoning tasks involving offsets, indexing, or nested arithmetic expressions, where misinterpretation of subtle operations as off-by-one logic was commonly observed. The findings highlight that, although advanced models generally achieve high coverage, they exhibit high sensitivity to contexts that demand precise numerical reasoning or structural logic. The inclusion of additional contextual information in the prompt enhanced the performance, suggesting that future improvements may be realized by integrating LLMs with symbolic or analytical tools.
Given these findings, several research directions are proposed to further enhance the effectiveness of LLMs in test case generation.
Future research could explore a broader and more diverse set of C programs, including those involving pointer arithmetic, file I/O, and memory management, which are particularly challenging for LLMs. A promising direction is the creation of domain-specific datasets and benchmarks tailored for test case generation in C. These should include not only the program code but also structured prompts, expected test behaviors, and evaluation criteria, which would help to standardize and guide LLM-based evaluation.
Furthermore, the systematic design of prompts, including the inclusion of program context, function specifications, and I/O formats, should be studied in greater depth as it significantly affects model performance. Fine-tuning open-weight models (e.g., LLaMA and Qwen) on curated C-specific corpora can improve their adaptability and output correctness.
To enhance reliability, future work may also integrate feedback from static or dynamic analysis tools (e.g., CBMC and Valgrind) for the automatic validation of generated tests. Finally, hybrid approaches combining LLM-generated inputs with traditional techniques such as symbolic execution (KLEE) or fuzzing (AFL) could offer complementary benefits, resulting in more robust and semantically rich test suites.