We evaluate the effectiveness and usefulness of our approach by answering the following three research questions:
RQ1: How does the quality of the seeds generated by our approach compare with that of existing methods?
RQ2: How much do the key components of our approach contribute to its effectiveness?
RQ3: Can our approach detect real-world bugs in DL libraries?
4.2. Baseline
We select both traditional representative approaches and recent popular LLM-based methods for comparison in DL library testing. The selected works are as follows.
FreeFuzz extracts parameter information from open-source code to construct a comprehensive parameter value space, then applies mutation strategies to enable fully automated testing of DL libraries. They conducted experiments on PyTorch and TensorFlow.
DeepREL automatically infers APIs within a DL library that share similar input–output relationships and leverages known APIs’ test inputs to validate and fuzz-test these inferred functions. They conducted experiments on PyTorch and TensorFlow.
Muffin generates a diverse set of DL models and performs cross-library differential testing during the training phase, thereby exhaustively exploring library code and uncovering additional defects. They conducted experiments on TensorFlow, CNTK, and Theano.
TitanFuzz is the first approach to employ LLMs to produce valid, complex DL program inputs, achieving zero-shot automated fuzz testing of DL libraries. They conducted experiments on PyTorch and TensorFlow.
FuzzGPT combines historical bug-triggering code with LLM fine-tuning and in-context learning to automatically generate edge-case samples. They also conducted experiments on PyTorch and TensorFlow.
Our approach differs from previous methods by incorporating historical vulnerability knowledge as well as constraint knowledge from API documentation to guide seed generation by LLMs. Additionally, we design multiple prompts built from different types of knowledge and use a multi-armed bandit (MAB) model for dynamic prompt selection. This strategy enhances the comprehensiveness of our testing.
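To make the dynamic prompt selection concrete, the following is a minimal sketch assuming a UCB1-style bandit. The arm labels follow the strategy names used in Section 4.5 (use_constraints is our shorthand for the constraint-leveraging arm), and the reward is a placeholder for the composite reward discussed later; the paper's exact MAB variant and weights are not reproduced here.

```python
import math
import random

# Each arm is a prompt template built from one type of knowledge.
ARMS = ["rag_historical_vuln", "use_constraints", "violate_constraints", "mixed_strategy"]

class PromptBandit:
    """UCB1-style multi-armed bandit over prompt templates (illustrative sketch)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}    # how often each arm was played
        self.values = {a: 0.0 for a in self.arms}  # running mean reward per arm

    def select(self):
        # Play every arm once before applying the UCB rule.
        for a in self.arms:
            if self.counts[a] == 0:
                return a
        total = sum(self.counts.values())
        return max(
            self.arms,
            key=lambda a: self.values[a] + math.sqrt(2 * math.log(total) / self.counts[a]),
        )

    def update(self, arm, reward):
        # Incremental mean update with the reward observed after executing the seed.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Usage: select an arm, build its prompt, ask the LLM for a seed, execute it,
# then feed the observed reward back to the bandit.
bandit = PromptBandit(ARMS)
chosen_arm = bandit.select()
observed_reward = random.random()  # placeholder for the real composite reward
bandit.update(chosen_arm, observed_reward)
```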
4.4. RQ1: Comparison of Seed Generation Quality with Existing Methods
We first evaluate the quality of the seeds generated by our approach to verify its effectiveness. Seed quality is a broad concept, and following prior work, we assess it from three perspectives: overall code coverage, the rate of valid seeds, and API coverage. Experiments are conducted on PyTorch (v1.12) and TensorFlow (v2.10), following the same setup as used in FuzzGPT and other related studies [9].
Firstly, we compare our proposed approach with several traditional methods, including FreeFuzz, DeepREL, and Muffin. The experimental results are summarized in Table 5. For the PyTorch library, our method achieves 33.59% line coverage and 94.91% API coverage; this line coverage significantly surpasses the 13% achieved by baseline methods such as FreeFuzz and DeepREL. Similarly, on the TensorFlow library, our approach reaches 46.29% line coverage and 94.99% API coverage, well above the 30% line coverage of existing tools, while our API coverage remains exceptionally high. These findings suggest that the semantic understanding and generation capabilities inherent to LLMs facilitate a more effective exploration of API call paths within DL libraries, leading to broader and more thorough test coverage.
We further compared our approach with the state-of-the-art LLM-based methods, TitanFuzz and FuzzGPT, on the PyTorch and TensorFlow libraries, using three core metrics: line coverage, valid seed rate, and API coverage. The results are summarized in Table 6 and Table 7.
Table 6 reports the API coverage achieved by each method on the two libraries. Our method consistently outperforms all baselines, achieving the highest API coverage on both PyTorch and TensorFlow. Specifically, it surpasses the best-performing baseline by approximately 8% on PyTorch and by 25% on TensorFlow. This demonstrates our method's superior capability in exercising diverse and deep API call paths, which is critical for exposing hidden vulnerabilities that depend on intricate API interactions.
Table 7 presents a detailed comparison of the line coverage and valid seed rate achieved by our method against the LLM-based methods. The results demonstrate the superiority of our approach on both the PyTorch and TensorFlow libraries. On PyTorch, our method achieves 33.59% line coverage and a 30.82% valid seed rate, significantly outperforming the strongest baseline, FuzzGPT. The performance gap widens on TensorFlow, where our approach attains 46.29% line coverage, an improvement of 1.99 percentage points over FuzzGPT's 44.30%.
This demonstrates the advantage of our approach, which effectively balances seed validity and vulnerability discovery capability. We attribute this advantage to two key factors: the incorporation of rich domain knowledge to guide LLM seed generation and the use of the MAB model to dynamically tailor the seed generation strategy.
4.5. RQ2: Contribution of Key Components to the Effectiveness of Our Approach
In our approach, we utilize an MAB model, in which the arms and the reward function are critical components. To assess the individual contribution of each component to the overall performance, we conducted a series of ablation studies. We randomly sampled 100 APIs from three DL libraries (PaddlePaddle, MindSpore, and OneFlow), conducted each ablation experiment five times, and reported the average results.
Arms: In our method, we treat prompts constructed from different types of knowledge as individual arms. Since each arm has the opportunity to be selected, their design is of critical importance. We first investigate the effectiveness of the arms themselves, evaluating the performance of the four arms described in Section 3.3.1 across the three DL libraries. As a baseline, we included an unguided strategy that operates without prior knowledge. The results are summarized in Table 8.
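As an illustration of how such arms might be assembled, the sketch below builds a prompt from the knowledge attached to each strategy. The template wording, the use_constraints label, and the function name build_prompt are our own illustrative assumptions, not the exact templates of Section 3.3.1.

```python
# Illustrative prompt assembly for the four arms; vuln_summary and constraints stand
# for the retrieved historical-vulnerability summary and the documented API constraints.
def build_prompt(api_name, arm, vuln_summary=None, constraints=None):
    prompt = f"Write a short Python program that exercises `{api_name}`.\n"
    if arm == "rag_historical_vuln" and vuln_summary:
        prompt += f"A related historical bug: {vuln_summary}\nTry to transfer its trigger logic.\n"
    elif arm == "use_constraints" and constraints:
        prompt += f"Respect these documented parameter constraints: {constraints}\n"
    elif arm == "violate_constraints" and constraints:
        prompt += f"Deliberately violate one of these documented constraints: {constraints}\n"
    elif arm == "mixed_strategy":
        prompt += (f"Combine the historical bug ({vuln_summary}) "
                   f"with the documented constraints ({constraints}).\n")
    return prompt

# Usage:
print(build_prompt("torch.Tensor.addcdiv", "use_constraints",
                   constraints="tensor1 and tensor2 must be broadcastable with self"))
```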
Compared to the without_any_strategy baseline, both the rag_historical_vuln and mixed_strategy approaches incorporate external vulnerability knowledge. Across the three DL libraries, seed validity decreases slightly while code coverage improves. For example, in the PaddlePaddle experiments, seed validity under the without_any_strategy baseline is 14.38%, whereas it drops to 12.16% and 12.00% with the rag_historical_vuln and mixed_strategy methods, respectively. Meanwhile, code coverage rises from 21.07% to 21.54% and 21.18%. Our analysis of the generated seed code reveals that the external vulnerability knowledge originates from PyTorch and TensorFlow, causing the LLM to include PyTorch or TensorFlow APIs in parts of the code it generates for the target DL library, which in turn reduces seed validity. Nevertheless, the incorporation of external vulnerability knowledge also enhances code coverage, and in some seeds we can observe the LLM's reasoning process regarding this external knowledge, thereby validating the effectiveness of our strategies.
Our experiments show that, compared with the without_any_strategy baseline, leveraging constraint knowledge significantly boosts both code coverage and seed validity. By guiding the LLM's generation process and preventing errors such as invalid parameter usage, constraint knowledge ultimately enhances the quality of the generated seeds. Moreover, we observed an interesting phenomenon: under the violate_constraints strategy, both seed validity and code coverage still improve over the without_any_strategy baseline, although they remain lower than with the constraint-leveraging strategy. This is because DL libraries often incorporate implicit fault-tolerance mechanisms, such as automatic type coercion, tensor broadcasting, and default parameter filling. These mechanisms allow certain constraint-violating inputs to execute successfully and even reach deeper logic branches. As a result, seed validity is counter-intuitively improved.
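As a small illustration of this implicit fault tolerance (assuming PyTorch-style semantics; the snippet is ours, not taken from the generated seeds), mismatched dtypes and shapes are silently reconciled instead of being rejected:

```python
import torch

# A (3, 1) float tensor and a (4,) int64 tensor do not match in dtype or shape,
# yet the operation succeeds: int64 is promoted to float32 and the shapes are
# broadcast to (3, 4), so a "constraint-violating" seed can still execute.
a = torch.ones(3, 1)                     # float32 by default
b = torch.arange(4, dtype=torch.int64)   # int64, shape (4,)
c = a + b
print(c.shape, c.dtype)                  # torch.Size([3, 4]) torch.float32
```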
In summary, each of our designed arms targets a distinct aspect. The integration of external knowledge enhances the quality of seeds generated by the LLM, and when coordinated by the MAB model, their performance improves even further.
Reward Function: To evaluate the contribution of each reward component in guiding effective seed generation, we conduct a series of ablation experiments by selectively removing individual components from the overall reward function. Specifically, we evaluate four partial rewards, $R_{-crash}$, $R_{-cov}$, $R_{-ast}$, and $R_{-output}$, each corresponding to the exclusion of crash detection, coverage guidance, the AST structural reward, and output behavior feedback, respectively. We evaluate the performance across the three DL libraries, and the results of this analysis are presented in Table 9.
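For concreteness, the sketch below shows one way such a composite reward and its ablations can be expressed as a weighted sum. The component definitions, normalization constants, and equal weights are illustrative assumptions, not the paper's exact formulation.

```python
def composite_reward(crashed, newly_covered_lines, ast_node_count, output_changed,
                     weights=(1.0, 1.0, 1.0, 1.0)):
    """Illustrative weighted-sum reward with crash, coverage, AST, and output terms."""
    w_crash, w_cov, w_ast, w_out = weights
    r_crash = 1.0 if crashed else 0.0                # crash detection signal
    r_cov = min(newly_covered_lines / 100.0, 1.0)    # normalized newly covered lines
    r_ast = min(ast_node_count / 50.0, 1.0)          # normalized AST structural richness
    r_out = 1.0 if output_changed else 0.0           # output behavior feedback
    return w_crash * r_crash + w_cov * r_cov + w_ast * r_ast + w_out * r_out

# Ablation example: R_{-cov} corresponds to zeroing the coverage weight.
r_full = composite_reward(False, 42, 30, True)
r_minus_cov = composite_reward(False, 42, 30, True, weights=(1.0, 0.0, 1.0, 1.0))
```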
The ablation results in Table 9 clearly demonstrate the importance of each component in the reward function. The full reward setting ($R$) consistently achieves the highest coverage and validity across all three DL libraries, confirming the effectiveness of our composite reward design in guiding high-quality seed generation. Notably, removing the coverage reward ($R_{-cov}$) leads to the most significant drop in validity, especially on MindSpore (from 5.28% to 4.20%), highlighting its critical role in steering the generation process toward functionally meaningful inputs. Additionally, removing the structure reward ($R_{-ast}$) causes a sharp decline in validity on OneFlow (from 16.96% to 10.93%), indicating that syntactic and semantic guidance is essential for maintaining the correctness of generated seeds. In contrast, excluding the crash reward ($R_{-crash}$) and the output reward ($R_{-output}$) results in relatively moderate performance degradation. For instance, the removal of $R_{-output}$ has minimal effect on validity on OneFlow but a more noticeable impact on MindSpore, suggesting platform-specific sensitivity to output behavior.
Overall, no single reward component can independently achieve optimal performance. Each part contributes uniquely to different aspects of seed effectiveness—such as behavior triggering, structural validity, and execution semantics—underscoring the necessity of a holistic and multi-dimensional reward strategy.
4.6. RQ3: Detection of Real-World Bugs Using Our Approach
We evaluated three DL libraries using our approach, and a summary of the detected bugs is presented in Table 10. In total, 51 bugs were identified, of which 17 were confirmed to be previously unknown. These bugs have been reported to the developers for further verification and confirmation. They may cause the program to crash during execution, potentially leading to denial-of-service attacks. A detailed list of the bugs can be found in our GitHub repository (https://github.com/deepliao/Bug_list, accessed on 24 September 2025). Below, we present examples of the bugs detected by our approach, drawn from different libraries.
Figure 5a illustrates a representative example of discovering a bug through the Arm1 strategy, driven by historical vulnerability knowledge transfer. The upper part of Figure 5a presents a known historical bug: torch.CharStorage, a low-level memory structure in PyTorch, was incorrectly used as the output target for torch.save, despite not implementing a file-like interface. Due to the absence of the required methods, this misuse ultimately triggered an abort exception. We collected this issue as an entry in our vulnerability knowledge base. When testing torch.BoolStorage, our method employed a vector-based semantic retrieval mechanism to match this historical issue and automatically injected its structured summary into the seed-generation prompt for the LLM. During generation, the model, guided by the injected vulnerability knowledge, demonstrated an initial understanding of the underlying failure mechanism, as shown in the green comment section, and attempted to transfer the same logic to BoolStorage. The lower part of Figure 5a shows the resulting test case, exemplifying how historical bugs can effectively drive reasoning and code construction in language models. This case demonstrates that incorporating vulnerability knowledge can significantly enhance the quality and relevance of generated test cases.
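For reference, the snippet below is a hypothetical reconstruction of the kind of seed shown in the lower part of Figure 5a, not the authors' exact test case: it transfers the torch.CharStorage misuse to torch.BoolStorage by passing a storage object where torch.save expects a path or file-like object.

```python
import torch

# torch.save expects a file path or a file-like object (with write/flush) as its
# second argument; a low-level Storage object provides neither, and the reported
# issue is that this misuse aborts in the C++ backend rather than raising a clean
# Python error.
storage = torch.BoolStorage(8)   # low-level memory structure, not a file-like target
tensor = torch.ones(4)
torch.save(tensor, storage)      # mirrors the historical torch.CharStorage bug
```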
Figure 5b illustrates a bug triggered by Arm2: invoking torch.Tensor.addcdiv in a manner consistent with its API documentation nevertheless results in an internal runtime failure. Specifically, executing the code causes an INTERNAL ASSERT FAILED error from the underlying C++ backend. According to the official documentation, addcdiv performs an element-wise division between two tensors, scales the result by a scalar value, and adds it to the input tensor. However, the documentation does not explicitly state that the input tensor (i.e., self) must be of a floating-point type. In our case, we passed an int64 tensor as the input, along with two float32 tensors as the division operands. While this usage appears semantically valid based on the API description, the backend fails to handle the type mismatch and crashes due to an assertion failure. This bug reveals a gap between the documented interface and the actual implementation behavior, highlighting a lack of type safety enforcement.
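A hypothetical reconstruction of this case (again, not the authors' exact seed) looks as follows; by the documentation's wording nothing forbids an integer self, yet the call reportedly fails with the internal assertion:

```python
import torch

self_tensor = torch.arange(3, dtype=torch.int64)   # integer input tensor (`self`)
t1 = torch.rand(3, dtype=torch.float32)
t2 = torch.rand(3, dtype=torch.float32) + 0.1      # keep the divisor away from zero
out = self_tensor.addcdiv(t1, t2, value=2)         # reported to trigger INTERNAL ASSERT FAILED
```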
Figure 5c illustrates a bug triggered by Arm3 due to a violation of an API constraint. Specifically, executing the code results in a Segmentation fault (core dumped) error. According to the documentation of flow.IntTensor, the parameter data is expected to be one of the following types: list, tuple, NumPy ndarray, scalar, or tensor. However, we passed the string 'invalid', which clearly violates the interface specification. The framework fails to perform type checking at the Python frontend, and the invalid input is forwarded directly to the underlying C++ implementation. This leads to undefined behavior and ultimately results in a segmentation fault. This bug exposes a deficiency in the framework's input validation mechanism, highlighting potential weaknesses in its robustness and security, especially in handling boundary inputs and exceptional types.
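A hypothetical reconstruction of the triggering call (not the authors' exact seed) is a single line:

```python
import oneflow as flow

# `data` must be a list, tuple, NumPy ndarray, scalar, or tensor; the string below
# bypasses the missing frontend check and reportedly segfaults in the C++ backend.
t = flow.IntTensor('invalid')
```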
Our analysis indicates that the root cause of these bugs lies in the lack of standardized API design. This is reflected in ambiguous interface contracts, inconsistencies between documentation and implementation, and insufficient input constraints and error-handling mechanisms. Such design deficiencies lead to unpredictable behaviors in complex application scenarios, ultimately causing program crashes and exposing potential security risks. We therefore recommend that developers of DL libraries place greater emphasis on standardized API design in future development. In particular, interface contracts should be clearly specified, documentation and implementation must remain strictly aligned, and robustness and security considerations should be integrated into the design stage. These measures can address the fundamental sources of such bugs and enhance the reliability and safety of DL libraries.
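As an illustration of the kind of frontend check such standardization implies, the sketch below rejects unsupported inputs before they reach the C++ layer. It is our own illustrative code, not OneFlow's actual implementation, and it omits the tensor case for brevity.

```python
import numbers
import numpy as np

def validated_int_tensor_data(data):
    """Validate `data` at the Python frontend instead of forwarding it to C++."""
    allowed = (list, tuple, np.ndarray, numbers.Number)  # tensor inputs omitted for brevity
    if not isinstance(data, allowed):
        raise TypeError(
            f"IntTensor expects list, tuple, NumPy ndarray, scalar, or tensor; "
            f"got {type(data).__name__}"
        )
    return data

# validated_int_tensor_data('invalid')  # raises TypeError instead of crashing the backend
```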