We evaluate the effectiveness and usefulness of our approach by answering the following three research questions:
RQ1: How does the quality of the seeds generated by our approach compare with that of existing methods?
RQ2: How much do the key components of our approach contribute to its effectiveness?
RQ3: Can our approach detect real-world bugs in DL libraries?
4.2. Baseline
We select both traditional representative approaches and recent popular LLM-based methods for comparison in DL library testing. The selected works are as follows.
FreeFuzz extracts parameter information from open-source code to construct a comprehensive parameter value space, then applies mutation strategies to enable fully automated testing of DL libraries. They conducted experiments on PyTorch and TensorFlow.
DeepREL automatically infers APIs within a DL library that share similar input–output relationships and leverages known APIs’ test inputs to validate and fuzz-test these inferred functions. They conducted experiments on PyTorch and TensorFlow.
Muffin generates a diverse set of DL models and performs cross-library differential testing during the training phase, thereby exhaustively exploring library code and uncovering additional defects. They conducted experiments on TensorFlow, CNTK, and Theano.
TitanFuzz is the first approach to employ LLMs to produce valid, complex DL program inputs, achieving zero-shot automated fuzz testing of DL libraries. They conducted experiments on PyTorch and TensorFlow.
FuzzGPT combines historical bug-triggering code with LLM fine-tuning and in-context learning to automatically generate edge-case samples. They also conducted experiments on PyTorch and TensorFlow.
Our approach differs from previous methods by incorporating historical vulnerability knowledge as well as constraint knowledge from API documentation to guide seed generation by LLMs. Additionally, we design multiple prompts built from different types of knowledge and use a multi-armed bandit (MAB) model for dynamic prompt selection. This strategy enhances the comprehensiveness of our testing.
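To make the dynamic prompt selection concrete, the following is a minimal sketch assuming a UCB1-style bandit. The arm labels follow the strategy names used in Section 4.5 (use_constraints is our shorthand for the constraint-leveraging arm), and the reward is a placeholder for the composite reward discussed later; the paper's exact MAB variant and weights are not reproduced here.

```python
import math
import random

# Each arm is a prompt template built from one type of knowledge.
ARMS = ["rag_historical_vuln", "use_constraints", "violate_constraints", "mixed_strategy"]

class PromptBandit:
    """UCB1-style multi-armed bandit over prompt templates (illustrative sketch)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}    # how often each arm was played
        self.values = {a: 0.0 for a in self.arms}  # running mean reward per arm

    def select(self):
        # Play every arm once before applying the UCB rule.
        for a in self.arms:
            if self.counts[a] == 0:
                return a
        total = sum(self.counts.values())
        return max(
            self.arms,
            key=lambda a: self.values[a] + math.sqrt(2 * math.log(total) / self.counts[a]),
        )

    def update(self, arm, reward):
        # Incremental mean update with the reward observed after executing the seed.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Usage: select an arm, build its prompt, ask the LLM for a seed, execute it,
# then feed the observed reward back to the bandit.
bandit = PromptBandit(ARMS)
chosen_arm = bandit.select()
observed_reward = random.random()  # placeholder for the real composite reward
bandit.update(chosen_arm, observed_reward)
```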
4.4. RQ1: Comparison of Seed Generation Quality with Existing Methods
We first evaluate the quality of the seeds generated by our approach to verify its effectiveness. Seed quality is a broad concept, and following prior work, we assess it from three perspectives: overall code coverage, the rate of valid seeds, and API coverage. Experiments are conducted on PyTorch (v1.12) and TensorFlow (v2.10), following the same setup as used in FuzzGPT and other related studies [9].
Firstly, we compare our proposed approach with several traditional methods, including FreeFuzz, DeepREL, and Muffin. The experimental results are summarized in Table 5. For the PyTorch library, our method achieves 33.59% line coverage and 94.91% API coverage; this line coverage significantly surpasses the 13% achieved by baseline methods such as FreeFuzz and DeepREL. Similarly, on the TensorFlow library, our approach reaches 46.29% line coverage and 94.99% API coverage, well above the 30% line coverage of existing tools, while our API coverage remains exceptionally high. These findings suggest that the semantic understanding and generation capabilities inherent to LLMs facilitate a more effective exploration of API call paths within DL libraries, leading to broader and more thorough test coverage.
We further compared our approach with the state-of-the-art LLM-based methods, TitanFuzz and FuzzGPT, on the PyTorch and TensorFlow libraries, using three core metrics: line coverage, valid seed rate, and API coverage. The results are summarized in Table 6 and Table 7.
Table 6 reports the API coverage achieved by each method on the two libraries. Our method consistently outperforms all baselines, achieving the highest API coverage on both PyTorch and TensorFlow. Specifically, it surpasses the best-performing baseline by approximately 8% on PyTorch and by 25% on TensorFlow. This demonstrates our method's superior capability in exercising diverse and deep API call paths, which is critical for exposing hidden vulnerabilities that depend on intricate API interactions.
Table 7 presents a detailed comparison of the line coverage and valid seed rate achieved by our method against the LLM-based methods. The results demonstrate the superiority of our approach on both the PyTorch and TensorFlow libraries. On PyTorch, our method achieves 33.59% line coverage and a 30.82% valid seed rate, significantly outperforming the strongest baseline, FuzzGPT. The performance gap widens on TensorFlow, where our approach attains 46.29% line coverage, an improvement of 1.99 percentage points over FuzzGPT's 44.30%.
This demonstrates the advantage of our approach, which effectively balances seed validity and vulnerability discovery capability. We attribute this advantage to two key factors: the incorporation of rich domain knowledge to guide LLM seed generation and the use of the MAB model to dynamically tailor the seed generation strategy.
4.5. RQ2: Contribution of Key Components to the Effectiveness of Our Approach
In our approach, we utilize an MAB model, in which the arms and the reward function are critical components. To assess the individual contribution of each component to the overall performance, we conducted a series of ablation studies. We randomly sampled 100 APIs from three DL libraries (PaddlePaddle, MindSpore, and OneFlow), conducted each ablation experiment five times, and reported the average results.
Arms: In our method, we treat prompts constructed from different types of knowledge as individual arms. Since each arm has the opportunity to be selected, their design is of critical importance. We first investigate the effectiveness of the arms themselves, evaluating the performance of the four arms described in Section 3.3.1 across the three DL libraries. As a baseline, we included an unguided strategy that operates without prior knowledge. The results are summarized in Table 8.
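As an illustration of how such arms might be assembled, the sketch below builds a prompt from the knowledge attached to each strategy. The template wording, the use_constraints label, and the function name build_prompt are our own illustrative assumptions, not the exact templates of Section 3.3.1.

```python
# Illustrative prompt assembly for the four arms; vuln_summary and constraints stand
# for the retrieved historical-vulnerability summary and the documented API constraints.
def build_prompt(api_name, arm, vuln_summary=None, constraints=None):
    prompt = f"Write a short Python program that exercises `{api_name}`.\n"
    if arm == "rag_historical_vuln" and vuln_summary:
        prompt += f"A related historical bug: {vuln_summary}\nTry to transfer its trigger logic.\n"
    elif arm == "use_constraints" and constraints:
        prompt += f"Respect these documented parameter constraints: {constraints}\n"
    elif arm == "violate_constraints" and constraints:
        prompt += f"Deliberately violate one of these documented constraints: {constraints}\n"
    elif arm == "mixed_strategy":
        prompt += (f"Combine the historical bug ({vuln_summary}) "
                   f"with the documented constraints ({constraints}).\n")
    return prompt

# Usage:
print(build_prompt("torch.Tensor.addcdiv", "use_constraints",
                   constraints="tensor1 and tensor2 must be broadcastable with self"))
```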
Compared to the without_any_strategy baseline, both the rag_historical_vuln and mixed_strategy approaches incorporate external vulnerability knowledge. Across the three DL libraries, seed validity decreases slightly while code coverage improves. For example, in the PaddlePaddle experiments, seed validity under the without_any_strategy baseline is 14.38%, whereas it drops to 12.16% and 12.00% with the rag_historical_vuln and mixed_strategy methods, respectively. Meanwhile, code coverage rises from 21.07% to 21.54% and 21.18%. Our analysis of the generated seed code reveals that the external vulnerability knowledge originates from PyTorch and TensorFlow, causing the LLM to include PyTorch or TensorFlow APIs in parts of the code it generates for the target DL library, which in turn reduces seed validity. Nevertheless, the incorporation of external vulnerability knowledge also enhances code coverage, and in some seeds we can observe the LLM's reasoning process regarding this external knowledge, thereby validating the effectiveness of our strategies.
Our experiments show that, compared with the without_any_strategy baseline, leveraging constraint knowledge significantly boosts both code coverage and seed validity. By guiding the LLM's generation process and preventing errors such as invalid parameter usage, constraint knowledge ultimately enhances the quality of the generated seeds. Moreover, we observed an interesting phenomenon: under the violate_constraints strategy, both seed validity and code coverage still improve over the without_any_strategy baseline, although they remain lower than with the constraint-leveraging strategy. This is because DL libraries often incorporate implicit fault-tolerance mechanisms, such as automatic type coercion, tensor broadcasting, and default parameter filling. These mechanisms allow certain constraint-violating inputs to execute successfully and even reach deeper logic branches. As a result, seed validity is counter-intuitively improved.
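As a small illustration of this implicit fault tolerance (assuming PyTorch-style semantics; the snippet is ours, not taken from the generated seeds), mismatched dtypes and shapes are silently reconciled instead of being rejected:

```python
import torch

# A (3, 1) float tensor and a (4,) int64 tensor do not match in dtype or shape,
# yet the operation succeeds: int64 is promoted to float32 and the shapes are
# broadcast to (3, 4), so a "constraint-violating" seed can still execute.
a = torch.ones(3, 1)                     # float32 by default
b = torch.arange(4, dtype=torch.int64)   # int64, shape (4,)
c = a + b
print(c.shape, c.dtype)                  # torch.Size([3, 4]) torch.float32
```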
In summary, each of our designed arms targets a distinct aspect. The integration of external knowledge enhances the quality of seeds generated by the LLM, and when coordinated by the MAB model, their performance improves even further.
Reward Function: To evaluate the contribution of each reward component in guiding effective seed generation, we conduct a series of ablation experiments by selectively removing individual components from the overall reward function. Specifically, we evaluate four partial rewards, $R_{-crash}$, $R_{-cov}$, $R_{-ast}$, and $R_{-output}$, each corresponding to the exclusion of crash detection, coverage guidance, the AST structural reward, and output behavior feedback, respectively. We evaluate the performance across the three DL libraries, and the results of this analysis are presented in Table 9.
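For concreteness, the sketch below shows one way such a composite reward and its ablations can be expressed as a weighted sum. The component definitions, normalization constants, and equal weights are illustrative assumptions, not the paper's exact formulation.

```python
def composite_reward(crashed, newly_covered_lines, ast_node_count, output_changed,
                     weights=(1.0, 1.0, 1.0, 1.0)):
    """Illustrative weighted-sum reward with crash, coverage, AST, and output terms."""
    w_crash, w_cov, w_ast, w_out = weights
    r_crash = 1.0 if crashed else 0.0                # crash detection signal
    r_cov = min(newly_covered_lines / 100.0, 1.0)    # normalized newly covered lines
    r_ast = min(ast_node_count / 50.0, 1.0)          # normalized AST structural richness
    r_out = 1.0 if output_changed else 0.0           # output behavior feedback
    return w_crash * r_crash + w_cov * r_cov + w_ast * r_ast + w_out * r_out

# Ablation example: R_{-cov} corresponds to zeroing the coverage weight.
r_full = composite_reward(False, 42, 30, True)
r_minus_cov = composite_reward(False, 42, 30, True, weights=(1.0, 0.0, 1.0, 1.0))
```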
The ablation results in Table 9 clearly demonstrate the importance of each component in the reward function. The full reward setting ($R$) consistently achieves the highest coverage and validity across all three DL libraries, confirming the effectiveness of our composite reward design in guiding high-quality seed generation. Notably, removing the coverage reward ($R_{-cov}$) leads to the most significant drop in validity, especially on MindSpore (from 5.28% to 4.20%), highlighting its critical role in steering the generation process toward functionally meaningful inputs. Additionally, removing the structure reward ($R_{-ast}$) causes a sharp decline in validity on OneFlow (from 16.96% to 10.93%), indicating that syntactic and semantic guidance is essential for maintaining the correctness of generated seeds. In contrast, excluding the crash reward ($R_{-crash}$) and the output reward ($R_{-output}$) results in relatively moderate performance degradation. For instance, the removal of $R_{-output}$ has minimal effect on validity on OneFlow but a more noticeable impact on MindSpore, suggesting platform-specific sensitivity to output behavior.
Overall, no single reward component can independently achieve optimal performance. Each part contributes uniquely to different aspects of seed effectiveness—such as behavior triggering, structural validity, and execution semantics—underscoring the necessity of a holistic and multi-dimensional reward strategy.
4.6. RQ3: Detection of Real-World Bugs Using Our Approach
We evaluated three DL libraries using our approach, and a summary of the detected bugs is presented in Table 10. In total, 51 bugs were identified, of which 17 were confirmed to be previously unknown. These bugs have been reported to the developers for further verification and confirmation. They may cause the program to crash during execution, potentially leading to denial-of-service attacks. A detailed list of the bugs can be found in our GitHub repository (https://github.com/deepliao/Bug_list, accessed on 24 September 2025). Below, we present examples of the bugs detected by our approach, drawn from different libraries.
Figure 5a illustrates a representative example of discovering a bug through the Arm1 strategy, driven by historical vulnerability knowledge transfer. The upper part of Figure 5a presents a known historical bug: torch.CharStorage, a low-level memory structure in PyTorch, was incorrectly used as the output target for torch.save, despite not implementing a file-like interface. Due to the absence of the required methods, this misuse ultimately triggered an abort exception. We collected this issue as an entry in our vulnerability knowledge base. When testing torch.BoolStorage, our method employed a vector-based semantic retrieval mechanism to match this historical issue and automatically injected its structured summary into the seed-generation prompt for the LLM. During generation, the model, guided by the injected vulnerability knowledge, demonstrated an initial understanding of the underlying failure mechanism, as shown in the green comment section, and attempted to transfer the same logic to BoolStorage. The lower part of Figure 5a shows the resulting test case, exemplifying how historical bugs can effectively drive reasoning and code construction in language models. This case demonstrates that incorporating vulnerability knowledge can significantly enhance the quality and relevance of generated test cases.
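For reference, the snippet below is a hypothetical reconstruction of the kind of seed shown in the lower part of Figure 5a, not the authors' exact test case: it transfers the torch.CharStorage misuse to torch.BoolStorage by passing a storage object where torch.save expects a path or file-like object.

```python
import torch

# torch.save expects a file path or a file-like object (with write/flush) as its
# second argument; a low-level Storage object provides neither, and the reported
# issue is that this misuse aborts in the C++ backend rather than raising a clean
# Python error.
storage = torch.BoolStorage(8)   # low-level memory structure, not a file-like target
tensor = torch.ones(4)
torch.save(tensor, storage)      # mirrors the historical torch.CharStorage bug
```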
Figure 5b illustrates a bug triggered by Arm2: invoking torch.Tensor.addcdiv in a manner consistent with its API documentation nevertheless results in an internal runtime failure. Specifically, executing the code causes an INTERNAL ASSERT FAILED error from the underlying C++ backend. According to the official documentation, addcdiv performs an element-wise division between two tensors, scales the result by a scalar value, and adds it to the input tensor. However, the documentation does not explicitly state that the input tensor (i.e., self) must be of a floating-point type. In our case, we passed an int64 tensor as the input, along with two float32 tensors as the division operands. While this usage appears semantically valid based on the API description, the backend fails to handle the type mismatch and crashes due to an assertion failure. This bug reveals a gap between the documented interface and the actual implementation behavior, highlighting a lack of type safety enforcement.
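A hypothetical reconstruction of this case (again, not the authors' exact seed) looks as follows; by the documentation's wording nothing forbids an integer self, yet the call reportedly fails with the internal assertion:

```python
import torch

self_tensor = torch.arange(3, dtype=torch.int64)   # integer input tensor (`self`)
t1 = torch.rand(3, dtype=torch.float32)
t2 = torch.rand(3, dtype=torch.float32) + 0.1      # keep the divisor away from zero
out = self_tensor.addcdiv(t1, t2, value=2)         # reported to trigger INTERNAL ASSERT FAILED
```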
Figure 5c illustrates a bug triggered by Arm3 due to a violation of an API constraint. Specifically, executing the code results in a Segmentation fault (core dumped) error. According to the documentation of flow.IntTensor, the parameter data is expected to be one of the following types: list, tuple, NumPy ndarray, scalar, or tensor. However, we passed the string 'invalid', which clearly violates the interface specification. The framework fails to perform type checking at the Python frontend, and the invalid input is forwarded directly to the underlying C++ implementation. This leads to undefined behavior and ultimately results in a segmentation fault. This bug exposes a deficiency in the framework's input validation mechanism, highlighting potential weaknesses in its robustness and security, especially in handling boundary inputs and exceptional types.
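A hypothetical reconstruction of the triggering call (not the authors' exact seed) is a single line:

```python
import oneflow as flow

# `data` must be a list, tuple, NumPy ndarray, scalar, or tensor; the string below
# bypasses the missing frontend check and reportedly segfaults in the C++ backend.
t = flow.IntTensor('invalid')
```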
Our analysis indicates that the root cause of these bugs lies in the lack of standardized API design. This is reflected in ambiguous interface contracts, inconsistencies between documentation and implementation, and insufficient input constraints and error-handling mechanisms. Such design deficiencies lead to unpredictable behaviors in complex application scenarios, ultimately causing program crashes and exposing potential security risks. We therefore recommend that developers of DL libraries place greater emphasis on standardized API design in future development. In particular, interface contracts should be clearly specified, documentation and implementation must remain strictly aligned, and robustness and security considerations should be integrated into the design stage. These measures can address the fundamental sources of such bugs and enhance the reliability and safety of DL libraries.
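As an illustration of the kind of frontend check such standardization implies, the sketch below rejects unsupported inputs before they reach the C++ layer. It is our own illustrative code, not OneFlow's actual implementation, and it omits the tensor case for brevity.

```python
import numbers
import numpy as np

def validated_int_tensor_data(data):
    """Validate `data` at the Python frontend instead of forwarding it to C++."""
    allowed = (list, tuple, np.ndarray, numbers.Number)  # tensor inputs omitted for brevity
    if not isinstance(data, allowed):
        raise TypeError(
            f"IntTensor expects list, tuple, NumPy ndarray, scalar, or tensor; "
            f"got {type(data).__name__}"
        )
    return data

# validated_int_tensor_data('invalid')  # raises TypeError instead of crashing the backend
```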