Article

Repairing DNN Numerical Defects with Semantic-Driven Knowledge Graph Retrieval

1 School of Reliability and Systems Engineering, Beihang University, Beijing 100191, China
2 China Academy of Electronics and Information Technology, Shijingshan District, Beijing 100041, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(4), 2124; https://doi.org/10.3390/app16042124
Submission received: 19 January 2026 / Revised: 11 February 2026 / Accepted: 19 February 2026 / Published: 22 February 2026

Abstract

Ensuring numerical robustness in deep neural networks (DNNs) is critical, as defects like overflow or NaN can cause silent failures. However, automated repair is challenged by fragmented domain knowledge and the semantic gap for general-purpose large language models (LLMs). This work proposes NCKG, a Numerical–Conceptual Knowledge Graph-based method for retrieval-augmented repair of DNN numerical defects. NCKG introduces a unified semantic formalization that explicitly models DNN execution contexts, numerical defects, mitigation methods, and constraint knowledge, transforming dispersed defect knowledge into a consistent, machine-interpretable representation. Based on this formalization, a multi-view semantic graph index is constructed, enabling a hybrid semantic-driven retrieval mechanism that combines structure-aware graph matching with vector similarity. Retrieved, semantically aligned defect–repair knowledge is then used to guide LLMs in generating context-aware repairs. Experimental results demonstrate that NCKG significantly outperforms standard retrieval baselines and consistently improves the quality and correctness of LLM-generated fixes across different model scales. This work demonstrates that explicit semantic structuring and retrieval of domain knowledge are crucial for enabling reliable, automated numerical defect repair in DNNs.

1. Introduction

Deep neural networks (DNNs) have become ubiquitous across critical domains such as autonomous systems, healthcare, and scientific computing. However, ensuring their numerical robustness remains a significant and persistent challenge [1,2]. DNN programs are inherently susceptible to subtle numerical defects—including overflow, underflow, division-by-zero, and gradient pathologies—that can lead to silent failures (e.g., incorrect outputs), training instability (NaN/Inf values), or complete runtime crashes [3,4].
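To make the failure mode concrete, the following minimal sketch (a generic NumPy illustration, not taken from the paper) shows how a naive softmax silently produces NaN via exponent overflow, and how the standard max-subtraction rewrite avoids it:

```python
import numpy as np

def softmax_naive(x):
    # exp overflows to inf for large logits; inf / inf then yields NaN
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # subtracting the max keeps every exponent <= 0, so exp never overflows
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([1000.0, 1000.0, 1000.0])
with np.errstate(over="ignore", invalid="ignore"):
    print(np.isnan(softmax_naive(logits)).any())  # True: a silent NaN failure
print(softmax_stable(logits))                     # a valid probability vector
```

Note that the naive version raises no exception: without explicit checks, the NaN propagates silently through subsequent layers, which is exactly the "silent failure" pattern described above.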
Debugging and repairing such defects is notoriously difficult, requiring deep expertise in numerical analysis, framework specifics, and the problem domain. A core underlying issue is the heterogeneous and fragmented nature of the defect knowledge. Relevant information is scattered across various sources: framework documentation details API-specific behaviors; mathematical textbooks describe stability conditions; issue trackers contain empirical bug reports; and commit histories hold concrete fix examples [5,6]. This knowledge exists in disparate formats—natural language, code, and symbolic constraints—resulting in fragmented and implicit knowledge that is hard to reuse or systematize.
Existing approaches to DNN repair exhibit notable limitations. Rule-based methods [7,8] and symbolic techniques [9] are precise but struggle with the diversity of real-world defects beyond their predefined patterns. These approaches typically operate at the level of strategy recommendation or implicit mitigation, rather than producing concrete, code-level repairs. Rule-based systems usually map observed numerical symptoms to a fixed set of predefined repair strategies, while symbolic methods often provide implicit corrections or safety constraints without explicitly modifying the source code, which makes it difficult to generalize to heterogeneous code contexts or to support direct, executable fixes in real-world DNN source code.
The emergence of powerful large language models (LLMs) offers a promising avenue for addressing software bugs. LLMs trained on vast code corpora can generate plausible code fixes. However, applying general-purpose LLMs directly to DNN numerical defects yields suboptimal results. These models lack specific, actionable knowledge about numerical stability patterns, the semantic relationships between DNN components (e.g., layers and optimizers) and failure symptoms, and the nuanced strategies required for valid fixes (e.g., precision adjustment vs. algorithm replacement). Consequently, LLM-generated patches for numerical issues often fail to address the root cause or introduce new errors.
To bridge these gaps, this paper introduces NCKG (a Numerical–Conceptual Knowledge Graph-based method) for graph-indexed, retrieval-augmented repair of DNN numerical defects. This work first establishes a unified conceptual semantic framework for DNN numerical defects. The framework centers on a concept graph that defines and links core semantic entities—such as the DNN component, the observed symptom, and the applicable mitigation strategy—along with their definitive relations. This formalization achieves semantic alignment across heterogeneous knowledge sources by providing a consistent, abstract representation, transforming fragmented artifacts into a coherent, machine-interpretable structure. Based on this formalization, a multi-view graph index is constructed, where each instance is decomposed into three focused semantic subgraphs (defect, context, and mitigation). These graphs serve as a structured retrieval index, enabling hybrid search that combines symbolic graph matching with subsymbolic vector similarity. For a new defect, NCKG retrieves the most semantically related concepts and rationales, enriching the prompt with relevant examples. This knowledge-guided context is then provided to an LLM to generate a context-aware repair. This study employs NCKG to integrate structured, semantic knowledge, aiming to enhance both the retrieval of relevant fixes and the generation of subsequent repairs by LLMs.
The primary contributions of this work are as follows:
  • Constructing a conceptual semantic framework for DNN numerical defects to unify and interrelate heterogeneous defect–repair knowledge. A structured conceptual semantic framework is proposed to model DNN numerical defects through explicit representations of faulty components, numerical anomaly states, repair operations, and domain constraints. By focusing on conceptual representations, this framework unifies heterogeneous defect–repair artifacts, enabling transferable defect–repair knowledge beyond individual code instances.
  • Developing a multi-view semantic-knowledge-driven graph indexing and hybrid retrieval method. Building on the proposed semantic framework, an index-based GraphRAG extension for numerical defect retrieval is introduced, where multi-view semantic subgraphs are constructed to capture complementary aspects of defect knowledge, and a hybrid retrieval mechanism is designed to integrate structure-aware subgraph matching with semantic vector similarity, allowing precise yet flexible retrieval of relevant defect–repair knowledge grounded in domain semantics.
  • Proposing a knowledge-guided LLM-based repair generation pipeline for DNN numerical defects. A knowledge-guided repair pipeline is developed to leverage the retrieved semantic context to guide large language models, reducing hallucinations and improving the reliability of numerical defect–repair generation.
To systematically evaluate the effectiveness of the proposed NCKG method, this study is guided by the following research questions: RQ1 assesses the fundamental retrieval capability, testing whether NCKG retrieves more semantically relevant fixes than conventional methods. RQ2 evaluates the end-to-end repair generation performance, examining whether augmenting LLMs with this retrieved knowledge yields more reliable repairs. RQ3 performs an ablation study to deconstruct the hybrid mechanism, quantifying the individual contribution of its graph-based and vector-based components to substantiate the design rationale. The specific questions guiding this investigation are as follows:
  • RQ1: Comparative Evaluation of Retrieval Effectiveness. Does NCKG enable more semantically accurate retrieval of numerical defect–repair knowledge compared to existing retrieval methods, as measured by the semantic alignment between the retrieved mitigation strategies and the ground-truth strategies?
  • RQ2: Impact of the Retrieved Context on Repair Generation. How does the quality of the final generated repair change when large language models are augmented with NCKG-retrieved semantic contexts across different retrieval backbones and generative models?
  • RQ3: Ablation Study of the Hybrid Retrieval Mechanism. What are the individual contributions of the graph-based and vector-based retrieval components within the hybrid NCKG framework to overall repair effectiveness?
This paper is organized as follows: Section 2 reviews related work. Section 3 details the proposed methodology, which consists of four main parts: the formal definition of conceptual semantics and relations for DNN numerical defects, the construction of a graph index via semantic subgraphs, a hybrid semantic retrieval mechanism over graph indices, and the knowledge-guided repair generation pipeline. Section 4 and Section 5 present the experimental setup and the analysis of the results, respectively. Section 6 provides a discussion of the findings, and Section 7 concludes the paper.

2. Related Work

The related work is investigated from the perspectives of knowledge engineering for software tasks, retrieval-augmented generation with large language models, and automated repair of DNN numerical defects.

2.1. Knowledge Engineering for Software Tasks

Knowledge engineering, which involves the formal representation and utilization of domain knowledge, has a long history in software engineering. Traditional approaches often rely on manually or semi-automatically constructed knowledge bases, such as rule-based systems [10,11] and code property graphs [12,13]. For instance, pattern-based bug detection tools use predefined defect patterns encoded as rules [14,15,16]. In program analysis, semantic code graphs [17,18,19] (e.g., control flow graphs, data flow graphs, and program dependence graphs) have been extensively used to capture structural and semantic relationships for tasks like clone detection, vulnerability analysis, and program understanding. More recently, the concept of “knowledge graphs” [20] has been adopted to integrate heterogeneous software artifacts (e.g., code, commits, issues, and documentation) to support tasks such as code search, API recommendation, and bug localization.
However, applying such knowledge engineering techniques to the specific domain of deep neural network (DNN) numerical defects presents distinct and significant challenges. First, the relevant knowledge is highly heterogeneous, spanning multiple dimensions: mathematical–numerical constraints, hardware-specific floating-point behavior, concrete defect–fix examples, empirical debugging heuristics, and framework-specific API specifications [5,6]. This diversity makes it difficult for traditional symbolic knowledge representations to adapt to and align with the semantics across varied textual and code-based artifacts. Second, unlike traditional software with explicit algorithmic logic, DNNs exhibit opaque, data-driven behavior [21], making it harder to define and capture their failure modes and repair strategies within a static, rule-bound schema.
In other engineering domains, similar challenges of heterogeneous information have been addressed through ontological frameworks. For example, in the maintenance domain, Zhou et al. [22] addressed the inconsistent propagation of maintenance process information (MPI) by establishing a lifecycle MPI propagation architecture to unify the semantic description of maintenance activities. Similarly, Xia et al. [23] proposed a Maintenance Knowledge Graph (MKG) ontology framework to tackle inefficient knowledge utilization and the lack of unified standards in maintenance scenarios. This framework provides a paradigm model for structuring maintenance information based on a hierarchical representation of maintenance elements.
The proposed method introduces a lightweight, domain-grounded conceptual semantic schema tailored for DNN numerical defects. Instead of attempting to construct a comprehensive, general-purpose code graph, this method departs from code-structure–centric semantic defect modeling approaches and focuses on building a concept-centric graph that distills the key entities along with their causal and logical relations, achieving semantic alignment based on the concept graph and enabling semantic associations directly related to repair.

2.2. Prompt Engineering and Retrieval-Augmented Generation for LLMs in Code Processing

The effectiveness of large language models (LLMs) on downstream tasks is critically dependent on the input context provided, a practice known as prompt engineering [24]. Retrieval-augmented generation (RAG) [25] enhances this factor by dynamically fetching relevant external knowledge to condition the LLM, mitigating issues like hallucination and improving factual accuracy. Research in prompt engineering has explored various in-context learning designs [26], such as providing few-shot examples, chain-of-thought reasoning steps, or specific instruction formats. Tools like ThinkRepair [27] retrieve “bug–Chain-of-Thought–fix” triples to explicitly guide the LLM’s reasoning process for general program repair.
The retrieval core of most existing RAG systems for code relies primarily on measuring similarity between unstructured text passages, such as code snippets or commit messages [28]. Sparse retrievers like BM25 [29] match based on keyword overlap, while dense retrievers like Dense Passage Retrievers (DPRs) [30] or those using code-specific sentence transformers (e.g., CodeBERT) match based on learned semantic embeddings. A fundamental limitation of this text-centric paradigm is semantic misalignment: textual similarity does not guarantee that the retrieved example illustrates the same type of problem or employs the same repair strategy as the query. This is particularly problematic for complex, semantically nuanced tasks like numerical defect repair, where the relevance of a fix is defined by specific conceptual relationships rather than surface-level text.
To overcome the limitations of unstructured text retrieval, graph retrieval-augmented generation (GraphRAG) [31] has emerged as a paradigm that leverages structured knowledge to better ground LLMs. Existing GraphRAG approaches can be categorized based on their utilization of the graph structure. Knowledge-based GraphRAG [32,33] uses graphs primarily as carriers of factual knowledge, where entities and relations from a pre-existing, often general-purpose, knowledge graph are retrieved to enrich the context. Index-based GraphRAG [34,35], conversely, uses a graph as a structured index over a corpus. Here, data points (e.g., bug-fix pairs) are decomposed into entities and relations to build a task-specific graph index, and retrieval is performed by matching the query’s structured representation against this index. Hybrid GraphRAG methods attempt to combine both approaches.
The NCKG method presented in this work is an instantiation of the index-based GraphRAG paradigm, and it is specifically adapted for DNN numerical defects. NCKG advances this paradigm through domain-specific semantic modeling and indexing design. In particular, a multi-view graph index is constructed based on a conceptual semantic schema, where three focused subgraph indices are built to capture complementary facets—defect diagnosis, component context, and mitigation strategy—of each defect–fix pair. Retrieval operates over structure-aware subgraph matching with semantic vector similarity, enabling precise alignment between a query’s defect scenario and the repair logic of candidate examples.

2.3. Automated Repair for DNN Numerical Defects

Automatically addressing numerical instability in DNNs is a critical and distinct sub-field. Early approaches were largely rule-based. AutoTrainer [7] defined a set of training-time symptoms and applied a series of predefined candidate fixes in sequence. DeepDiagnosis [8] extended this by using a decision tree to prune the search space of fixes based on a broader symptom set. In both cases, “repair” is operationalized as selecting from a fixed set of coarse-grained actions (e.g., adjusting the learning rate or initialization) rather than generating concrete, context-sensitive code-level patches. These methods are inherently limited by their predefined rule sets and their reliance on executable training pipelines with runtime monitoring, which restricts their applicability to real-world scenarios.
Importantly, semantics-driven approaches have also been explored. RANUM [9] frames numerical repair as a constraint optimization problem, using abstract interpretation to compute safe input intervals for operators to prevent instability (e.g., division by zero). While powerful for range-related defects, RANUM adopts a black-box fixing paradigm and produces constraint-level artifacts rather than explicit source-code repairs, and its applicability is constrained to defects expressible via interval constraints and may not cover other common issues like gradient vanishing/explosion or algorithmic inefficiency.
Very recently, researchers have begun exploring LLMs for DNN debugging and repair, typically via prompt engineering that asks the model to localize faults and suggest patches [36]. However, these initial attempts often treat the problem as generic code generation, lacking mechanisms to incorporate domain-specific knowledge about numerical stability patterns. There is a clear gap between the generic capabilities of LLMs and the specialized knowledge required for reliable numerical defect repair. This work bridges this gap by introducing a knowledge-aware retrieval framework grounded in a conceptual semantic model of DNN numerical defects. Instead of relying on unstructured text retrieval or fixed rule mappings, NCKG retrieves defect–fix examples based on explicit semantic alignment among defect symptoms, affected components, and repair strategies. This positions NCKG as a novel synthesis of concept-level knowledge modeling, graph-based retrieval, and LLM-based repair generation for a specialized, high-stakes domain.

3. Methodology

The proposed method enhances generative large language model (LLM) repair generation for DNN numerical defects through structured knowledge retrieval. The method is demonstrated in Figure 1 and comprises three phases: (1) the construction of semantic graphs, (2) a hybrid retrieval mechanism over these graphs, and (3) a guided generation process.

3.1. Unified Framework of Conceptual Semantics and Relations for DNN Numerical Defect

Numerical defects in deep neural networks are documented and discussed across a wide range of heterogeneous sources, including source code repositories, mathematical references, and informal developer discussions. These sources differ substantially in structure, abstraction level, and terminology. As a result, knowledge related to the same numerical defect or repair strategy is often fragmented and difficult to integrate, limiting systematic reasoning and reuse across cases.
A key challenge in numerical defect repair lies in unifying such heterogeneous knowledge into a consistent and machine-interpretable representation. Without explicit semantic abstraction, it becomes difficult to align defect symptoms with their causes, relate repairs across different components, or generalize repair strategies beyond individual instances. To address this challenge, a conceptual semantic framework is introduced to abstract numerical defects and their repairs into a unified representation, as shown in Figure 2. This framework decouples defect semantics from source-specific representations by identifying a small set of core semantic elements and explicitly modeling their relationships. By doing so, heterogeneous information originating from diverse sources can be normalized into a common semantic space, enabling consistent interpretation, integration, and subsequent reasoning.

3.1.1. Numerical Defect Semantic Elements Definition

The core of the framework comprises four semantic elements, as defined in Table 1. Detailed mathematical definitions for each semantic element are provided in Appendix A.1.
The core elements are formally aligned with two established foundational ontologies from complementary domains: the FIDES [37] ontology, which semantically annotates data, parameters, and processes across the ML life cycle, and the Ontology of Software Defects, Errors and Failures (OSDEF) [38], a reference model grounded in the Unified Foundational Ontology (UFO) [39] for clarifying software anomaly concepts.
The proposed execution context (C), numerical defect (ND), and mitigation method (M) elements directly map to the execution–executor–procedure (EEP) and result–context (RC) Ontology Design Patterns (ODPs), which form the backbone of the EEPSA ontology [40] reused by the FIDES framework. The DNN component (the executor) performs an execution within a specific C, which may implement a faulty mathematical procedure that manifests as an ND. Concurrently, within the RC ODP, the observed ND is the result, occurring within the circumstantial context defined by C. The M then acts as a corrective procedure for this specific result–context pair, with its validity anchored by the constraint (K). This structuring allows the framework to reuse FIDES’s approach of formally linking procedures (ND and M) with their outcomes and contexts.
Simultaneously, the framework’s concepts are grounded in the OSDEF ontology. Here, the ND is specialized from OSDEF’s core taxonomy: a numerical defect (a bug in the code or model specification) may cause a runtime error (an incorrect internal state, e.g., an overflow; corresponding to ρ), which can lead to a failure (an observable deviation from expected behavior, e.g., a NaN output; corresponding to σ). The C precisely locates this chain within the DNN pipeline artifact, while M targets the rectification of the underlying defect or error.
By integrating the EEP/RC ODPs from FIDES/EEPSA with the conceptual taxonomy of OSDEF, this framework achieves a multifaceted foundation. It inherits OSDEF’s rigorous, UFO-based distinctions for software anomalies, ensuring clarity between defects, errors, and failures in the DNN context. Concurrently, it adopts FIDES’s reusable, pattern-based method for structurally linking the what (the procedure from defect to failure), the where (the context), and the how (the procedure from fix to failure) of a DNN numerical defect. Collectively, the semantic elements C, ND, M, and K provide a comprehensive and structured abstraction of numerical defects and their repairs.

3.1.2. Semantic Relation Schema

The semantic richness of the framework emerges from the defined relations R that link the conceptual entities. These relations form a directed, labeled multi-graph schema, establishing causal, associative, and applicative links. The relation schema R is formally defined as a set of relation types r_i, each associating a source entity type S_i with a target entity type T_i, capturing a specific semantic dependency ϕ_i.
The full relation schema, derived from domain analysis, is presented in Table 2. Each relation is defined between specific attributes of the entities, effectively establishing paths between entity instances in the knowledge graph.
Let E = {Execution Context (C), Numerical Defect (ND), Mitigation Method (M), Constraint (K)} be the set of entity types. A relation r ∈ R is a triple r = (S, T, ϕ), where S, T ∈ E and ϕ is a predicate describing the relationship.
This structured relation schema, R, transforms a collection of isolated attributes into a connected, semantic knowledge graph. A path through this graph, such as σ → ρ → μ → ψ, represents a coherent reasoning chain: from a symptom to its root cause, leading to a method that instantiates a particular strategy. This formalism enables the subgraph matching and retrieval processes detailed in the next section, which are essential for augmenting LLM-based repair generation with precise, context-aware exemplars.
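Such a reasoning chain can be illustrated by treating the schema's relations as (source type, target type, predicate) triples, following the definition r = (S, T, ϕ), and chaining them from a symptom to a strategy. This is a hypothetical sketch: the entity-type names and the method_instantiates_strategy label are illustrative, not the paper's exact schema.

```python
# Relation schema R as (source_type, target_type, predicate) triples,
# mirroring r = (S, T, phi). Names are illustrative placeholders.
RELATIONS = [
    ("symptom", "cause", "symptom_indicates"),
    ("cause", "method", "cause_suggests_method"),
    ("method", "strategy", "method_instantiates_strategy"),  # hypothetical label
]

def reasoning_chain(start, relations):
    """Follow relations from a start entity type, producing the
    sigma -> rho -> mu -> psi style chain described in the text."""
    chain = [start]
    current = start
    progressed = True
    while progressed:
        progressed = False
        for src, tgt, _ in relations:
            if src == current:
                chain.append(tgt)
                current = tgt
                progressed = True
                break
    return chain

print(reasoning_chain("symptom", RELATIONS))
# ['symptom', 'cause', 'method', 'strategy']
```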

3.2. Graph-Index Construction via Semantic Subgraphs

The core of the proposed methodology is a structured graph-indexed knowledge base that is engineered to enable efficient and semantically precise retrieval. The graph index is designed to integrate multi-source knowledge, for example, code-level defect–fix pairs and descriptive knowledge concerning numerical defects. The construction involves two main phases: (1) extracting instances of the formal semantic entities from each data point and (2) organizing these entities into three distinct, interlinked semantic subgraphs that serve as the primary indexing units. These subgraphs capture complementary facets of the defect–repair–knowledge triad, facilitating targeted retrieval based on different query aspects (e.g., symptom, component, or mitigation strategy).
Given a corpus of defect–fix pairs and related documents D = {d_1, d_2, …, d_i}, each data point d_i typically contains a defective code snippet, its corrected version, and accompanying textual descriptions (e.g., commit messages, issue reports, and forum discussions). For each d_i, a concept extraction function Φ is applied, which uses rule-based parsing or lightweight entity recognition models to map the raw data to a set of structured entity instances:
E_i = {C_i, ND_i, M_i, K_i} = Φ(d_i)
where C_i, ND_i, M_i, and K_i correspond to instances of the execution context, numerical defect, mitigation method, and constraint entities, respectively, following the formal definitions in Section 3.1.1. For the construction of the retrieval knowledge base, the concept extraction procedure primarily follows the manual analysis results aligned with the DeepStability dataset [5]. Specifically, for high-level semantic dimensions such as the phase type, symptom type, and strategy type, the extraction is grounded in a predefined set of abstract concept categories (see the definitions in Appendix A.1), which act as canonical semantic anchors. The alignment between extracted concepts and the underlying ontologies is established at the concept definition level through systematic ontology reuse and specialization. Because the semantic elements of NCKG reuse and specialize core concepts and design patterns from the FIDES and OSDEF ontologies, semantic consistency is ensured at the definition level; these ontology-aligned concept definitions provide explicit semantic guidance for the extraction function Φ, enabling consistent instantiation across heterogeneous data sources. Further details are provided in Appendix A.2.
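A minimal sketch of how such an extraction function Φ might look, assuming simple keyword rules over a hypothetical record layout (the category lists, field names, and the example constraint are illustrative, not the paper's implementation):

```python
import re

# Canonical concept categories acting as semantic anchors (illustrative subset).
SYMPTOMS = {"nan": "NaN output", "inf": "Inf/overflow", "zero division": "division-by-zero"}
STRATEGIES = {"clip": "value clipping", "epsilon": "epsilon smoothing", "log-sum-exp": "rescaling"}

def phi(data_point: dict) -> dict:
    """Rule-based sketch of the extraction function Phi: map a raw
    defect-fix record to structured entity instances {C, ND, M, K}."""
    text = (data_point["description"] + " " + data_point["fix_message"]).lower()
    nd = {"symptom": next((v for k, v in SYMPTOMS.items() if k in text), None)}
    m = {"strategy": next((v for k, v in STRATEGIES.items() if k in text), None)}
    c = {"operation": re.findall(r"\b(softmax|log|exp|div)\b", data_point["code"])}
    k = {"constraint": "exp(x) overflows for x > ~709 in float64"}  # example constraint
    return {"C": c, "ND": nd, "M": m, "K": k}

record = {
    "code": "y = exp(x) / exp(x).sum()",
    "description": "training loss becomes NaN after a few steps",
    "fix_message": "use log-sum-exp rescaling to avoid overflow",
}
print(phi(record)["ND"]["symptom"])  # NaN output
```

In the actual pipeline this role is filled by the DeepStability-aligned manual analysis and the ontology-anchored categories described above; the point of the sketch is only the mapping from a raw data point d_i to the structured tuple E_i.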
Further, rather than constructing a monolithic heterogeneous graph, the extracted entity attributes are decomposed into three focused semantic subgraphs per data point. Each subgraph serves as a distinct indexing lens, capturing orthogonal dimensions of the defect–repair–knowledge relationship, as shown in Figure 3. This multi-view decomposition enables targeted retrieval operations aligned with different query intents.
Defect Semantic Index (G^(d)): This index models the diagnostic pathway from observable symptoms to root causes, as contextualized by background knowledge. For each data point d_i, the corresponding subgraph G_i^(d) = (V_i^(d), E_i^(d)) is constructed, where
V_i^(d) = {σ_i, ρ_i, χ_i, ξ_i},
E_i^(d) = {(σ_i, symptom_indicates, ρ_i), (χ_i, context_informs_cause, ρ_i), (ξ_i, knowledge_explains_context, χ_i)}
This subgraph encodes the diagnostic reasoning chain, serving as an index focusing on error symptoms and their causation.
Context Semantic Index (G^(c)): This index anchors defects within the DNN computational architecture by linking component types to specific functions and mathematical operations. The per-instance subgraph G_i^(c) = (V_i^(c), E_i^(c)) is defined as
V_i^(c) = {τ_i, f_i, op_i, σ_i},
E_i^(c) = {(τ_i, phase_defines, f_i), (op_i, operation_implements, f_i), (σ_i, symptom_manifests_in, f_i)}
This subgraph captures the computational context, enabling retrieval based on architectural or operational similarity.
Repair Semantic Index (G^(m)): This index captures the repair rationale by connecting diagnosed causes to concrete mitigation methods that are constrained by external knowledge. Each instance yields a subgraph G_i^(m) = (V_i^(m), E_i^(m)), where
V_i^(m) = {ρ_i, χ_i, ψ_i, μ_i, ξ_i},
E_i^(m) = {(ρ_i, cause_suggests_method, μ_i), (χ_i, context_suggests_method, μ_i), (ψ_i, strategy_generalizes, μ_i), (ξ_i, knowledge_motivates_strategy, ψ_i)}
This subgraph encapsulates the repair logic, indexing the pathway from cause analysis to actionable fixes.
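The three per-instance subgraph indices can be sketched as plain edge sets. The entity attribute names below spell out the paper's symbols (σ, ρ, χ, ξ, τ, f, op, ψ, μ); the concrete values are an invented example, not data from the knowledge base:

```python
def build_indices(entities: dict) -> dict:
    """Build the three per-instance semantic subgraphs (defect, context,
    mitigation) as edge sets, following the index definitions in the text."""
    sigma, rho = entities["symptom"], entities["cause"]
    chi, xi = entities["context_info"], entities["knowledge"]
    tau, f, op = entities["phase"], entities["function"], entities["operation"]
    psi, mu = entities["strategy"], entities["method"]

    g_defect = {
        (sigma, "symptom_indicates", rho),
        (chi, "context_informs_cause", rho),
        (xi, "knowledge_explains_context", chi),
    }
    g_context = {
        (tau, "phase_defines", f),
        (op, "operation_implements", f),
        (sigma, "symptom_manifests_in", f),
    }
    g_repair = {
        (rho, "cause_suggests_method", mu),
        (chi, "context_suggests_method", mu),
        (psi, "strategy_generalizes", mu),
        (xi, "knowledge_motivates_strategy", psi),
    }
    return {"defect": g_defect, "context": g_context, "mitigation": g_repair}

example = {
    "symptom": "NaN loss", "cause": "exp overflow", "context_info": "large logits",
    "knowledge": "float64 exp overflows above ~709", "phase": "training",
    "function": "softmax", "operation": "exp", "strategy": "rescaling",
    "method": "subtract max before exp",
}
idx = build_indices(example)
print(len(idx["defect"]), len(idx["context"]), len(idx["mitigation"]))  # 3 3 4
```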
These multi-view graph index structures form the foundation for the hybrid semantic retrieval described in the next section.

3.3. Hybrid Semantic Retrieval over Graph Indices

The constructed multi-view graph index enables a hybrid semantic retrieval mechanism that combines symbolic graph matching with subsymbolic vector similarity. Given a query representing a new numerical defect instance, the objective is to retrieve the most relevant defect–fix information from the knowledge base to augment LLM-based repair generation. The process involves three main steps: (1) query representation, (2) subgraph-based similarity retrieval, and (3) multi-view score aggregation and hybrid fusion.

3.3.1. Query Representation

An incoming query q, which may consist of code snippets, error messages, or natural language descriptions, is first processed by the same concept extraction function Φ used during index construction. In our experiments, Φ is implemented primarily via LLM-based analysis followed by fuzzy matching against the predefined semantic candidate set, ensuring retrieval efficiency with the knowledge graph index. This yields a set of semantic entities E_q = {C_q, ND_q, M_q, K_q} = Φ(q), where some entities may be partially specified or entirely missing. Three query subgraphs (G_q^(d), G_q^(c), and G_q^(m)) are instantiated from E_q, mirroring the index structure. Note that M_q (the mitigation) is typically empty, as the repair is unknown.
A key advantage of the graph-based representation is its ability to perform conceptual-level contextual expansion. When a query lacks certain entities (e.g., no mitigation information is present), the semantic relations in R can be used to traverse the graph indices and infer potentially relevant concepts. For example, if a query specifies a symptom σ_q and a component function f_q, the symptom_indicates and cause_suggests_method relations can be chained to retrieve mitigation methods associated with similar symptom–function pairs. This expansion is formally defined as a path traversal operation:
Expand(q) = ⋃_{p ∈ P} {n | ∃ path p from v_q to n in G^(k) with length < L}
where P is a set of relation sequences predefined as plausible inference chains and L is a maximum traversal depth. The expanded nodes enrich the query context without requiring explicit specification.
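To make the expansion concrete, the following is a minimal Python sketch of the path-traversal operation. The edge store, the `add_edge` helper, and the node names and relation chains are illustrative assumptions, not contents of the released NCKG index:

```python
from collections import defaultdict

# Toy knowledge-graph edge store: (source, relation) -> list of target nodes.
EDGES = defaultdict(list)

def add_edge(u, rel, v):
    EDGES[(u, rel)].append(v)

# Illustrative facts: a NaN symptom indicates a log-of-zero cause,
# which in turn suggests an add-epsilon mitigation.
add_edge("NaN", "symptom_indicates", "log_of_zero")
add_edge("log_of_zero", "cause_suggests_method", "add_epsilon")

# P: predefined plausible inference chains; L: maximum traversal depth.
CHAINS = [("symptom_indicates", "cause_suggests_method")]
MAX_DEPTH = 3

def expand(query_nodes):
    """Follow each predefined relation chain from the query nodes,
    collecting every node reached within MAX_DEPTH hops."""
    expanded = set()
    for chain in CHAINS:
        frontier = set(query_nodes)
        for depth, rel in enumerate(chain):
            if depth >= MAX_DEPTH:
                break
            frontier = {v for u in frontier for v in EDGES.get((u, rel), [])}
            expanded |= frontier
    return expanded

# A query mentioning only the NaN symptom is enriched with the inferred
# cause and a candidate mitigation concept.
print(sorted(expand({"NaN"})))  # ['add_epsilon', 'log_of_zero']
```
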

3.3.2. Subgraph-Based Similarity Retrieval

For each subgraph type k, the retrieval algorithm computes a similarity score between G q ( k ) and every candidate subgraph G i ( k ) in the corresponding index G ( k ) . The similarity function $\mathrm{sim}_G^{(k)} : G_q^{(k)} \times G_i^{(k)} \to [0, 1]$ is defined as a weighted combination of node overlap and edge overlap and is formalized as follows:
Let V q and V i denote the node sets of G q ( k ) and G i ( k ) respectively, where each node is identified by its semantic attribute value (e.g., “NaN” or “Adam”). Similarly, let E q and E i denote the edge sets, where each edge is represented as a triple ( u , v , r ) indicating a relation r from node u to node v.
  • Node Overlap Similarity: The Jaccard similarity coefficient between node sets measures the conceptual commonality:
    $$\mathrm{sim}_{node}(G_q^{(k)}, G_i^{(k)}) = \frac{|V_q \cap V_i|}{|V_q \cup V_i|}$$
    where the intersection is defined over nodes sharing identical semantic attributes, accounting for potential variations through normalization techniques such as lowercasing and stemming of textual attributes.
  • Edge Overlap Similarity: Structural similarity is measured via edge set overlap:
    $$\mathrm{sim}_{edge}(G_q^{(k)}, G_i^{(k)}) = \frac{|E_q \cap E_i|}{|E_q \cup E_i|}$$
    where two edges are considered identical if they share the same source node attribute, target node attribute, and relation type r.
Overall graph similarity is computed as a convex combination:
$$\mathrm{sim}_G^{(k)}(G_q^{(k)}, G_i^{(k)}) = \alpha \cdot \mathrm{sim}_{node} + (1 - \alpha) \cdot \mathrm{sim}_{edge}, \quad \alpha \in [0, 1]$$
Empirically, α = 0.6 is set to slightly prioritize node (conceptual) overlap while maintaining structural consistency.
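As a worked example, the similarity above can be computed in a few lines of Python. The query and candidate subgraphs below are invented, and attribute normalization is reduced to lowercasing for brevity:

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets; empty-vs-empty is defined as 0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def graph_similarity(nodes_q, edges_q, nodes_i, edges_i, alpha=0.6):
    """Convex combination of node-overlap and edge-overlap Jaccard scores."""
    def norm(values):
        # reduce textual variation; the full method also applies stemming
        return {v.lower() for v in values}
    sim_node = jaccard(norm(nodes_q), norm(nodes_i))
    sim_edge = jaccard(edges_q, edges_i)
    return alpha * sim_node + (1 - alpha) * sim_edge

# Invented subgraphs sharing two of three concepts and one of two
# (source, target, relation) triples.
nodes_q = {"NaN", "softmax"}
nodes_i = {"nan", "softmax", "Adam"}
edges_q = {("nan", "softmax", "occurs_in")}
edges_i = {("nan", "softmax", "occurs_in"), ("adam", "loss", "optimizes")}

# 0.6 * (2/3) + 0.4 * (1/2) = 0.6
print(graph_similarity(nodes_q, edges_q, nodes_i, edges_i))
```
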
For each subgraph type k, this process yields a ranked list L ( k ) of candidate defect–fix pairs, where each pair d i is associated with a graph similarity score s i ( k ) = s i m G ( k ) ( G q ( k ) , G i ( k ) ) . The top-ranked candidates from each subgraph view are retained for subsequent fusion.

3.3.3. Multi-View Score Aggregation and Hybrid Retrieval

To produce a unified retrieval result, a two-level fusion strategy is applied. First, the scores from each view are aggregated into a single graph retrieval score S G ( d i ) for each candidate d i :
$$S_G(d_i) = \sum_{k} \beta_k \cdot s_i^{(k)}$$
where β k are view-specific weights that reflect the relative importance of each semantic perspective. These weights can be tuned or set uniformly (e.g., 0.3/0.5/0.2 in our experiments). Candidates appearing in multiple lists benefit from score accumulation.
Simultaneously, a vector-based retrieval score S V ( d i ) is computed using dense embeddings. The query q and each candidate d i (represented by the concatenated textual attributes) are encoded into fixed-dimensional vectors v q and v i using a pre-trained language model.
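For completeness, the vector score is a standard cosine similarity over the two embeddings; a minimal pure-Python sketch (with short toy lists standing in for model embeddings) is:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))            # identical directions -> 1.0
print(round(cosine([1.0, 1.0], [1.0, 0.0]), 4))  # 45-degree angle -> 0.7071
```
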
Then the hybrid retrieval algorithm integrates graph-based and vector-based matching through a multi-stage fusion process. Given a query q with its extracted semantic entities and subgraphs G q ( k ) , the algorithm returns the top-K most relevant defect–fix pairs. Algorithm 1 presents the pseudocode of the proposed hybrid retrieval process.
Algorithm 1 Pseudocode of the hybrid retrieval method
Input: Query q with semantic subgraphs G_q^(d), G_q^(c), G_q^(m); graph indices G^(d), G^(c), G^(m); vector index V with embeddings {v_i}_{i=1}^{N}; retrieval parameter K; fusion weights β_1, β_2, β_3; fusion parameter γ
Output: Ranked list of top-K defect–fix pairs R

 1: Phase 1: Parallel Candidate Retrieval
 2: C_G ← ∅  {graph-based candidates}
 3: C_V ← ∅  {vector-based candidates}
 4: for k in {d, c, m} do
 5:     compute s_G^(k)(d_i) for all d_i ∈ D using sim_G^(k)(G_q^(k), G_i^(k))  {graph retrieval per subgraph view}
 6: end for
 7: for each d_i ∈ D do
 8:     S_G(d_i) ← Σ_{k=1}^{3} β_k · s_G^(k)(d_i)  {aggregate graph scores}
 9: end for
10: C_G ← Top-2K(D, S_G)  {top 2K by graph score}
11: {vector retrieval}
12: v_q ← Encoder(q)  {encode query to vector}
13: for each d_i ∈ D do
14:     S_V(d_i) ← cos(v_q, v_i)  {compute vector similarity}
15: end for
16: C_V ← Top-2K(D, S_V)  {top 2K by vector score}
17: Phase 2: Candidate Pool Formation
18: C ← C_G ∪ C_V  {union of candidate sets}
19: Phase 3: Score Normalization and Fusion
20: for each d_i ∈ C do
21:     S̃_G(d_i) ← (S_G(d_i) − min_{d_j ∈ C} S_G(d_j)) / (max_{d_j ∈ C} S_G(d_j) − min_{d_j ∈ C} S_G(d_j))  {normalize graph score}
22:     S̃_V(d_i) ← (S_V(d_i) − min_{d_j ∈ C} S_V(d_j)) / (max_{d_j ∈ C} S_V(d_j) − min_{d_j ∈ C} S_V(d_j))  {normalize vector score}
23:     S_hybrid(d_i) ← γ · S̃_G(d_i) + (1 − γ) · S̃_V(d_i)  {compute hybrid score}
24: end for
25: Phase 4: Re-ranking and Final Selection
26: R ← SortDescending(C, S_hybrid)  {sort by hybrid score}
27: return R[1:K]  {return top-K results}
The parameter γ can be adjusted based on query characteristics, such as the completeness of its semantic graph, allowing adaptive emphasis on structural or semantic features. By retaining candidates that excel in either retrieval modality, the method mitigates the risk of missing relevant matches due to the limitations of any single approach.
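The candidate pooling, normalization, and fusion phases of Algorithm 1 can be sketched in Python as follows, assuming per-candidate graph and vector scores are precomputed dictionaries (the candidate IDs and score values are illustrative):

```python
def minmax(scores):
    """Min-max normalize a dict of scores to [0, 1]; constant scores map to 0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def hybrid_rank(graph_scores, vector_scores, k, gamma=0.6):
    """Union the top-2K candidates of each modality, normalize within the
    pool, fuse with weight gamma, and return the top-K candidate IDs."""
    def top(scores):
        return sorted(scores, key=scores.get, reverse=True)[:2 * k]
    pool = set(top(graph_scores)) | set(top(vector_scores))
    g = minmax({d: graph_scores[d] for d in pool})
    v = minmax({d: vector_scores[d] for d in pool})
    fused = {d: gamma * g[d] + (1 - gamma) * v[d] for d in pool}
    return sorted(pool, key=fused.get, reverse=True)[:k]

# Illustrative scores: "b" is strong in both modalities and wins overall,
# even though neither modality ranks it first on its own.
graph_scores = {"a": 0.9, "b": 0.5, "c": 0.1, "d": 0.4}
vector_scores = {"a": 0.2, "b": 0.8, "c": 0.9, "d": 0.1}
print(hybrid_rank(graph_scores, vector_scores, k=1))  # ['b']
```
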

3.4. Knowledge-Guided Repair Generation

The top-K retrieved defect–fix pairs and associated knowledge are then used to augment a generative large language model (LLM) for repair generation. The key idea is to construct a structured prompt that provides the LLM with relevant context and examples, guiding it to produce a correct and context-aware fix for the query defect.

3.4.1. Prompt Design for Contextual Repair Generation

The prompt is designed as a multi-context instructional template that presents the model with both in-context examples and the target problem. The template structure is formally defined as follows:
Let R = { d i 1 , , d i k } be the top-K retrieved defect–fix pairs. For each pair d i j R , let bug i j and fix i j denote its defective and repaired code snippets, respectively. Let bug q denote the defective code from the query q.
The prompt P is constructed as a concatenation of three components:
1.
Instruction Header: A fixed natural language instruction specifying the repair task.
2.
Retrieved Exemplars Block: A sequence of m (m ≤ K) retrieved exemplars, formatted as shown below.
3.
Target Problem Block: The query’s defective code with a placeholder for the model to complete.
Listing 1 shows the structure of the knowledge-guided repair prompt template.
Listing 1. Knowledge-Guided Repair Prompt Template.
In practice, m is chosen based on the context window of the target model, which is typically set to 1–2 exemplars. The exemplars are selected from R based on their hybrid retrieval scores, ensuring the most relevant examples are provided.
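A minimal sketch of assembling such a prompt is given below. The header wording and the `build_prompt` helper are illustrative assumptions; only the "### Fixed Function:" marker is carried over from the template described here:

```python
# Illustrative instruction header (the actual template wording may differ).
HEADER = ("You are an expert in repairing numerical defects in deep "
          "learning code. Fix the numerical defect in the target function.")

def build_prompt(exemplars, target_bug, m=2):
    """Concatenate the instruction header, up to m retrieved exemplars,
    and the target problem block ending in a completion placeholder."""
    parts = [HEADER]
    for bug, fix in exemplars[:m]:
        parts.append(f"### Buggy Function:\n{bug}\n### Fixed Function:\n{fix}")
    # The prompt ends right after the marker so the model completes the fix.
    parts.append(f"### Buggy Function:\n{target_bug}\n### Fixed Function:\n")
    return "\n\n".join(parts)

example = ("loss = torch.log(x)", "loss = torch.log(x + 1e-8)")
prompt = build_prompt([example], "y = torch.log(p)")
print(prompt.endswith("### Fixed Function:\n"))  # True
```
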

3.4.2. Generation and Post-Processing

Given the prompt P constructed as shown above, the LLM generates a completion that ideally contains the fixed version of the target function. The generation probability can be formalized as
$$P_{LM}(\mathrm{fix} \mid P) = \prod_{t=1}^{T} P_{LM}(w_t \mid w_{<t}, P)$$
where w t represents the t-th token in the generated fix.
Following the generative phase, a systematic post-processing pipeline extracts and validates the repair candidates from the LLM outputs. The extraction process identifies the code block immediately following the “### Fixed Function:” marker in the model’s generated text.
To enhance robustness and account for the stochastic nature of generative models, a multi-sampling strategy is employed. For each target defect, the generation process is executed independently n times using identical prompt templates and retrieval contexts. Each generated output is processed through the aforementioned extraction method. Successful repair candidates are retained, while malformed outputs are discarded.
The validated repair candidates are aggregated through frequency analysis. Let C g e n = { c 1 , c 2 , , c m } represent the set of distinct, syntactically valid repair candidates generated across n trials. Each candidate c j is associated with a frequency count f j , corresponding to the number of times an identical repair was generated across all successful trials. This frequency distribution provides a measure of confidence for each candidate, with higher frequencies indicating greater consistency in the model’s repair generation.
Mathematically, this process can be formalized as
$$C_{final} = \Big\{ (c_j, f_j) \;\Big|\; c_j \in C_{gen},\; f_j = \sum_{i=1}^{n} \mathbb{I}\big(\mathrm{extract}(O_i) = c_j\big) \Big\}$$
where $O_i$ denotes the model output from the i-th generation trial and $\mathbb{I}$ is the indicator function.
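The extraction and frequency aggregation can be sketched as follows, assuming outputs that follow the "### Fixed Function:" marker convention (the sampled outputs themselves are invented):

```python
from collections import Counter

MARKER = "### Fixed Function:"

def extract(output):
    """Return the code block after the marker, or None for malformed outputs."""
    if MARKER not in output:
        return None
    return output.split(MARKER, 1)[1].strip() or None

def aggregate(outputs):
    """Frequency analysis over n sampled generations: distinct valid
    candidates paired with the number of trials that produced each one."""
    counts = Counter(c for c in map(extract, outputs) if c is not None)
    return counts.most_common()

samples = [
    f"{MARKER}\nx = torch.log(x + 1e-8)",
    f"{MARKER}\nx = torch.log(x + 1e-8)",
    "I cannot repair this function.",           # malformed, discarded
    f"{MARKER}\nx = torch.clamp(x, min=1e-8)",
]
print(aggregate(samples))
# [('x = torch.log(x + 1e-8)', 2), ('x = torch.clamp(x, min=1e-8)', 1)]
```

Higher counts act as a confidence signal, mirroring the frequency-based ranking described above.
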

4. Experimental Setup

To evaluate the effectiveness of the proposed NCKG framework and to answer the research questions outlined in Section 1, we conduct an experimental study covering retrieval accuracy, repair generation quality, and component-level ablation analysis. The experiments are designed to reflect realistic DNN numerical defect repair scenarios across different generative models. Specifically, we evaluate NCKG on a curated dataset of real-world numerical defects, compare it against representative baseline methods, and assess performance using both retrieval-oriented and repair-oriented evaluation metrics. All experiments are implemented under controlled and reproducible settings.

4.1. Dataset

The experiments are conducted on a curated dataset combining code-level defect–fix pairs and text-level numerical knowledge. The primary source is the DeepStability dataset [5], which contains real-world numerical stability issues in deep learning frameworks (e.g., TensorFlow and PyTorch) and their fixes, which are extracted from commit histories. Each instance includes the defective code, the fixed code, commit messages, and often related issue reports. This is augmented with numerical defect knowledge documents from relevant sources, including Stack Overflow posts tagged with numerical issues and excerpts on numerical analysis [6].
In this study, we frame the task as retrieving relevant defect–repair code pairs. After filtering, a curated dataset of 153 instances was retained. To rigorously evaluate the retrieval capability of NCKG and prevent data leakage, strict separation is maintained between the data used for knowledge base construction and for testing: the knowledge graph is built solely from instances in the training split, and during testing the model is presented only with the unseen instances in the test split. The mitigation-type proportions of the dataset are shown in Figure 4. The dataset is split 80%/20% into training (indexing) and test sets, with the mitigation-type distribution preserved in both splits. To mitigate potential biases introduced by a single random split, the test procedure is repeated 5 times with different random seeds for test set construction.
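As an illustration, a leakage-free, distribution-preserving split can be sketched as a per-strategy stratified partition. The `stratified_split` helper, the `label_fn` argument, and the toy instances are hypothetical, not the paper's actual preprocessing code:

```python
import random
from collections import defaultdict

def stratified_split(instances, label_fn, test_frac=0.2, seed=0):
    """80/20 split that preserves the mitigation-type distribution:
    each strategy group is shuffled and partitioned independently."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for inst in instances:
        groups[label_fn(inst)].append(inst)
    train, test = [], []
    for group in groups.values():
        rng.shuffle(group)
        n_test = round(test_frac * len(group))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Toy corpus: 10 'add_epsilon' and 10 'clamp' instances.
data = [("case%d" % i, "add_epsilon") for i in range(10)] + \
       [("case%d" % i, "clamp") for i in range(10, 20)]
train, test = stratified_split(data, label_fn=lambda x: x[1])
print(len(train), len(test))  # 16 4
```
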

4.2. Baseline Methods

We compare NCKG’s hybrid retrieval capability against representative baselines from two distinct retrieval paradigms: lexical term-based retrieval and dense neural retrieval. The purpose of this comparison is to assess whether conceptual graph-based retrieval provides advantages beyond both “shallow” and dense retrieval methods.
  • BM25 baseline—a classic lexical retrieval model that relies on exact keyword matching and term frequency statistics. BM25 is included as a representative lexical-based retrieval baseline and is applied to the concatenated textual fields of each defect–fix pair, including code tokens and commit messages. This baseline represents shallow lexical matching methods that do not explicitly model semantic or structural relationships.
  • DPR baseline—a neural dual-encoder retrieval model following the DPR paradigm, where queries and candidate passages are independently encoded into a shared dense vector space. In our implementation, both encoders are instantiated using GraphCodeBERT, a code-aware pretrained model that captures syntactic and semantic properties of the source code. The model is fine-tuned on the training split using defective contexts as queries and corresponding fix contexts as positive passages.
The NCKG-augmented prompts are compared against prompts using contexts retrieved by BM25 and the GraphCodeBERT-based dense retriever, enabling a controlled comparison between lexical, dense neural, and knowledge-graph-guided hybrid retrieval.
For the generation stage, six state-of-the-art LLMs from the HuggingFace Transformers library are employed to evaluate the effectiveness of the proposed knowledge-guided approach across different model architectures and scales:
  • GPT-Neo (125M, 1.3B, and 2.7B parameters)—autoregressive decoder-only models. Developed by EleutherAI, this family of models replicates the GPT-3 architecture using the open-source GPT-NeoX framework. These models were trained on The Pile, a diverse 825 GB English text corpus, providing strong general language-understanding capabilities.
  • Phi-2 (2.7B parameters)—developed by Microsoft, this compact transformer model demonstrates remarkable reasoning abilities despite its relatively small parameter count. Phi-2 was trained on curated, “textbook-quality” synthetic and filtered web data, resulting in superior performance on reasoning benchmarks compared to models of a similar size.
  • DeepSeekCoder (6.7B parameters)—a series of code-specialized LLMs pretrained on a corpus of 2 trillion tokens across 87 programming languages. In this study, we use the DeepSeekCoder-6.7B-Instruct version. It employs an advanced “fill-in-the-middle” training objective that enables bidirectional context awareness, making it particularly suitable for code completion and repair tasks.
  • CodeLlama (7B parameters)—Meta’s code-optimized variant of the Llama 2 architecture, further pretrained on 500B tokens of code-specific data. The model demonstrates state-of-the-art performance on code generation benchmarks including HumanEval and MBPP.
For each model, we use the same prompt template and retrieval context, enabling a controlled comparison of how different architectures leverage the provided semantic knowledge. During generation, we employ nucleus sampling with p = 0.95 and a temperature of τ = 0.8 to balance diversity and coherence. The maximum generation length is set to 512 tokens to accommodate typical function-level repairs.

4.3. Evaluation Metrics

4.3.1. Retrieval Metrics

Given a query defect, the retrieval system returns a ranked list of defect–fix pairs. We evaluate them based on the correctness of the mitigation strategy in the retrieved pairs, as the ultimate goal is to guide correct repairs.
1.
Exact Match@K: The proportion of queries where the top-K retrieved results contain at least one exact match of the ground-truth mitigation strategy. This measures the retrieval system’s ability to return perfectly relevant fixes within the top-K positions.
$$\mathrm{Exact\_Match@K} = \frac{1}{Q} \sum_{i=1}^{Q} \mathbb{I}\big(\exists\, j \le K : S_{ij} = S_i^*\big)$$
where Q is the total number of queries, $S_{ij}$ is the strategy of the j-th retrieved result for query i, and $S_i^*$ is the ground-truth strategy.
2.
Reciprocal Rank (RR) and Mean Reciprocal Rank (MRR): The reciprocal of the rank at which the first exact strategy match occurs. For a single query q, the reciprocal rank is defined as
$$\mathrm{RR}_q = \frac{1}{\min\{\, j : S_{qj} = S_q^* \,\}}$$
with $\mathrm{RR}_q = 0$ if no exact match is found. The MRR is the average of $\mathrm{RR}_q$ across all queries:
$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{RR}_q$$
3.
Overall Success Rate (OSR): The proportion of queries for which at least one exact strategy match exists within the entire retrieved list. In our implementation, we retrieve up to 10 results per query, so this is equivalent to EM@10. Formally,
$$\mathrm{OSR} = \frac{1}{Q} \sum_{i=1}^{Q} \mathbb{I}\big(\exists\, j \in \{1, \dots, R\} : S_{ij} = S_i^*\big)$$
where R is the total number of retrieved results per query (10 in our experiments). Note that the OSR provides an upper bound on recall given the retrieval depth constraint.
These metrics provide a multi-faceted evaluation: EM@K measures recall at different cut-offs, the MRR evaluates ranking quality, and the OSR indicates the absolute recall capability given the retrieval depth. All metrics are reported as averages over the test set.
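For reference, all three metrics can be computed with a few lines of Python, assuming each query's result is a ranked list of mitigation-strategy labels (the toy rankings below are invented):

```python
def exact_match_at_k(rankings, truths, k):
    """Fraction of queries whose top-k results contain the gold strategy."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(rankings, truths))
    return hits / len(rankings)

def mean_reciprocal_rank(rankings, truths):
    """Average of 1/rank of the first gold match (0 when absent)."""
    total = 0.0
    for ranked, truth in zip(rankings, truths):
        for rank, strategy in enumerate(ranked, start=1):
            if strategy == truth:
                total += 1.0 / rank
                break
    return total / len(rankings)

rankings = [["add_epsilon", "clamp"], ["clamp", "add_epsilon"]]
truths = ["add_epsilon", "add_epsilon"]

print(exact_match_at_k(rankings, truths, k=1))  # 0.5
print(mean_reciprocal_rank(rankings, truths))   # 0.75
# OSR is EM@R with R equal to the full retrieval depth (here, 2):
print(exact_match_at_k(rankings, truths, k=2))  # 1.0
```
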

4.3.2. Generation Metrics

For each query, the top-K retrieved contexts (K = 1 in our experiments) are used to construct a prompt for the generative LLM, which then produces a repaired code snippet. We evaluate the generated repair using a two-stage process: automated LLM-based assessment followed by human validation.
1.
Automated Assessment with an LLM: We employ a structured prompt template to instruct a Deepseek-R1-8b model to act as a software repair evaluation expert. The template asks for three analyses:
  • Strategy Match: It measures whether the generated repair’s mitigation strategy aligns with the ground truth. The confidence score (match_confidence) directly serves as S M [ 0 , 1 ] .
  • Code Similarity: It quantifies resemblance to the ground-truth fix through three sub-scores: syntactic, semantic, and textual similarity.
  • Feasibility: It assesses practical viability through three sub-scores: compilation feasibility, logical correctness, and the risk of new bugs.
The output is constrained to a specific JSON format containing scores and reasoning. From this JSON, we extract numerical metrics and calculate their averages as final indicators. The rationale behind each metric’s score is also preserved to facilitate subsequent manual verification.
2.
Human Validation: To ensure reliability, a subset of the Deepseek evaluations (stratified by score ranges) is reviewed by human experts. Experts verify the correctness of the strategy classification and the plausibility of the similarity/feasibility scores, correcting any clear discrepancies. The final reported generation metrics are based on this validated set.
For final reporting, we evaluate the top-1 and top-3 generated repairs (ranked by generation probability) using the overall score (avgOS), defined as the average of the strategy match score, the code similarity score, and the feasibility score.
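Assuming the evaluator returns the constrained JSON described above, the overall score can be computed as sketched below; the JSON field names are illustrative placeholders for the actual template keys:

```python
import json

def overall_score(judge_output):
    """avgOS: mean of strategy match confidence, mean code-similarity
    sub-scores, and mean feasibility sub-scores."""
    r = json.loads(judge_output)
    sim = sum(r["code_similarity"].values()) / len(r["code_similarity"])
    feas = sum(r["feasibility"].values()) / len(r["feasibility"])
    return (r["match_confidence"] + sim + feas) / 3

# Invented evaluator output with the three analysis dimensions.
judge_json = json.dumps({
    "match_confidence": 0.9,
    "code_similarity": {"syntactic": 0.6, "semantic": 0.7, "textual": 0.5},
    "feasibility": {"compilation": 0.9, "logic": 0.9, "new_bug_risk": 0.9},
})
print(round(overall_score(judge_json), 3))  # 0.8
```
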
As discussed in Appendix B, the LLM-based evaluator represents a practical and extensible instantiation of the evaluation process. Individual dimensions—such as compilation feasibility in the feasibility dimension—can be replaced or augmented with static analysis, compilation checks, or dynamic execution when appropriate tooling and executable artifacts become available.

4.4. Implementation and Configuration

The NCKG method and all baselines are implemented in Python 3.10. For hybrid retrieval, the fusion parameter γ is set to 0.6, balancing the contributions of graph-based and vector-based similarity. The dense retriever component uses GraphCodeBERT as the encoder backbone in a dual-encoder (DPR-style) architecture. The model is fine-tuned on the training set using a contrastive loss objective, ensuring that both the dense baseline and the dense component of NCKG share identical representation capacity.
For the baseline models, the BM25 baseline is implemented using the rank_bm25 Python package with default parameters. The dense baseline shares the same GraphCodeBERT encoder, training data, and optimization settings as NCKG, ensuring a fair and controlled comparison.
All experiments are conducted on a server with NVIDIA A100 GPUs.

5. Results

5.1. RQ1: Comparative Evaluation of Retrieval Effectiveness

Table 3 presents the comprehensive retrieval performance of NCKG against two strong baselines, BM25 and DPR, across five key metrics. The results show that NCKG achieves notably higher retrieval performance than both traditional lexical retrieval and modern neural retrieval methods under the evaluated settings.
NCKG achieves the highest scores across all evaluated metrics. Most notably, it attains an Exact Match@1 score of 0.871, roughly 2.5 times that of DPR (0.344) and 2.6 times that of BM25 (0.333). This suggests that NCKG’s top-ranked retrieval result is more likely to employ the exact mitigation strategy required by the query, a critical feature for practical and automated repair assistance. The high Mean Reciprocal Rank (MRR) of 0.8297 further confirms that NCKG not only retrieves relevant matches more frequently but also ranks them higher in the result list.
A similar performance trend is observed across different recall depths, indicating that the advantage of NCKG is not limited to top-1 retrieval. At Exact Match@3, NCKG (0.5520) surpasses DPR (0.3333) by 65.5% and BM25 (0.2727) by 102.5%. At Exact Match@5, NCKG (0.4410) maintains a lead of 47.5% over DPR (0.2990) and 66.6% over BM25 (0.2647). The Overall Success Rate (OSR), which measures the proportion of queries for which at least one correct match is found within the retrieved set (10 in our experiment), reaches 0.9460 for NCKG. This corresponds to a high recall rate (94.6%) within the evaluated dataset, substantially exceeding those of DPR (0.6883) and BM25 (0.6777). The similar performance between BM25 and DPR suggests that pure text matching is insufficient for this task.

5.2. RQ2: Impact of Retrieved Context on Repair Generation

The generation performance, an overall score integrating strategy matching, code similarity, and feasibility, is shown in Figure 5 for both the top-1 and top-3 generation settings.
In both settings, the NCKG retrieval backbone consistently outperformed BM25 and DPR across all six generative models (GPT-Neo: 125M, 1.3B, and 2.7B; Phi-2: 2.7B; DeepSeek-Coder: 6.7B; and CodeLlama: 7B).
For top-1 generation tasks, the average overall score across all models with NCKG is 0.723, compared to 0.553 with BM25 and 0.565 with DPR, representing a relative improvement of 30.7% and 28.0%, respectively. The advantage of NCKG is most pronounced with the GPT-Neo 2.7B model, where it achieves an overall score of 0.744, surpassing BM25 (0.514) and DPR (0.561) by 44.6% and 32.6%, respectively.
Results for top-3 generation show a similar trend, with NCKG achieving an average Overall Score of 0.697 across all models, compared to 0.581 for BM25 and 0.566 for DPR. This corresponds to relative improvements of 20.0% and 23.1%, respectively. Notably, also with GPT-Neo 2.7B, NCKG reaches 0.738, which is 32.9% higher than DPR (0.555) and 29.4% higher than BM25 (0.570).
Among the models, the GPT-Neo family showed the greatest sensitivity to retrieval quality. NCKG provides a consistent performance gain under the evaluated configurations, particularly for the 125M parameter version, where it plays an important role in achieving competitive results. The performance of DeepSeekCoder is generally high and stable when using NCKG. The CodeLlama model also achieves high absolute scores overall.

5.3. RQ3: Ablation Study of the Hybrid Retrieval Mechanism

Figure 6 presents the results of the ablation study comparing the complete version of NCKG (hybrid) against its graph-only and vector-only variants in terms of both retrieval and final generation performances (using DeepSeek-Coder for its stable performance across multiple runs). The graph-only variant only uses the graph-based retrieval component ( γ = 1.0 ), while the vector-only variant only uses the vector-based retrieval component ( γ = 0.0 ), and the complete NCKG method runs with optimal fusion ( γ = 0.6 ).
The hybrid version of NCKG achieves the best overall performance with an Exact Match@1 of 0.871, an MRR of 0.552, and an OSR of 0.946. The graph-only variant follows closely, with strong Exact Match@1 (0.871) and OSR (0.935), but its MRR (0.498) is notably lower than that of the hybrid approach. The vector-only variant performs significantly worse across all metrics, with an Exact Match@1 of only 0.366.
The generation results, as measured by match confidence, similarity, feasibility, and their composite score, mirror the retrieval trends. The hybrid NCKG yields the best generation quality (composite score: 0.706). The graph-only variant lags slightly (0.693), and the vector-only variant performs the worst (0.604). This also demonstrates a direct correlation between retrieval precision and downstream generation quality.

6. Discussion

The experimental results comprehensively validate the proposed NCKG framework and its hybrid retrieval mechanism. The key findings and their implications are discussed below.

6.1. The Necessity of Structured Semantic Retrieval (RQ1)

The observed performance advantage of NCKG over lexical-based and neural-based retrieval baselines (BM25 and DPR) in RQ1 provides empirical evidence for our core hypothesis: effective repair knowledge retrieval requires modeling the explicit semantic relationships between defect symptoms, root causes, and mitigation strategies, rather than relying on surface-level text similarity.
The similar performance between BM25 and DPR reinforces the limitation of text-only retrieval for this task. Both methods treat defect–fix pairs as unstructured text passages, missing the structured relationships that define repair relevance. For instance, a query containing a code snippet like torch.softmax(x, dim = −1) that results in NaN may lexically match a fix that modifies a similar code token softmax, but DPR could also retrieve unrelated fixes that mention NaN in entirely different contexts, such as a data-loading routine. NCKG, by contrast, ensures that retrieved pairs share not just superficial code tokens but also a coherent semantic pattern of defect and repair via relational edges (e.g., linking the NaN symptom node to the softmax function node and further to the rewrite math formula strategy node), enabling it to retrieve pairs that share analogous repair logic even if their surface descriptions differ.
The observed performance gap suggests the practical importance of the proposed semantic framework as a prerequisite for effective retrieval. While text-based methods fail to distinguish between coincidental lexical matches and semantically relevant repair patterns, NCKG’s structure ensures that retrieved cases are linked by a shared, machine-interpretable semantic logic. This capability directly enables the transfer of repair strategies across heterogeneous code instances, a task for which unstructured retrieval proves inadequate.
Also, the explicit modeling of symptom–context–fix relationships benefits the repair strategy suggestion. The high Exact Match@1 and OSR scores demonstrate that NCKG can reliably serve as a highly accurate “look-up” mechanism for known defect patterns and a robust “recall” system for identifying semantically analogous cases, which are the foundation for effective repair generation.
In summary, the results for RQ1 affirm that NCKG provides a far more effective retrieval mechanism for numerical defect repair contexts compared to baseline retrievers, successfully aligning retrieved fixes with the required repair strategies. The results validate the core hypothesis that repair strategies are transferable across semantically similar defect contexts and that explicitly modeling these contexts through a structured semantic graph within the abstract conceptual level is key to unlocking this transferability.

6.2. From Accurate Retrieval to Reliable Generation (RQ2)

The results of RQ2 establish a direct causal link between retrieval quality and the reliability of LLM-based repair generation. The consistent performance gain when using the NCKG-retrieved context, across LLMs of varying scales and architectures, confirms that providing precise, semantically structured exemplars is more effective than providing keyword-based or generally relevant text matches. The retrieval context allows the model to better understand the specific defect pattern and the appropriate repair strategy, leading to generations that are more likely to be logically correct and syntactically valid.
The performance gains are consistent across model sizes, indicating that NCKG’s retrieval augmentation benefits both small and large language models. While larger LLMs have a stronger capacity to “ignore” irrelevant parts of the context or to “hallucinate” correct fixes from minimal clues, they still achieve their best performance when guided by the precise, relation-aware examples provided by NCKG. NCKG’s context reduces the LLM’s burden of inference and strategy selection, channeling its generative capacity towards synthesizing a correct fix based on a well-understood pattern. This is particularly crucial for smaller models (e.g., GPT-Neo 125M), where NCKG’s context is essential for achieving competitive results, effectively democratizing access to high-quality automated repair.
This finding validates the knowledge-guided generation pipeline. NCKG not only enhances retrieval metrics but also directly translates this enhancement into higher-quality generated repairs, validating its role as an effective augmentative component for LLM-based program repair systems.

6.3. Synergy of the Hybrid Design (RQ3)

The ablation study (RQ3) offers critical insights into the inner workings of NCKG and justifies its hybrid design. The graph-only variant’s strong Exact Match@1 performance highlights the indispensable role of structured semantic matching for precision—it excels at finding the single most relevant fix. However, its lower MRR suggests a weakness in optimally ranking a set of relevant results, potentially due to the discrete nature of graph matching.
The vector-only variant’s poor performance confirms that unstructured semantic similarity is too coarse for this precision-demanding task. The integration of the vector-based component (at an optimal fusion weight of γ = 0.6) addresses the ranking limitation of the graph component. It acts as a “smoothing” function, using continuous semantic embeddings to re-rank and diversify the candidate set retrieved by the graph, thereby improving the MRR and the overall quality of the context set for generation.
Therefore, the two components are complementary: the graph index provides the essential, precise semantic scaffolding, while the vector index enhances ranking robustness and semantic coverage. Their synergy is what enables NCKG to achieve state-of-the-art performance in both retrieval and the downstream generative task, demonstrating that structured semantic matching and dense semantic similarity are complementary and mutually reinforcing for the task of numerical defect repair.

6.4. Limitation and Future Work

Limitations: The primary limitation pertains to the scale of the dataset. Although carefully curated, the collection of real-world DNN numerical defect–fix pairs remains limited in size compared to broader software defect corpora, reflecting the inherent difficulty of annotating this specialized domain. This constraint may affect the generalizability and robustness of our findings. To mitigate associated threats to validity, all experiments were conducted with multiple random runs, with reported metrics representing averaged results. Given the general absence of reliable test suites for DNN numerical defects, the evaluation of generated repairs was conducted using a structured, LLM-assisted assessment template, followed by expert validation to ensure objectivity and reduce subjective bias.
Future Work: These limitations naturally lead to two main future directions. First, expanding the dataset with more diverse and extensive real-world examples is crucial for further validating and strengthening the approach. Second, and more fundamentally, the proposed unified conceptual semantic framework is designed not only for the immediate retrieval and generation task but also to serve as a foundational schema for systematically organizing knowledge in the domain of DNN numerical stability. Future work will leverage this framework to guide the consistent annotation and curation of larger datasets, enabling more comprehensive benchmarking and facilitating broader applications in DNN reliability analysis and repair.

7. Conclusions

This paper presented the Numerical–Conceptual Knowledge Graph-based (NCKG) method for the retrieval-augmented repair of numerical defects in deep neural networks. To bridge the semantic gap in unified knowledge representation and LLM-based repair, a structured semantic formalization that defines core entities and their relations is introduced, providing a unified, machine-interpretable schema for representing heterogeneous defect–repair knowledge. Based on this schema, a multi-view graph index is constructed, and a hybrid retrieval mechanism that integrates precise graph matching with semantic vector similarity is developed. This enables the retrieval of contextually relevant defect–fix pairs, which are then used within a knowledge-guided pipeline to augment LLMs for generating reliable repairs.
The experimental results indicate the effectiveness of the proposed method. NCKG outperforms the considered retrieval baselines (BM25 and DPR) on the evaluated dataset in strategy matching accuracy, validating that repair strategy transfer is governed by the structured semantic context. When used to augment various generative LLMs, the NCKG-retrieved context tends to yield higher-quality repairs, demonstrating that high-quality semantic grounding is critical for successful repair generation regardless of the scale of the generative model. Ablation studies confirm the complementary value of both graph-based and vector-based components in the hybrid design.
In summary, this work presents NCKG as an effective, semantics-driven paradigm that shows strong potential to enhance the reliability of automated numerical defect repair by bridging LLMs with structured domain knowledge. The positive results, which were obtained on a dataset at a specialized and limited scale, indicate a promising direction for DNN numerical defect repair generation.

Author Contributions

Conceptualization, J.L. and Q.Z.; methodology, J.L.; software, J.L.; validation, J.L., T.S., and Q.Z.; formal analysis, J.L.; investigation, J.L. and Q.Z.; resources, J.L.; data curation, J.L.; writing—original draft preparation, J.L.; writing—review and editing, J.A., Q.Z., and T.S.; visualization, J.L.; supervision, J.A.; project administration, J.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code of the model implementation can be downloaded at https://gitee.com/ljyyyy1/NCKG.git (accessed on 18 February 2026). The DeepStability dataset can be downloaded from [5] (https://deepstability.github.io/, accessed on 18 February 2026). The full results can be provided on request.

Acknowledgments

The authors would like to thank the editors and anonymous reviewers for their constructive comments and suggestions for improving this work.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Detailed Definition of Conceptual Semantics for DNN Numerical Defect

Appendix A.1. Detailed Definition of Conceptual Semantics for DNN Numerical Defect

Numerical defects in deep neural networks arise from complex interactions between computational components, numerical operations, and execution conditions. To capture these aspects in a structured manner, numerical defects and their repairs are modeled using four core semantic elements: the execution context (C), numerical defect (ND), mitigation method (M), and constraint (K). These semantic elements draw upon established software engineering standards and numerical computing principles to ensure conceptual rigor and practical applicability.
Together, these elements describe the location of a defect, its observable manifestation, the applied repair, and the knowledge required to justify the repair. Each element is defined as a structured tuple of attributes, designed to capture the multifaceted nature of numerical instability in DNNs.
The execution context ( C ) semantic element represents the deep learning entity in which a numerical defect occurs. A component is characterized by its high-level category within the neural network architecture, the concrete function or API involved, and the underlying mathematical operation. This multi-level abstraction enables defect localization at different granularities, ranging from architectural components such as convolutional layers or optimizers to specific mathematical primitives such as exponentiation or division. The following provides a formal definition of C .
Definition A1 (Execution Context).
An Execution Context C represents a functional or structural element within a DNN pipeline where a numerical defect may manifest. It is defined as a tuple C = ( τ , f , o p ) , where
  • τ ∈ T_c denotes the high-level phase or module type, drawn from the enumerated set T_c = {Linear, Conv, RNN, LSTM, Optimizer, Loss, Data, Gradient, Quantization, Activation, TensorMath, Other}.
  • f is a descriptor specifying the particular function or sub-module (e.g., softmax).
  • o p is a descriptor indicating the core mathematical operation involved (e.g., matrix multiplication, exponential, or normalization).
The numerical defect ( ND ) captures the abnormal numerical behavior exhibited during execution. It consists of the observable symptom, such as NaN, infinity, or numerical overflow, together with a semantic description of the root cause responsible for the anomaly. In addition, contextual information describing the conditions under which the defect arises is included to distinguish between defects with similar symptoms but different underlying causes. The following provides a formal definition of ND .
Definition A2 (Numerical Defect).
A numerical defect N D characterizes a specific instance of numerical misbehavior. It is defined as a tuple N D = ( σ , ρ , χ ) , where
  • σ ∈ T_a is the observed symptom or anomaly type from the enumerated set T_a = {NaN, Inf, Incorrect, Inaccurate, Undefined, RuntimeError, Other}.
  • ρ is a description of the hypothesized root cause (e.g., overflow, underflow, or GradientExplosion).
  • χ is a textual context providing additional situational details about the problem’s manifestation.
The mitigation method ( M ) element describes the repair applied to resolve a numerical defect. Mitigations are characterized by a high-level repair strategy, such as precision adjustment, mathematical reformulation, or algorithm substitution, and a concrete repair method that instantiates the strategy (if available). Modeling mitigations explicitly allows repairs to be reasoned about semantically rather than treated as independent code modifications.
Definition A3 (Mitigation Method).
A mitigation method M represents a corrective action or strategy applied to resolve a numerical defect. It is defined as a tuple M = ( ψ , μ ) , where
  • ψ ∈ T_m is the high-level strategy, drawn from T_m = {change variable type, increase variable precision, rewrite math formula, use a different algorithm, add warning, add overflow check, limit input range, Other}.
  • μ is a string descriptor specifying the concrete method or implementation of the strategy.
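As a concrete instance of the "rewrite math formula" strategy, consider the classic softmax stabilization (the "subtract max(logit)" trick referenced later in the relation schema). The sketch below is illustrative; the function names are hypothetical, and in plain Python the overflow surfaces as an OverflowError, whereas array libraries would instead silently produce Inf/NaN:

```python
import math

def softmax_naive(logits):
    # Defect ND = (symptom: overflow error, root cause: overflow):
    # math.exp overflows for large logits.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(logits):
    # Mitigation M = (psi: "rewrite math formula", mu: "subtract max(logit)"):
    # softmax(x) == softmax(x - max(x)), and exp(x - max(x)) <= 1 never overflows.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Here `softmax_naive([1000.0, 1001.0, 1002.0])` fails, while the reformulated version returns a valid probability distribution for the same input.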
The constraint ( K ) element encodes auxiliary knowledge required to support or justify a defect or a repair. Such knowledge may originate from mathematical theory, implementation guidelines, or empirical observations shared in community discussions. By representing constraints explicitly, the semantic model preserves the rationale behind repair decisions and supports informed adaptation of repairs to new contexts.
Definition A4 (Constraint).
A constraint K encapsulates external knowledge or requirements that inform the validity and applicability of a mitigation method. It is defined as a tuple K = ( κ , ξ ) , where
  • κ ∈ T_k is the knowledge type from T_k = {Implementation, MathKnowledge, Forum}.
  • ξ is a textual knowledge context (e.g., a snippet of forum discussion, a mathematical principle, or a code implementation).
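To make the four tuples concrete, the following minimal sketch renders C, ND, M, and K as Python dataclasses. The field names, type choices, and the sample record are illustrative only and not part of the NCKG implementation:

```python
from dataclasses import dataclass

@dataclass
class ExecutionContext:   # C = (tau, f, op)
    tau: str   # phase/module type, e.g. "Activation"
    f: str     # function descriptor, e.g. "softmax"
    op: str    # mathematical operation, e.g. "exponential"

@dataclass
class NumericalDefect:    # ND = (sigma, rho, chi)
    sigma: str  # observed symptom, e.g. "NaN"
    rho: str    # hypothesized root cause, e.g. "overflow"
    chi: str    # contextual description of the manifestation

@dataclass
class MitigationMethod:   # M = (psi, mu)
    psi: str    # high-level strategy, e.g. "rewrite math formula"
    mu: str     # concrete method, e.g. "subtract max(logit)"

@dataclass
class Constraint:         # K = (kappa, xi)
    kappa: str  # knowledge type, e.g. "MathKnowledge"
    xi: str     # knowledge context justifying the repair

# One defect-fix record ties the four elements together.
record = (
    ExecutionContext("Activation", "softmax", "exponential"),
    NumericalDefect("NaN", "overflow", "large logits during training"),
    MitigationMethod("rewrite math formula", "subtract max(logit)"),
    Constraint("MathKnowledge", "softmax(x) == softmax(x - max(x))"),
)
```

A record of this shape describes where a defect occurs, how it manifests, how it was repaired, and why the repair is valid, mirroring the four-element model above.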

Appendix A.2. Automated Extraction Process and Semantic Concept Alignment

To ensure semantic consistency between NCKG and the foundational ontologies (FIDES/OSDEF), the proposed method reuses and specializes ontological concepts during the concept definition phase and achieves alignment with the ontological semantics through a controlled hybrid extraction process during implementation. This ontology-aligned concept layer provides explicit semantic guidance for the extraction function, thereby ensuring that extracted concepts remain ontologically valid and semantically grounded.
The top-level semantic elements of NCKG (C, N D , and M) and their core relations are based on the reuse and specialization of existing, mature ontological design patterns (ODPs) and conceptual classifications. This approach guarantees definitional consistency from the ground up, and the well-defined concepts, in turn, provide clear guidance for the subsequent extraction process.
1. Structural Reuse (from FIDES/EEPSA): The NCKG method directly adopts the execution–executor–procedure (EEP) and result–context (RC) Ontology Design Patterns (ODPs) from the EEPSA ontology. The definition logic of C, N D , and M directly instantiates these patterns. This pattern reuse ensures that when describing the core logical chain of “who (executor), in what context, did what (defective/corrective procedure), and led to what result,” the NCKG method maintains structural isomorphism with broader software and engineering ontologies.
2. Conceptual Grounding (from OSDEF): The distinction within the numerical defect ( N D ) entity between the root cause ( ρ ) and the observable symptom ( σ ) strictly follows the defect–error–failure conceptual taxonomy of the Ontology of Software Defects, Errors and Failures (OSDEF), which is itself grounded in the Unified Foundational Ontology (UFO). This ensures rigorous conceptual clarity in defining the core problem.
The detailed semantic mapping from these foundational ontologies to the NCKG elements is summarized in Table A1.
Table A1. Semantic mapping from foundational ontologies to NCKG elements.

| Semantic Element | Corresponding FIDES/OSDEF Knowledge | Source | Alignment Rationale and Purpose |
|---|---|---|---|
| Execution context (C), numerical defect (ND), and mitigation method (M) | Execution–Executor–Procedure (EEP) ODP and Result–Context (RC) ODP | FIDES | Reusable structural pattern: formally links processes (defect and fix), their executors (DNN component), and the circumstantial context of their occurrence. |
| Phase type (C.τ) | Model: What is the objective of the model? | FIDES | Modeling-phase alignment: anchors the defect and repair within a specific stage of the ML life cycle, enabling phase-aware reasoning and mitigation. |
| Function descriptor (C.f) | Model: Which package implements the algorithm? | FIDES | Implementation context: identifies the specific component where the defect manifests, linking the implementation-level context. |
| Operation descriptor (C.op) | Model: What is the base algorithm of the ML-based model? | FIDES | Algorithmic context: specifies the core algorithm or mathematical operation involved, connecting the computational foundation of the defect. |
| Symptom type (ND.σ) | Failure: a subtype of event that brings about a failure state. | OSDEF | Failure characterization: aligns with observable failure events in software, providing a taxonomy for the external manifestations of numerical defects. |
| Root cause (ND.ρ) | Defect: a subtype of vulnerability that inheres in an object. | OSDEF | Defect root cause: disambiguates the underlying bug from its symptomatic errors or failures. |
| Contextual description (ND.χ) | Error: the incorrect internal state. | OSDEF | Error description: represents the erroneous computational state that bridges the internal defect and the external failure, enriching the semantic description of the defect. |
| Constraint (K) | Execution–Executor–Procedure (EEP) ODP | FIDES | Validity anchoring: provides a condition or property that anchors the validity of a procedure. |
The extraction of information from raw data artifacts to NCKG semantic element instances is governed by a controlled hybrid extraction mechanism centered on these ontology-aligned, predefined semantic elements. In the specific implementation, the extraction function Φ ensures consistency through the following collaborative stages:
Ontology-Guided LLM Conceptual Analysis: Leveraging the semantic abstraction capability of large language models, raw textual artifacts (e.g., error messages, commit descriptions, and fix rationales) are analyzed and classified into predefined semantic element categories. This process is explicitly guided by structured prompts that enumerate the target semantic elements together with their ontology-aligned definitions, thereby constraining the LLM output to the established category system rather than allowing free-form generation.
Fuzzy Matching-Based Normalization: For high-level semantic elements (e.g., phase type, symptom type, and strategy type), the intermediate outputs produced by the LLM are further normalized via fuzzy string matching against a predefined candidate vocabulary. This step resolves lexical variations and near-synonymous expressions, ensuring that all extracted instances are mapped to canonical semantic anchors shared across the knowledge base and query-time extraction.
Manual Analysis as Benchmark: During the construction of the retrieval knowledge base, manually curated analysis results from domain datasets are referenced as a benchmark for the classification and instantiation of key semantic concepts. These manually aligned instances provide reliable grounding for the automated extraction process and serve to validate the correctness and coverage of the hybrid extraction mechanism.
Through this ontology-guided, hybrid extraction design, all instantiated NCKG entities remain semantically consistent with the FIDES and OSDEF ontologies at the concept definition level while enabling scalable and robust abstraction from heterogeneous and unseen data sources.
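The fuzzy-matching normalization stage described above can be sketched with the standard library. The candidate vocabulary below is the strategy set T_m from Definition A3; the cutoff value and fallback behavior are illustrative assumptions, not the paper's exact configuration:

```python
import difflib

# Canonical strategy vocabulary T_m from Definition A3.
STRATEGY_VOCAB = [
    "change variable type", "increase variable precision",
    "rewrite math formula", "use a different algorithm",
    "add warning", "add overflow check", "limit input range", "Other",
]

def normalize(raw: str, vocab=STRATEGY_VOCAB, cutoff=0.6) -> str:
    """Map a free-form LLM-produced label onto the canonical vocabulary
    via fuzzy string matching; labels with no close match fall back to
    'Other' (an assumed, not documented, fallback)."""
    lowered = [v.lower() for v in vocab]
    matches = difflib.get_close_matches(raw.lower(), lowered, n=1, cutoff=cutoff)
    if not matches:
        return "Other"
    return vocab[lowered.index(matches[0])]
```

For instance, a lexical variant such as "add overflow checks" would be anchored to the canonical "add overflow check", keeping knowledge-base and query-time labels consistent.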

Appendix B. Extensible Evaluation Process for Generated Repairs

Automated evaluation of numerical defect repairs in deep neural network code presents unique challenges. Unlike traditional software repair benchmarks, most real-world DNN numerical defect cases lack executable environments, complete datasets, or reproducible training configurations. As a result, standard static compilation checks or dynamic test execution are often infeasible.
The evaluation protocol adopted in this work is designed to assess the quality of generated numerical defect repairs under these practical constraints. The LLM-based evaluator serves as a flexible semantic judge that can operate over heterogeneous and non-executable inputs while still providing structured and explainable assessment signals. Specifically, the evaluation decomposes repair quality into three complementary dimensions:
  • Strategy Match (SM): whether the generated repair applies a mitigation strategy consistent with the defect’s root cause.
  • Code Similarity: how closely the generated repair aligns with known fixes at the syntactic, semantic, and textual levels.
  • Feasibility: whether the repair is practically applicable without introducing new issues.
Importantly, the above dimensions are not inherently tied to LLM-based evaluation. They are intentionally defined in an abstract manner to support alternative or stronger validation mechanisms:
  • Strategy match can be replaced or complemented by
    -
    Rule-based strategy classifiers;
    -
    Ontology-driven matching over defect–fix concepts;
    -
    Supervised classifiers trained on annotated defect–strategy pairs.
  • Code similarity can be computed using
    -
    CodeBLEU [41], which combines n-gram matching, AST structure matching, data-flow matching, and semantic weighting for code;
    -
    Embedding-based code similarity models independent of LLM reasoning.
  • Feasibility can be strengthened through
    -
    Static syntax checking and type checking;
    -
    Static analysis tools targeting numerical safety;
    -
    Execution-based validation when runnable environments and test data are available.
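Two of the replaceable checks above can be approximated with standard-library tools. In this hedged sketch, `ast.parse` stands in for static syntax checking under the Feasibility dimension, and a `difflib` ratio stands in for CodeBLEU-style or embedding-based code similarity; both are illustrative stand-ins, not the evaluator used in this work:

```python
import ast
import difflib

def feasibility_syntax_check(code: str) -> bool:
    """Minimal feasibility signal: does the generated Python repair
    parse at all? A stand-in for fuller static/dynamic validation."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def surface_similarity(generated: str, reference: str) -> float:
    """Character-level textual similarity in [0, 1]; an illustrative
    stand-in for CodeBLEU or embedding-based code similarity."""
    return difflib.SequenceMatcher(None, generated, reference).ratio()
```

A generated repair that fails the parse check can be rejected outright, while the similarity score supports ranking candidates against known fixes.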
The LLM-based evaluator provides a lightweight and feasible approximation of repair correctness under current practical constraints, enabling empirical studies to be conducted while leaving room for more rigorous verification mechanisms to be integrated in the future.

References

  1. Humbatova, N.; Jahangirova, G.; Bavota, G.; Riccio, V.; Stocco, A.; Tonella, P. Taxonomy of real faults in deep learning systems. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion 2020), Seoul, Republic of Korea, 5–11 October 2020; Curran Associates, Inc.: Red Hook, NY, USA, 2020; pp. 1110–1121. [Google Scholar] [CrossRef]
  2. Harzevili, N.S.; Shin, J.; Wang, J.; Wang, S.; Nagappan, N. Characterizing and Understanding Software Security Vulnerabilities in Machine Learning Libraries. In Proceedings of the 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), Melbourne, Australia, 15–16 May 2023; Curran Associates, Inc.: Red Hook, NY, USA, 2023; pp. 27–38. [Google Scholar] [CrossRef]
  3. Zhang, Y.; Ren, L.; Chen, L.; Xiong, Y.; Cheung, S.C.; Xie, T. Detecting numerical bugs in neural network architectures. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering; ACM: New York, NY, USA, 2020; pp. 826–837. [Google Scholar] [CrossRef]
  4. Yan, M.; Chen, J.; Zhang, X.; Tan, L.; Wang, G.; Wang, Z. Exposing numerical bugs in deep learning via gradient back-propagation. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering; ACM: New York, NY, USA, 2021; pp. 627–638. [Google Scholar] [CrossRef]
  5. Kloberdanz, E.; Kloberdanz, K.G.; Le, W. DeepStability: A study of unstable numerical methods and their solutions in deep learning. In Proceedings of the 44th International Conference on Software Engineering, Pittsburgh, PA, USA, 25–27 May 2022; ACM: New York, NY, USA, 2022; pp. 586–597. [Google Scholar] [CrossRef]
  6. Wang, G.; Wang, Z.; Chen, J.; Chen, X.; Yan, M. An Empirical Study on Numerical Bugs in Deep Learning Programs. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, Rochester, MI, USA, 10–14 October 2022; ACM: New York, NY, USA, 2022; pp. 1–5. [Google Scholar] [CrossRef]
  7. Zhang, X.; Zhai, J.; Ma, S.; Shen, C. AUTOTRAINER: An Automatic DNN Training Problem Detection and Repair System. In Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), Madrid, Spain, 22–30 May 2021; IEEE Press: New York, NY, USA, 2021; pp. 359–371. [Google Scholar] [CrossRef]
  8. Wardat, M.; Cruz, B.D.; Le, W.; Rajan, H. DeepDiagnosis: Automatically diagnosing faults and recommending actionable fixes in deep learning programs. In Proceedings of the 44th International Conference on Software Engineering, Pittsburgh, PA, USA, 25–27 May 2022; ACM: New York, NY, USA, 2022; pp. 561–572. [Google Scholar] [CrossRef]
  9. Li, L.; Zhang, Y.; Ren, L.; Xiong, Y.; Xie, T. Reliability Assurance for Deep Neural Network Architectures Against Numerical Defects. In Proceedings of the 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 14–20 May 2023; IEEE Press: New York, NY, USA, 2023; pp. 1827–1839. [Google Scholar] [CrossRef]
  10. Shang, Y.; Liu, S. LRNN: A Formal Logic Rules-Based Neural Network for Software Defect Prediction. In Proceedings of the Formal Methods and Software Engineering: 25th International Conference on Formal Engineering Methods, ICFEM 2024, Hiroshima, Japan, 2–6 December 2024; Springer Nature: Berlin/Heidelberg, Germany, 2024; pp. 106–124. [Google Scholar] [CrossRef]
  11. Abdu, A.; Zhai, Z.; Abdo, H.A.; Lee, S.; Al-masni, M.A.; Gu, Y.H.; Algabri, R. Cross-project software defect prediction based on the reduction and hybridization of software metrics. Alex. Eng. J. 2025, 112, 161–176. [Google Scholar] [CrossRef]
  12. Yamaguchi, F.; Golde, N.; Arp, D.; Rieck, K. Modeling and discovering vulnerabilities with code property graphs. In Proceedings of the 2014 IEEE Symposium on Security and Privacy, Berkeley, CA, USA, 18–21 May 2014; IEEE Computer Society: Washington, DC, USA, 2014; pp. 590–604. [Google Scholar]
  13. Xu, J.; Ai, J.; Liu, J.; Shi, T. ACGDP: An Augmented Code Graph-Based System for Software Defect Prediction. IEEE Trans. Reliab. 2022, 71, 850–864. [Google Scholar] [CrossRef]
  14. Radjenović, D.; Heričko, M.; Torkar, R.; Živkovič, A. Software fault prediction metrics: A systematic literature review. Inf. Softw. Technol. 2013, 55, 1397–1418. [Google Scholar] [CrossRef]
  15. Muthukumaran, K.; Choudhary, A.; Murthy, N.B. Mining GitHub for novel change metrics to predict buggy files in software systems. In Proceedings of the 2015 International Conference on Computational Intelligence and Networks, Odisha, India, 12–13 January 2015; IEEE Press: New York, NY, USA, 2015; pp. 15–20. [Google Scholar]
  16. Ai, J.; Su, W.; Zhang, S.; Yang, Y. A software network model for software structure and faults distribution analysis. IEEE Trans. Reliab. 2019, 68, 844–858. [Google Scholar] [CrossRef]
  17. Phan, A.V.; Le Nguyen, M.; Bui, L.T. Convolutional neural networks over control flow graphs for software defect prediction. In Proceedings of the 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI), Boston, MA, USA, 6–8 November 2017; IEEE Press: New York, NY, USA, 2017; pp. 45–52. [Google Scholar]
  18. Wang, S.; Liu, T.; Nam, J.; Tan, L. Deep semantic feature learning for software defect prediction. IEEE Trans. Softw. Eng. 2018, 46, 1267–1293. [Google Scholar] [CrossRef]
  19. Zhao, Z.; Yang, B.; Li, G.; Liu, H.; Jin, Z. Precise learning of source code contextual semantics via hierarchical dependence structure and graph attention networks. J. Syst. Softw. 2022, 184, 111108. [Google Scholar] [CrossRef]
  20. Wang, L.; Sun, C.; Zhang, C.; Nie, W.; Huang, K. Application of knowledge graph in software engineering field: A systematic literature review. Inf. Softw. Technol. 2023, 164, 107327. [Google Scholar] [CrossRef]
  21. Schumi, R.; Sun, J. ExAIS: Executable AI semantics. In Proceedings of the 44th International Conference on Software Engineering, Pittsburgh, PA, USA, 25–27 May 2022; ACM: New York, NY, USA, 2022; pp. 859–870. [Google Scholar] [CrossRef]
  22. Zhou, Q.; Zhou, D.; Dai, C.; Chen, J.; Guo, Z. Knowledge-driven innovation in industrial maintenance: A neural-enhanced model-based definition framework for lifecycle maintenance process information propagation. J. Manuf. Syst. 2025, 82, 976–999. [Google Scholar] [CrossRef]
  23. Xia, L.; Liang, Y.; Leng, J.; Zheng, P. Maintenance planning recommendation of complex industrial equipment based on knowledge graph and graph neural network. Reliab. Eng. Syst. Saf. 2023, 232, 109068. [Google Scholar] [CrossRef]
  24. Sahoo, P.; Singh, A.K.; Saha, S.; Jain, V.; Mondal, S.; Chadha, A. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv 2024, arXiv:2402.07927. [Google Scholar] [CrossRef]
  25. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
  26. Dong, Q.; Li, L.; Dai, D.; Zheng, C.; Ma, J.; Li, R.; Xia, H.; Xu, J.; Wu, Z.; Chang, B. A survey on in-context learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Association for Computational Linguistics: Kerrville, TX, USA, 2024; pp. 1107–1128. [Google Scholar]
  27. Yin, X.; Ni, C.; Wang, S.; Li, Z.; Zeng, L.; Yang, X. Thinkrepair: Self-directed automated program repair. In ISSTA 2024: Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, Vienna, Austria, 16–20 September 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 1274–1286. [Google Scholar]
  28. Liu, Z.; Du, X.; Liu, H. ReAPR: Automatic program repair via retrieval-augmented large language models. Softw. Qual. J. 2025, 33, 30. [Google Scholar] [CrossRef]
  29. Trotman, A.; Puurula, A.; Burgess, B. Improvements to BM25 and language models examined. In ADCS ’14: Proceedings of the 19th Australasian Document Computing Symposium; Association for Computing Machinery: New York, NY, USA, 2014; pp. 58–65. [Google Scholar]
  30. Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.S.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.t. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Association for Computational Linguistics: Kerrville, TX, USA, 2020; pp. 6769–6781. [Google Scholar]
  31. Zhang, Q.; Chen, S.; Bei, Y.; Yuan, Z.; Zhou, H.; Hong, Z.; Dong, J.; Chen, H.; Chang, Y.; Huang, X. A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models. arXiv 2025, arXiv:2501.13958. [Google Scholar]
  32. Li, Z.; Chen, X.; Yu, H.; Lin, H.; Lu, Y.; Tang, Q.; Huang, F.; Han, X.; Sun, L.; Li, Y. Structrag: Boosting knowledge intensive reasoning of llms via inference-time hybrid information structurization. arXiv 2024, arXiv:2410.08815. [Google Scholar]
  33. Edge, D.; Trinh, H.; Cheng, N.; Bradley, J.; Chao, A.; Mody, A.; Truitt, S.; Metropolitansky, D.; Ness, R.O.; Larson, J. From local to global: A graph rag approach to query-focused summarization. arXiv 2024, arXiv:2404.16130. [Google Scholar] [CrossRef]
  34. Sun, J.; Xu, C.; Tang, L.; Wang, S.; Lin, C.; Gong, Y.; Ni, L.M.; Shum, H.Y.; Guo, J. Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph. arXiv 2023, arXiv:2307.07697. [Google Scholar]
  35. Ma, S.; Xu, C.; Jiang, X.; Li, M.; Qu, H.; Guo, J. Think-on-graph 2.0: Deep and interpretable large language model reasoning with knowledge graph-guided retrieval. arXiv 2024, arXiv:2407.10805. [Google Scholar]
  36. Liu, J.; Ai, J.; Su, H.; Shi, T. Enhancing Reliability Assurance for DNN against Numerical Defect with Large Language Models. In Proceedings of the 2025 IEEE 36th International Symposium on Software Reliability Engineering (ISSRE), São Paulo, Brazil, 21–24 October 2025; IEEE Press: New York, NY, USA, 2025; pp. 300–310. [Google Scholar] [CrossRef]
  37. Fernandez, I.; Aceta, C.; Gilabert, E.; Esnaola-Gonzalez, I. FIDES: An ontology-based approach for making machine learning systems accountable. J. Web Semant. 2023, 79, 100808. [Google Scholar] [CrossRef]
  38. Duarte, B.B.; Falbo, R.A.; Guizzardi, G.; Guizzardi, R.S.; Souza, V.E. Towards an Ontology of Software Defects, Errors and Failures; Springer: Berlin/Heidelberg, Germany, 2018; pp. 349–362. [Google Scholar]
  39. Guizzardi, G. Ontological Foundations for Structural Conceptual Models. Ph.D. Thesis, University of Twente, Enschede, The Netherlands, 2005. [Google Scholar]
  40. Esnaola-Gonzalez, I.; Bermúdez, J.; Fernandez, I.; Arnaiz, A. EEPSA as a core ontology for energy efficiency and thermal comfort in buildings. Appl. Ontol. 2021, 16, 193–228. [Google Scholar] [CrossRef]
  41. Ren, S.; Guo, D.; Lu, S.; Zhou, L.; Liu, S.; Tang, D.; Sundaresan, N.; Zhou, M.; Blanco, A.; Ma, S. CodeBLEU: A Method for Automatic Evaluation of Code Synthesis. arXiv 2020, arXiv:2009.10297. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed method.
Figure 2. Conceptual semantic elements and relation schema.
Figure 3. Demonstration of semantic index construction.
Figure 4. Proportion of dataset type.
Figure 5. Generation performance comparison across different retrieval methods and generation models. (a) Generation performance considering top-1 generation. (b) Generation performance considering top-3 generation.
Figure 6. Retrieval and generation results comparing variants of NCKG.
Table 1. Semantic element definitions for DNN numerical defects.

| Semantic Element | Symbol | Description | Primary Attributes |
|---|---|---|---|
| Execution context | C | (WHEN and WHERE) denotes the architectural or algorithmic module within a DNN pipeline. | Phase type (τ), function descriptor (f), and operation descriptor (op) |
| Numerical defect | ND | (WHAT) characterizes the problem and observable manifestation of the numerical defect. | Symptom type (σ), root cause (ρ), and contextual description (χ) |
| Mitigation method | M | (HOW) categorizes the approach for repairing or mitigating the defect. | Strategy type (ψ) and method descriptor (μ) |
| Constraint | K | (WHY) documents the origin of the background knowledge. | Knowledge type (κ) and knowledge context (ξ) |
Table 2. Formal semantic relation schema R.

| Relation Name r | Source S | Target T | Semantic Description ϕ |
|---|---|---|---|
| phase_defines | C.τ | C.f | A component’s high-level type logically encompasses or is implemented by specific functions. |
| operation_implements | C.op | C.f | A mathematical operation (e.g., exp) is a constituent part or the core computation within a specific function. |
| symptom_manifests_in | ND.σ | C.f | A specific symptom is observed during the execution of a particular function. |
| symptom_indicates | ND.σ | ND.ρ | A manifested symptom implies or is directly caused by an underlying root cause (e.g., an Inf symptom caused by “division by zero”). |
| context_informs_cause | ND.χ | ND.ρ | The problem context contains information that further explains the root cause. |
| cause_suggests_method | ND.ρ | M.μ | An identified root cause dictates or strongly suggests a specific mitigation method. |
| context_suggests_method | ND.χ | M.μ | The problem context dictates or strongly suggests a specific mitigation method. |
| strategy_generalizes | M.ψ | M.μ | A high-level repair/mitigation strategy is concretely implemented by a specific method. |
| knowledge_explains_context | K.ξ | ND.χ | External knowledge (e.g., a forum post or a numerical stability principle) provides the rationale or explanation for the observed problem context. |
| knowledge_motivates_strategy | K.ξ | M.ψ | External knowledge (e.g., a mathematical principle) motivates or justifies a repair strategy. |
| operation_constrained_by | C.op | K.ξ | A particular mathematical operation is associated with specific background knowledge (e.g., “numerical stability trick: subtract max(logit)”). |
| type_of_knowledge | K.κ | K.ξ | The knowledge type categorizes the nature or provenance of the knowledge context. |
Table 3. Experimental results of retrieval effectiveness.

| Method | Exact Match@1 | Exact Match@3 | Exact Match@5 | Mean Reciprocal Rank | Overall Success Rate |
|---|---|---|---|---|---|
| BM25 | 0.3330 | 0.2727 | 0.2647 | 0.4610 | 0.6777 |
| DPR | 0.3443 | 0.3333 | 0.2990 | 0.4540 | 0.6883 |
| NCKG | **0.8710** | **0.5520** | **0.4410** | **0.8297** | **0.9460** |

Note: Bold values indicate the best performance in each column.

Share and Cite

MDPI and ACS Style

Liu, J.; Zhou, Q.; Ai, J.; Shi, T. Repairing DNN Numerical Defects with Semantic-Driven Knowledge Graph Retrieval. Appl. Sci. 2026, 16, 2124. https://doi.org/10.3390/app16042124

