3.1. Empirical Analysis of Graph and Semantic Pathologies
Before evaluating the downstream QA performance, we first analyze the structural and semantic characteristics of the automatically constructed knowledge graph. Our analysis reveals that power-system regulatory corpora exhibit severe graph sparsity and semantic redundancy, which directly challenge the assumptions underlying standard GraphRAG retrieval.
3.1.1. Graph Structural Degeneration
Table 1 summarizes the structural statistics of the generated knowledge graph. The graph contains 2556 nodes but only 272 edges, resulting in an extremely sparse topology with a density on the order of $10^{-5}$. More critically, 2246 nodes (87.87%) are completely isolated, while 96.87% of all nodes have degree less than or equal to one.
These observations indicate that the graph does not satisfy the connectivity assumptions required by neighborhood-expansion-based GraphRAG frameworks. Instead of forming semantically navigable communities, the graph degenerates into a large collection of fragmented and weakly connected components. Furthermore, the PageRank distribution is nearly uniform for 92.95% of nodes, suggesting that the graph lacks sufficiently discriminative structural hubs for reliable multi-hop reasoning.
This phenomenon quantitatively validates the existence of the Topological Drift problem discussed in Section 3, where automatically extracted entities fail to preserve the true hierarchical business logic of the power-system domain.
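As a quick sanity check on the sparsity figures above, the following minimal sketch recomputes the density and the isolated-node share from the reported counts. The function name `graph_density` is illustrative, and because the text does not state whether the graph is treated as directed or undirected, both conventions are computed; this is an assumption of the sketch rather than a description of how Table 1 was produced.

```python
def graph_density(num_nodes: int, num_edges: int, directed: bool = False) -> float:
    """Fraction of possible edges that are present in a simple graph."""
    possible = num_nodes * (num_nodes - 1)  # ordered node pairs
    if not directed:
        possible //= 2  # unordered node pairs
    return num_edges / possible if possible else 0.0


# Counts reported in the text for the generated knowledge graph.
nodes, edges, isolated = 2556, 272, 2246

undirected_density = graph_density(nodes, edges)
directed_density = graph_density(nodes, edges, directed=True)
isolated_share = isolated / nodes

print(f"undirected density: {undirected_density:.2e}")
print(f"directed density:   {directed_density:.2e}")
print(f"isolated share:     {isolated_share:.2%}")
```

Under either convention the density stays on the order of $10^{-5}$, and the isolated share reproduces the 87.87% stated above.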
3.1.2. Semantic Redundancy and Embedding Collapse
We further analyze the semantic distribution of document chunks in the embedding space. As shown in Table 2, the average pairwise cosine similarity between document embeddings reaches 0.8062, while the average nearest-neighbor similarity further increases to 0.9255.
The unusually high embedding similarity demonstrates that power-system regulations occupy an overly concentrated latent semantic space. High-frequency operational terms such as “system”, “control”, and “requirement” repeatedly appear across different technical clauses, causing semantically distinct fragments to become densely clustered in the vector space.
This embedding-space crowding substantially weakens the discriminative capability of cosine-similarity-based retrieval and quantitatively confirms the Semantic Dilution phenomenon introduced earlier. Consequently, standard vector retrieval and community summarization methods tend to retrieve broad but weakly relevant context, increasing hallucination risk during generation.
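The two crowding statistics discussed in this subsection can be computed as in the sketch below. It assumes the chunk embeddings are available as equal-length, non-zero lists of floats; the function names and the toy vectors are hypothetical and are not taken from the experimental pipeline.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_profile(embeddings):
    """Return (mean pairwise similarity, mean nearest-neighbor similarity)."""
    n = len(embeddings)
    sims = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        sims[i][j] = sims[j][i] = cosine(embeddings[i], embeddings[j])
    pair_values = [sims[i][j] for i, j in combinations(range(n), 2)]
    # For each chunk, the similarity to its closest other chunk.
    nearest = [max(sims[i][j] for j in range(n) if j != i) for i in range(n)]
    return sum(pair_values) / len(pair_values), sum(nearest) / n


# Toy embeddings (illustrative only): two identical chunks and one orthogonal chunk.
toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
mean_pairwise, mean_nearest = similarity_profile(toy)
print(mean_pairwise, mean_nearest)
```

A mean nearest-neighbor similarity close to 1, as reported in Table 2, means that almost every chunk has a near-duplicate neighbor, which is what makes cosine ranking hard to trust in this corpus.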
3.3. Main Results
The evaluation results of our proposed framework in comparison with baseline models are summarized in Table 4 and Table 5. Across all key dimensions of the RAG pipeline (retrieval accuracy, generative faithfulness, and semantic alignment), the KG-Anchored RAG exhibits a consistent performance advantage over standard GraphRAG and Hybrid-Search methods.
As observed in Table 4, the proposed method achieves a Context Precision of 0.4400 and a Faithfulness score of 0.3835, whereas GraphRAG-Local scores 0.1209 and 0.1769 on the same metrics. We attribute the lower precision of the baseline models primarily to the increased semantic entropy within the expanded 122-unit retrieval space. In power-system regulations, high-frequency generic terms create dense clusters that traditional community-based summarization fails to differentiate. While the GraphRAG baselines often retrieve broad neighborhood summaries that dilute specific technical evidence, our method uses the refined topological skeleton to navigate the Knowledge Attachment Matrix. By establishing precise coordinates between core concepts and document units, this mechanism filters out generic operational noise and anchors retrieval onto the most relevant evidentiary segments.
The improvement in generative integrity is further evidenced by the Faithfulness score, which is more than double that of GraphRAG-Local (0.3835 versus 0.1769). The gap is most pronounced in ROUGE-2 F1, where our method scores 0.0482 against 0.0031 for GraphRAG-Local. Because ROUGE-2 measures bigram overlap with the reference answer, this margin indicates that the proposed approach better preserves exact technical wording.
The impact of the retrieval budget ($K$) on ranking quality, summarized in Table 5, shows a consistent advantage for our proposed framework over the GraphRAG-Local baseline. Across all tested budgets, the native matrix-driven ranking, further refined by Personalized PageRank (PPR), is better at prioritizing relevant technical evidence. As shown in Table 5, the ranking quality of both models improves as $K$ increases from 3 to 5. This suggests that, with a pool of 122 document chunks, a very restricted budget ($K = 3$) may occasionally exclude primary evidence from the candidate set. Even at this small budget, however, our method achieves an nDCG of 0.4664, which is 43.0% higher than the 0.3261 recorded by GraphRAG-Local. This margin indicates that precision-guided anchoring identifies core technical clauses from few retrieved units more effectively than the summarization-heavy approach of the baseline. Our method reaches its highest ranking performance at $K = 5$, with an nDCG of 0.6480. Beyond this point, at the larger budget reported in Table 5, the score of both models begins to decline. This trend supports the observation that, in specialized domains, increasing the retrieval volume beyond a certain threshold admits secondary, less relevant document units whose semantic interference complicates ranking. Nevertheless, our framework maintains a steady lead at this larger budget, outperforming the baseline by 16.6%. We attribute this robustness to the structural resonance mechanism, which filters out the additional noise and keeps the most evidentiary knowledge cells at the top of the context list.
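For readers who want to reproduce the ranking comparison, the sketch below shows one common way to compute nDCG at a budget $K$ and the relative gain quoted above. It assumes graded relevance labels listed in ranked order for the whole candidate pool and uses linear gain with a logarithmic discount; the text does not specify which nDCG variant was used, so this is an illustrative choice, and the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`."""
    return (ours - baseline) / baseline


# Toy ranking (illustrative only): the single relevant chunk is ranked third.
late_hit = ndcg_at_k([0, 0, 1], k=3)

# Relative gain from the nDCG values reported in the text for K = 3.
gain_at_3 = relative_gain(0.4664, 0.3261)
print(f"{late_hit:.2f} {gain_at_3:.1%}")
```

The same `relative_gain` helper applies to any pair of scores from Table 5; only the two values quoted in the text are used here.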
Statistical Significance Analysis
To evaluate the robustness of the observed improvements, we further conducted statistical significance testing across all evaluation metrics. Each experiment was repeated multiple times under different random seeds, and paired significance tests were performed between the proposed method and competing baselines.
Table 6 reports the corresponding $p$-values against the strongest baseline methods. Across all major metrics, the proposed KG-Anchored RAG achieves statistically significant improvements, with most metrics also significant at stricter levels.
The statistical results confirm that the observed performance gains are not caused by random fluctuations, but instead arise from the consistent effectiveness of the proposed topology-guided retrieval framework.
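The text does not name the paired test that was used. As one possible instance, the sketch below implements an exact two-sided sign-flip permutation test on paired per-seed scores; the choice of test, the function name, and the toy scores are all assumptions made for illustration and are not the reported experimental values.

```python
from itertools import product


def paired_sign_flip_p_value(ours, baseline):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis the sign of each paired difference is arbitrary,
    so every sign assignment is enumerated and the share of assignments whose
    absolute summed difference is at least the observed one is returned.
    """
    diffs = [a - b for a, b in zip(ours, baseline)]
    observed = abs(sum(diffs))
    extreme = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total


# Hypothetical per-seed Faithfulness scores (NOT the paper's data).
ours_runs = [0.40, 0.38, 0.39, 0.41, 0.37]
baseline_runs = [0.18, 0.17, 0.19, 0.16, 0.18]
p_value = paired_sign_flip_p_value(ours_runs, baseline_runs)
print(p_value)
```

Exact enumeration costs $2^n$ evaluations for $n$ paired runs, so it only suits a small number of seeds. Note also that with five pairs the smallest attainable two-sided value is $2/32$, which shows that the number of repetitions bounds the significance level that can be claimed.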
3.4. Ablation Study and Hyperparameter Sensitivity Analysis
To verify the scientific necessity of each module within the KG-Anchored RAG and to evaluate its robustness under varying configurations, we conducted a series of ablation and sensitivity experiments.
The comprehensive ablation analysis presented in Table 7 confirms that each modular component of the KG-Anchored RAG framework is essential for maintaining both retrieval accuracy and generative integrity. The data indicate that the removal of any single module leads to a measurable decline across all performance indicators, with different modules serving distinct roles in the Anchoring–Navigation–Generation pipeline.
As evidenced by the results, the PageRank-based module is a key factor for generative reliability. Its removal (Variant w/o PageRank) causes the most dramatic collapse in Faithfulness, from 0.3835 to 0.1074, and yields the lowest recorded Answer Relevancy. This indicates that even when relevant documents are present in the candidate pool, the absence of the structural consensus provided by the PageRank-based algorithm leaves the LLM with conflicting or noisy signals, which triggers speculative hallucinations and logical incoherence.
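To make the role of the PageRank-based module concrete, the following is a minimal power-iteration sketch of Personalized PageRank on a small directed graph. It is a generic textbook formulation under stated assumptions (adjacency given as a dict whose keys cover every node, dangling mass returned to the restart distribution, damping `alpha = 0.85`); the node names are hypothetical, and it is not presented as the framework's actual implementation.

```python
def personalized_pagerank(adjacency, seeds, alpha=0.85, iterations=100):
    """Personalized PageRank by power iteration.

    adjacency: dict mapping each node to a list of out-neighbors
               (every neighbor must also appear as a key).
    seeds:     dict mapping query-activated nodes to positive restart weights.
    """
    nodes = list(adjacency)
    total_seed = sum(seeds.values())
    restart = {n: seeds.get(n, 0.0) / total_seed for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            out = adjacency[n]
            if out:
                share = alpha * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: send its mass back to the restart distribution.
                for m in nodes:
                    nxt[m] += alpha * rank[n] * restart[m]
        rank = nxt
    return rank


# Toy skeleton (illustrative only): a chain a -> b -> c plus an isolated node d.
toy_graph = {"a": ["b"], "b": ["c"], "c": [], "d": []}
scores = personalized_pagerank(toy_graph, seeds={"a": 1.0})
print(scores)
```

In this toy graph the isolated node `d` receives no score because it is neither seeded nor reachable, which illustrates why the query-dependent restart vector matters so much on a graph where most nodes are isolated.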
The Semantic Bridge serves as the primary anchor for retrieval specificity. Excluding this vector-space mapping (Variant w/o Semantic Bridge) leads to the lowest context precision. Without the alignment it provides, the system cannot connect abstract topics to the structured skeleton and effectively reverts to a sparse literal search that cannot handle specialized terminology or technical abbreviations. This disconnect is also reflected in the drop in BERTScore recall to 0.7279, indicating that part of the evidentiary chain is lost when the semantic bridge is broken.
The impact of noise suppression is illustrated by the Skeleton Refinement ablation. Removing the skeleton refinement results in an increase in average context length to 2427 characters. This suggests that without a refined skeleton, the navigational space is overwhelmed by generic high-frequency terms, causing the “feature drowning” effect where core technical evidence is buried under operational noise.
Finally, the Non-linear Sharpening module is shown to be vital for retrieval focus. Without the high-temperature Softmax polarization (Variant w/o Non-linear Sharpening), the system exhibits a sharp decline in context recall and faithfulness. This data validates the necessity of our approach; without it, the retrieval signals remain too diffuse to consistently prioritize the correct regulatory clauses within the Knowledge Attachment Matrix. Overall, the ablation study demonstrates that the superior performance of KG-Anchored RAG is derived from the synergistic integration of topological filtering, latent semantic mapping, and structural graph resonance.
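The effect of sharpening on closely spaced similarity scores can be illustrated with a scaled softmax. The sketch below multiplies the scores by a factor `beta` before normalization; how the framework's temperature parameter maps onto such a factor is not stated in the text, so both `beta` and the example scores are assumptions of this illustration.

```python
import math


def sharpened_softmax(scores, beta=1.0):
    """Softmax over beta * scores; a larger beta concentrates mass on the top scores."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(beta * (s - peak)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Crowded similarity scores (illustrative only), typical of a redundant corpus.
crowded = [0.80, 0.78, 0.75]
flat = sharpened_softmax(crowded, beta=1.0)
sharp = sharpened_softmax(crowded, beta=100.0)
print([round(w, 3) for w in flat])
print([round(w, 3) for w in sharp])
```

With `beta = 1.0` the three weights stay nearly uniform, while a large `beta` assigns most of the mass to the top-scoring item, which is the kind of polarization the ablation attributes to the sharpening module.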
The sensitivity of the retrieval scope is analyzed through Top-$c$ and Top-$g$ in Table 8 and Table 9. The results indicate that performance peaks at moderate retrieval volumes, following a “less-is-more” principle regarding information density. In Table 8, the selected Top-$c$ setting offers the best balance, achieving the highest Faithfulness (0.3835) and BERTScore F1 (0.6740). Similarly, Table 9 shows that the most precise signal is captured at a moderate Top-$g$ setting, which yields peak Context Precision (0.4910) and Faithfulness (0.4050).
As the retrieval budget expands to the larger Top-$c$ or Top-$g$ settings, Context Recall improves as expected, but Faithfulness declines sharply (dropping below 0.20). This collapse suggests that larger retrieval windows introduce redundant, weakly related segments that create “semantic interference.” In professional domains, these distractors overwhelm the LLM’s reasoning process, which suggests that maintaining a high signal-to-noise ratio within a compact context is more effective for high-fidelity answering than broad information gathering.
The impact of the internal mapping and activation parameters is detailed in Table 10 and Table 11. As shown in Table 10, the Similarity Threshold is optimal at 0.75, yielding the highest Faithfulness (0.3835) and Context Precision (0.4400). Lowering the threshold to 0.65 raises Context Recall to 0.3710 by allowing more permissive mapping between topics and the skeleton, but the added noise reduces Faithfulness to its minimum of 0.1035. Conversely, a threshold of 0.85 is overly restrictive, resulting in broken navigational paths and a minimal recall of 0.1000.
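The trade-off described for the Similarity Threshold can be sketched as a simple cosine cut-off between topic vectors and skeleton-concept vectors. The function name, the concept labels, and the two-dimensional toy vectors below are hypothetical, and the sketch assumes non-zero vectors; it only illustrates how a permissive threshold adds links while a strict one leaves a topic unmapped.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def map_topics_to_skeleton(topic_vecs, skeleton_vecs, threshold=0.75):
    """Link each topic to every skeleton concept whose similarity meets the threshold."""
    return {
        topic: [c for c, sv in skeleton_vecs.items() if cosine(tv, sv) >= threshold]
        for topic, tv in topic_vecs.items()
    }


# Toy vectors (illustrative only).
topics = {"grounding": [1.0, 0.2]}
skeleton = {"electrical safety": [1.0, 0.0], "dispatch": [0.6, 0.8]}

default_links = map_topics_to_skeleton(topics, skeleton, threshold=0.75)
permissive_links = map_topics_to_skeleton(topics, skeleton, threshold=0.65)
strict_links = map_topics_to_skeleton(topics, skeleton, threshold=0.99)
print(default_links, permissive_links, strict_links)
```

In the toy data the permissive threshold admits the weakly related `dispatch` concept (extra recall, extra noise), whereas the strict threshold leaves `grounding` with no skeleton link at all, mirroring the broken navigational paths noted above.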
The Sharpening Temperature analysis in Table 11 demonstrates that the selected setting provides the most effective discrimination. Under linear weighting, the system shows poor selectivity across the document space, reflected in the lowest scores for precision and faithfulness. In contrast, the selected temperature polarizes the query vector so that relevant knowledge nodes are prioritized and the context stays information-dense, which yields the highest BERTScore F1 (0.6740) and Answer Relevancy (0.6686). These results indicate that the selected hyperparameters represent a stable, high-performing operating point for the KG-Anchored RAG framework.
3.5. Case Study and Failure Analysis
To further understand the behavior of different retrieval strategies under professional QA settings, we conducted a qualitative analysis on representative failure cases.
3.5.1. Failure of Community-Based Retrieval
A representative example is the query:
“What is the definition of step voltage in electrical safety terminology?”
The correct answer explicitly requires the regulatory definition:
“the voltage between two points on the ground separated by 1 m (0.8 m in Chinese standards).”
However, both GraphRAG-Local and Hybrid retrieval fail to recover the exact clause containing the numerical constraint. Instead, they retrieve broad conceptual explanations related to grounding safety and electrical shock hazards.
This failure occurs because the relevant clause is embedded within a highly repetitive safety-regulation context. Under severe semantic redundancy, community-based retrieval tends to prioritize globally central descriptions rather than numerically precise technical definitions.
In contrast, the proposed KG-Anchored retrieval successfully activates the corresponding skeleton concepts associated with “electrical safety” and “voltage definition”, allowing the system to directly navigate toward the evidentiary clause instead of relying on neighborhood expansion.
3.5.2. Observed Limitations
Although the proposed framework substantially improves retrieval precision, several limitations remain. In terminology-definition tasks where the exact clause is absent from the retrieval pool, the model may still generate partially generalized explanations. This issue is particularly evident for highly ambiguous terms such as “real-time database” or “reclosing success”, where semantic overlap exists across multiple engineering subdomains.
These observations suggest that the effectiveness of the framework still depends on the completeness of the underlying skeleton construction and the coverage quality of the initial document chunking process.