4.1. Model Parameter Settings and Dataset
The IHGR-RAG framework achieves multi-model compatibility through a modular design: its core architecture supports the flexible combination of mainstream text embedding models with an LLM, which may be either closed-source or open-source and either fine-tuned or not. Considering the security characteristics of data in the power sector, this study built an offline deployment environment using open-source models for performance evaluation. The vector index was built on the FAISS library [44] to achieve efficient similarity retrieval. The computing environment was a 4×NVIDIA RTX 3090 (24 GB VRAM) GPU cluster running Python 3.9, and the complete chain was built by integrating the PyTorch deep learning framework with the LangChain 0.1.9 development framework to ensure the reproducibility of the experiments.
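As an illustrative sketch of this offline indexing setup (the example chunks, paths, and variable names below are hypothetical, not the exact implementation), the coded guideline items can be embedded with bge-m3 and stored in a FAISS index through LangChain:

```python
# Illustrative sketch: bge-m3 embeddings served locally through LangChain,
# with FAISS as the vector store. Example chunks and paths are hypothetical.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# One chunk per coded guideline item (hypothetical examples).
guideline_chunks = [
    "1.1.1 Transformer - winding - insulation aging - deduct 10 points",
    "1.1.2 Transformer - bushing - oil leakage - deduct 8 points",
]

embedder = HuggingFaceEmbeddings(model_name="BAAI/bge-m3")  # runs offline once downloaded
index = FAISS.from_texts(guideline_chunks, embedding=embedder)
index.save_local("guideline_faiss_index")

# Efficient similarity retrieval for a failure-feature query.
candidates = index.similarity_search("No.1 main transformer oil chromatography abnormal", k=5)
```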
To optimize resource utilization, the Qwen2.5-14B-Instruct model was employed as the baseline generative LLM in both stages of the processing pipeline, and greedy decoding was used to minimize output uncertainty. The baseline embedding model and reranking model were bge-m3 and bce-reranker-base-v1, respectively. In the initial experiments, the top-k retrieval parameters were configured such that one of them was set to seven, while the remaining three were all set to five.
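A minimal sketch of the greedy-decoding configuration, assuming the LLM is loaded with the Hugging Face transformers library (the prompt content below is a placeholder, not the actual template used in the two pipeline stages):

```python
# Sketch of greedy decoding with Qwen2.5-14B-Instruct; do_sample=False removes
# sampling randomness so repeated runs yield identical outputs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-14B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

messages = [
    {"role": "system", "content": "You assess power equipment condition against guideline items."},
    {"role": "user", "content": "<query and top-k candidate guideline chunks go here>"},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)  # greedy decoding
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```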
The experimental dataset architecture consisted of three core components: (1) a guideline knowledge base serving as the index source; (2) a collection of equipment failure feature texts serving as queries; and (3) manually annotated gold-standard labels for model performance evaluation.
Among them, the knowledge base was constructed around a coding system that classifies the guidelines, where each code number corresponds to one guideline item. The equipment failure feature descriptions typically come from defect reports, maintenance reports, test reports, and system alarm information. Their format is not fixed, but they generally include key information such as the substation name, the equipment name, and its dispatch number. We initially selected 280 transformer failure feature descriptions from daily reports and alarm records and manually annotated them to construct the Transformer Condition Assessment Dataset (TCAD), which consists of 280 input queries and their corresponding guideline items, ensuring high annotation accuracy and data diversity. In addition, classification labels were used to differentiate positive samples (label 1), which correspond to specific guideline items, from negative samples (label 2), which lack such a correspondence. Given the limited total sample size, we fixed the positive-to-negative sample ratio so as to ensure an adequate representation of negative instances.
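For illustration, one annotated TCAD record might be organized as follows (the field names and values are hypothetical; label 1 indicates that a matching guideline item exists, label 2 that none does):

```python
# Hypothetical structure of a single annotated TCAD record.
tcad_record = {
    "query": "No.2 main transformer at XX substation: winding hot-spot temperature alarm",
    "label": 1,                          # 1 = positive sample, 2 = negative sample
    "guideline_code": "2.3.1",           # code number of the unique correct guideline item
    "deduction_content": "Winding temperature exceeds the alarm threshold",
    "deduction_value": 10,
}
```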
To enhance the statistical significance of the experiments, we selected four types of equipment, namely transformers, SF6 circuit breakers, Gas-Insulated Switchgear (GIS), and Capacitive Voltage Transformers (CVT), to construct the Power Equipment Condition Assessment Dataset (PECAD). This dataset comprises 1514 input queries and their corresponding guideline items, with the sample proportions for each equipment type detailed in Table 1. The dataset was constructed using LLM-based data augmentation, grounded in the established guidelines and a small set of real-world daily reports and alarm records, to generate semantically similar input queries. Half of the dataset was randomly selected for manual verification, which confirmed the consistency between the automatically generated query–guideline pairs and the manually annotated references. The dataset encompasses a total of 542 distinct guideline items across the four equipment types. To evaluate the proposed method's sensitivity to guideline formatting, we modified a subset of SF6 circuit breaker guideline items by removing the Functional Element or Failure Mode information, in alignment with the actual operational characteristics of the corresponding guideline items; for example, "SF6 circuit breaker—operating mechanism—opening and closing circuit/(failure mode absent)—opening and closing coil burned out". Given the sufficient size of the PECAD, the positive-to-negative sample ratio was set to ensure a balanced representation of both classes. An illustration of the model input and output is provided in Appendix A, as shown in Figure A3.
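The augmentation step can be pictured with the following sketch, in which an LLM rewrites a seed failure description associated with a guideline item into several semantically similar queries; the prompt wording and function name are assumptions rather than the exact prompt used:

```python
# Hedged sketch of the LLM-based data augmentation used to expand PECAD.
def build_augmentation_prompt(guideline_item: str, seed_report: str, n: int = 3) -> str:
    """Ask the generative LLM for n paraphrased failure descriptions tied to one guideline item."""
    return (
        f"Guideline item: {guideline_item}\n"
        f"Real failure description: {seed_report}\n"
        f"Generate {n} new failure descriptions that correspond to the same guideline item, "
        f"keeping the substation name, equipment name, and dispatch number style, "
        f"but varying the wording and the observed symptoms."
    )

prompt = build_augmentation_prompt(
    "SF6 circuit breaker - operating mechanism - low SF6 pressure alarm",
    "110 kV breaker 512 at XX substation reported a low SF6 gas pressure alarm.",
)
# The prompt is sent to the generative LLM; half of the generated pairs were manually verified.
```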
4.2. Evaluation Benchmark
To accurately assess whether the model could provide the unique correct guideline item associated with the query, we improved the RAG end-to-end evaluation benchmark and proposed an independent evaluation benchmark specifically for the dynamic assessment of power equipment condition. This benchmark comprehensively evaluates the model’s performance by separately assessing the retrieval stage and the generation stage. It employs an exact match mechanism based on guideline code numbers. A correct match is determined only when the code number, derived from the output deduction content, exactly matches the guideline code number.
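The exact-match check can be illustrated as follows; the regular expression and the assumption that the code number can be parsed from the generated deduction content are simplifications for illustration only:

```python
import re

# Illustrative exact-match mechanism based on guideline code numbers: a response is
# counted as correct only if the code number extracted from the generated deduction
# content equals the annotated guideline code number.
CODE_PATTERN = re.compile(r"\b\d+(?:\.\d+)+\b")  # e.g., "2.3.1"

def is_correct_match(generated_text: str, gold_code: str) -> bool:
    codes = CODE_PATTERN.findall(generated_text)
    return bool(codes) and codes[0] == gold_code

print(is_correct_match("Matched guideline 2.3.1: winding temperature alarm, deduct 10 points", "2.3.1"))  # True
```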
When evaluating the performance of the model retrieval stage, we selected three indicators: the top-k hit rate (HR@k), the mean reciprocal rank (MRR), and the normalized discounted cumulative gain (NDCG). The metric conventionally used to evaluate retrieval performance assumes that, for each query, there may be multiple positive examples among the top-k retrieval results; for instance, if three of the top-five retrieval results are positive examples, the metric is calculated as 0.6. However, our task requires a single exact retrieval result conforming to the annotated guideline, so this conventional metric is not applicable to the current scenario. In contrast, the HR@k metric we propose is specifically designed to assess whether the retrieval results for each query contain the unique correct match to the annotated guideline. Its calculation formula is as follows:
$$\mathrm{HR}@k=\frac{N_{\mathrm{hit}}}{N_{\mathrm{pos}}},$$
where $N_{\mathrm{pos}}$ represents the total number of queries in the dataset with a classification label of one, which equals the number of positive samples, and $N_{\mathrm{hit}}$ indicates the number of these queries for which the top-k retrieval results after reranking contain the unique correct guideline, regardless of the specific position of this guideline within the top-k results.
The MRR metric measures the ranking of the correct chunk among the top-k retrieval results of each query. Its calculation formula is as follows:
$$\mathrm{MRR}=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{\mathrm{rank}_i},$$
where $N$ represents the total number of queries and $\mathrm{rank}_i$ indicates the rank position of the first correct retrieval result for the $i$-th query.
The NDCG metric measures the ranking position of the correct retrieval result among all retrieval results for each query. Since only matching the unique correct guideline is of practical significance in our evaluation task, the setting logic of the relevance score differs significantly from that of the traditional NDCG. The traditional formulation can be expressed as:
$$\mathrm{DCG}@k=\sum_{i=1}^{k}\frac{2^{rel_i}-1}{\log_2(i+1)},\qquad \mathrm{IDCG}@k=\sum_{i=1}^{|\mathrm{REL}_k|}\frac{2^{rel_i}-1}{\log_2(i+1)},\qquad \mathrm{NDCG}@k=\frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k},$$
where $rel_i$ represents the relevance score of the $i$-th retrieval result, usually an integer ranging from zero to nine, and $|\mathrm{REL}_k|$ denotes the number of results obtained when the top-k retrieval results are sorted in descending order of their true relevance. In our benchmark, if the $i$-th retrieval result of the $q$-th query matches the labeled guideline, its relevance score is set to one, and the relevance scores of all other retrieval results are set to zero:
$$rel_i=\begin{cases}1, & \text{if the } i\text{-th result matches the labeled guideline,}\\ 0, & \text{otherwise.}\end{cases}$$
Since at most one retrieval result is relevant, $\mathrm{IDCG}@k=1$, and DCG@k and NDCG@k reduce to
$$\mathrm{DCG}@k=\mathrm{NDCG}@k=\frac{1}{\log_2\!\left(r_q+1\right)},$$
where $r_q$ denotes the position of the correct retrieval result within the top-k results of the $q$-th query. When the chunk that matches the correct guideline is retrieved but does not appear in the top-k retrieval results ($r_q>k$), the value of NDCG@k is zero. We report the average NDCG@k over all positive samples; the larger its value, the higher the ranking of the unique correct retrieval result for most positive input samples. The averaged metric can be expressed as:
$$\overline{\mathrm{NDCG}}@k=\frac{1}{N_{\mathrm{pos}}}\sum_{q=1}^{N_{\mathrm{pos}}}\mathrm{NDCG}@k_{(q)},$$
where $N_{\mathrm{pos}}$ indicates the total number of positive samples.
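Under the binary-relevance setting above, the three retrieval metrics can be computed as in the following sketch (the data structures and names are illustrative; misses outside the top-k contribute zero to MRR and NDCG):

```python
import math

# Illustrative computation of HR@k, MRR, and averaged NDCG@k for positive samples.
# `results` maps a query id to its reranked list of guideline codes; `gold` maps the
# query id to the unique correct code.
def retrieval_metrics(results: dict, gold: dict, k: int = 5) -> dict:
    hits, rr_sum, ndcg_sum = 0, 0.0, 0.0
    for qid, ranked_codes in results.items():
        top_k = ranked_codes[:k]
        if gold[qid] in top_k:
            rank = top_k.index(gold[qid]) + 1         # 1-based position of the correct chunk
            hits += 1
            rr_sum += 1.0 / rank
            ndcg_sum += 1.0 / math.log2(rank + 1)     # IDCG = 1 when only one result is relevant
    n = len(results)
    return {"HR@k": hits / n, "MRR": rr_sum / n, "NDCG@k": ndcg_sum / n}

print(retrieval_metrics({"q1": ["2.3.1", "1.1.2"], "q2": ["9.9.9", "4.4.4"]},
                        {"q1": "2.3.1", "q2": "4.4.4"}, k=5))
```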
On the other hand, for the text generation stage, we propose adopting Negative Rejection (NR), Generated Content Accuracy (GCA), and Generated Scores' Accuracy (GSA) as the evaluation metrics.
As mentioned earlier, the LLM may generate either the complete content of a guideline or the negative response "Please transfer to manual processing". For query samples classified as label 2, which have been manually labeled as having no correct associated guideline, we expect the LLM to output the negative response. NR evaluates the LLM's ability to "correctly reject", and its formula is as follows:
$$\mathrm{NR}=\frac{N_{\mathrm{rej}}}{N_{\mathrm{neg}}},$$
where $N_{\mathrm{neg}}$ denotes the number of queries in the dataset with a classification label of 2, which equals the total number of negative samples, and $N_{\mathrm{rej}}$ indicates the number of these queries for which the LLM generates the correct rejection response.
The content generated by the LLM may not be entirely accurate. GCA assesses the exact-match correctness between the output deduction content and the annotated deduction content in the dataset. Its calculation formula is as follows:
$$\mathrm{GCA}=\frac{N_{\mathrm{con}}}{N_{\mathrm{pos}}+N_{\mathrm{neg}}},$$
where $N_{\mathrm{con}}$ represents the number of queries for which the LLM correctly generates the deduction content.
In addition, some operation and maintenance personnel may only focus on the deduction values in the equipment condition assessment and use them as the basis for determining the current condition level of the equipment, without necessarily paying attention to the specific deduction content. In such cases, even if the deduction content generated by the LLM is incorrect, the deduction value it outputs may still be consistent with the correct deduction value associated with the guideline. Therefore, GSA is commonly higher than GCA. The formula for GSA is as follows:
$$\mathrm{GSA}=\frac{N_{\mathrm{val}}^{+}+N_{\mathrm{val}}^{-}}{N_{\mathrm{pos}}+N_{\mathrm{neg}}},$$
where $N_{\mathrm{val}}^{+}$ represents the number of positive-sample queries for which the LLM correctly generates the deduction value, and $N_{\mathrm{val}}^{-}$ represents the number of negative-sample queries for which the LLM correctly generates a zero deduction value, which is numerically consistent with $N_{\mathrm{rej}}$.
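The three generation-stage metrics can then be computed as in the sketch below, under the assumption (consistent with the definitions above) that, for a negative sample, a correct rejection simultaneously counts as correct deduction content and as a zero deduction value; the record fields are illustrative:

```python
# Illustrative computation of NR, GCA, and GSA from per-query records.
REJECT = "Please transfer to manual processing"

def generation_metrics(records: list) -> dict:
    pos = [r for r in records if r["label"] == 1]
    neg = [r for r in records if r["label"] == 2]
    n_rej = sum(1 for r in neg if r["generated_content"].strip() == REJECT)
    n_con = sum(1 for r in pos if r["generated_content"] == r["gold_content"]) + n_rej
    n_val = sum(1 for r in pos if r["generated_value"] == r["gold_value"]) + n_rej
    total = len(pos) + len(neg)
    return {"NR": n_rej / len(neg), "GCA": n_con / total, "GSA": n_val / total}
```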
4.3. Experimental Results
We first performed comparative experiments to validate the necessity of the IHGR-RAG model and to assess its potential performance benefits. The experiments evaluated three distinct retrieval strategies: (1) the individual global retrieval approach (Global-RAG), which is functionally equivalent to conventional RAG; (2) the individual hierarchical retrieval approach (Hierarchical-RAG); and (3) the integrated hierarchical and global retrieval approach (IHGR-RAG). To ensure experimental fairness, the text generation and text alignment stages remained identical across all three strategies; the sole variable was the methodology employed in the retrieval stage. Furthermore, in each strategy, the retrieved text chunks were reranked before being passed to the generation stage.
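A heavily simplified sketch of the three compared configurations is given below; it assumes the hybrid strategy simply unions the global and hierarchical candidate pools before the shared reranking step, which is an illustration rather than the exact IHGR-RAG merging logic:

```python
# Simplified sketch of the three retrieval strategies; the retrievers, reranker, and
# the union-based merge for the hybrid case are illustrative assumptions.
def retrieve(query, strategy, global_retriever, hierarchical_retriever, reranker, k=5):
    if strategy == "global":
        pool = global_retriever(query)
    elif strategy == "hierarchical":
        pool = hierarchical_retriever(query)
    else:  # hybrid: merge both candidate pools, removing duplicates while keeping order
        pool = list(dict.fromkeys(global_retriever(query) + hierarchical_retriever(query)))
    # The same reranker scores each candidate chunk against the query in all strategies.
    return sorted(pool, key=lambda chunk: reranker(query, chunk), reverse=True)[:k]
```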
The results of the retrieval-stage experiments on the TCAD dataset are presented in Table 2. The IHGR-RAG model proposed in this paper demonstrated superior performance compared with the individual global and hierarchical retrieval models across the key indicators of the retrieval stage. A larger pool of candidate chunks increases the likelihood of including the uniquely correct guideline item. Although IHGR-RAG did not achieve the highest performance on every metric at every cutoff, we passed five candidate chunks from the retrieval stage to the generation stage; therefore, our primary focus was on evaluating these metrics at k = 5. Specifically, the HR@5 of IHGR-RAG was 0.9612, indicating that 96.12% of the positive query samples had the correct guideline among the top-five candidate guidelines, an improvement over both single-mechanism retrieval models. When considering ranking quality, as reflected by MRR and NDCG, IHGR-RAG performed slightly better than the other two models. The reason why IHGR-RAG outperformed the global approach is that it incorporates the hierarchical mechanism's candidate chunks, thereby avoiding overfitting of the global mechanism and enhancing result diversity. This diversity proves to be highly beneficial for improving the metrics at the generation stage.
At the generation stage, the GCA value of IHGR-RAG was 0.8276, indicating an 82.76% probability of accurately matching the sole correct guideline item across all positive and negative samples, as shown in Table 3. This probability was higher than that of both Global-RAG and Hierarchical-RAG. For GSA, which only considers the deduction score, the value for IHGR-RAG was likewise higher than that of the other two models. In terms of NR, Hierarchical-RAG significantly outperformed both Global-RAG and IHGR-RAG, indicating that the hierarchical retrieval mechanism is more effective at identifying results that should be rejected, without erroneously matching guidelines or imposing deductions on values that should not be deducted. The candidate chunks input to the generation stage are reranked by integrating the results of both the hierarchical and global retrieval mechanisms, ensuring that the most relevant candidates are selected. Because the reranking principle compares each candidate chunk with the input query, the overall similarity between the candidate chunks and the query is higher. Although this increases the likelihood of misclassifying negative samples during the generation stage, thereby compromising NR performance, this strategy significantly benefits the ultimate objective of accurately identifying the unique correct guideline item.
Second, the comparative experimental results of the three retrieval mechanisms on the PECAD dataset are presented in Table 4 and Table 5. The total execution times of the global, hierarchical, and hybrid retrieval methods were 30,788 s, 29,954 s, and 31,088 s, respectively, and the average inference latency was approximately 20 s, with no notable differences among the methods. In general, the global retrieval mechanism achieved superior results on the retrieval-stage metrics but performed suboptimally at the generation stage. Since the retrieval-stage metrics do not account for negative samples, although the global retrieval mechanism slightly outperformed the hybrid retrieval mechanism on these metrics for positive samples, it was generally inferior to the hybrid mechanism once negative samples were considered in the generation-stage metrics.
The hierarchical retrieval mechanism demonstrated the poorest performance at the retrieval stage on this dataset, although it still outperformed the other two mechanisms on one of the evaluation metrics. The main reason for its poor retrieval performance lies in the similarity between the component information described in the GIS guideline and the names of various power equipment types. For instance, GIS components include circuit breakers, disconnectors, grounding disconnectors, current transformers, voltage transformers, and surge arresters, and these components were easily matched with the guidelines of other equipment types during hierarchical retrieval. Nevertheless, the hierarchical retrieval mechanism still exhibited strong performance for the other equipment guidelines, particularly when functional elements or failure modes were missing in certain items of the SF6 circuit breaker guideline.
In the face of complex variations in the guidelines, the hybrid retrieval mechanism remained the most effective at matching the unique correct guideline, which contributed to improved performance at the generation stage: it improved the generation-stage accuracy metrics relative to both the global and hierarchical mechanisms. Regarding the ranking-oriented metrics, the hybrid retrieval mechanism benefited from the hierarchical mechanism's contribution to the diversity of candidate texts and ultimately outperformed the global retrieval mechanism on the PECAD dataset.
Third, we examined the impact of the number of candidate chunks on the objective of matching the sole correct guideline. On the TCAD dataset, we varied two of the top-k retrieval parameters over combinations of values ranging from four to six and analyzed the performance of the IHGR-RAG model. As can be seen from Figure 3, when both values were close to four, the performance was superior to that obtained in other value ranges. In particular, the model achieved better performance when the number of global retrieval candidates was smaller, which indicates that a larger number of candidate global retrieval chunks may lead to greater confusion for the model.
Fourth, we investigated the impact of different LLMs on the retrieval and generation outcomes of the IHGR-RAG framework on the TCAD dataset. As demonstrated in Figure 4, the IHGR-RAG model that employed Qwen2.5-14B-Instruct as its LLM marginally outperformed its counterparts employing the other two LLMs across the three retrieval metrics, and it significantly outperformed the other two configurations in terms of both GCA and GSA. In contrast, the model employing Baichuan2-13B-Chat as the LLM achieved the best NR performance but exhibited the poorest GCA and GSA, a discrepancy we attribute to the more cautious judgment strategy adopted by that LLM. In addition, based on the comparative efficiency data presented in Table 6, Qwen2.5-14B-Instruct was selected primarily because of its optimal balance between inference latency and model capability.
Finally, we designed ablation experiments on the TCAD dataset to verify whether each of the proposed improvements had a positive effect on model performance. Each configuration was run three times, and the results passed Student's t-test (p < 0.05); owing to the use of greedy decoding in the LLM, the output remained consistent across all inference rounds. As shown in Table 7, the IHGR-RAG model slightly outperformed the other models on two of the retrieval metrics, and its MRR was also above average. We observed that the reranking step significantly enhanced the model's performance. For the more critical GCA and GSA metrics in Table 8, our proposed model outperformed all ablation variants, achieving consistent gains on both metrics and thereby validating the effectiveness of the proposed improvements.