4.1. Model Parameter Settings and Dataset
The IHGR-RAG framework achieves multi-model compatibility through a modular design, and its core architecture supports the flexible combination of mainstream text embedding models and an LLM, where the LLM can be either closed-source or open-source, whether or not it has been fine-tuned. Considering the security characteristics of data in the power sector, this study built an offline deployment environment using open-source models for performance evaluation. The vector index was built on the FAISS database [44] to achieve efficient similarity retrieval. The computing environment was deployed on a 4×NVIDIA 3090 (24 GB VRAM) GPU cluster with Python 3.9 as the programming environment, and the complete chain was built by integrating the PyTorch deep learning platform and the LangChain 0.1.9 development framework to ensure the reproducibility of the experiments.
To optimize resource utilization, the Qwen2.5-14B-Instruct model was employed as the baseline generative LLM in both stages of the processing pipeline. Greedy decoding was utilized to minimize output uncertainty. Specifically, the baseline embedding model and reranking model were designated as bge-m3 and bce-reranker-base-v1, respectively. In the initial experiments, the top-k retrieval parameters were configured as follows: top  was set to seven, while the other parameters, including top , top , and top , were all set to five.
The experimental dataset consisted of three core components: (1) a guideline knowledge base serving as the index source; (2) a collection of equipment failure feature texts serving as queries; and (3) manually annotated gold-standard labels for model performance evaluation.
Among them, the construction of the knowledge base was based on a coding system to classify the guidelines, where each code number corresponded to a guideline item. The text of equipment failure feature descriptions usually comes from defect reports, maintenance reports, and test reports, as well as system alarm information. The content format is not fixed, but generally includes key information such as the substation name, equipment name and its dispatch number. We initially selected 280 transformer failure feature descriptions from daily reports and alarm records and manually annotated them to construct the Transformer Condition Assessment Dataset (TCAD), which consisted of 280 input queries and their corresponding guideline items, ensuring high annotation accuracy and data diversity. In addition, classification labels were utilized to differentiate positive samples (label 1) that corresponded to specific guideline items from negative samples (label 2) that lacked such a correspondence. Given the limited total sample size, we set the positive-to-negative sample ratio at : to ensure an adequate representation of negative instances.
To enhance the statistical significance of the experiment, we selected four types of equipment, including transformers, SF6 circuit breakers, Gas Insulated Switchgear (GIS), and Capacitive Voltage Transformers (CVT), to construct the Power Equipment Condition Assessment Dataset (PECAD). This dataset comprised 1514 input queries and their corresponding guideline items, with the sample proportions for each equipment type detailed in 
Table 1. The dataset was constructed using LLM-based data augmentation, grounded in established guidelines and a small set of real-world daily reports and alarm records, to generate semantically similar input queries. Half of the dataset was randomly selected for manual verification, revealing a 
 consistency rate between the automatically generated query–guideline pairs and the manually annotated references. The dataset encompassed a total of 542 distinct guideline items across all four equipment types. To illustrate the proposed method’s sensitivity to guideline formatting, we modified a subset of SF6 circuit breaker guideline items by removing the Functional Elements or Failure Modes information, in alignment with the actual operational characteristics of the corresponding guideline items. For example, SF6 circuit breaker—operating mechanism—opening and closing circuit/(The failure mode is absent here)—opening and closing coil burned out. Given the sufficient size of the PECAD, the positive-to-negative sample ratio was set at 
:
 to ensure balanced representation of both classes. An illustration of the model input and output is provided in Figure A3 of Appendix A.
4.2. Evaluation Benchmark
To accurately assess whether the model could provide the unique correct guideline item associated with the query, we improved the RAG end-to-end evaluation benchmark and proposed an independent evaluation benchmark specifically for the dynamic assessment of power equipment condition. This benchmark comprehensively evaluates the model’s performance by separately assessing the retrieval stage and the generation stage. It employs an exact match mechanism based on guideline code numbers. A correct match is determined only when the code number, derived from the output deduction content, exactly matches the guideline code number.
When evaluating the performance of the retrieval stage, we selected three variables, $\mathrm{Hit}@k$, $\mathrm{MRR}@k$, and $\mathrm{NDCG}@k$, as the main evaluation indicators. Generally, the variable $\mathrm{Precision}@k$ used to evaluate the retrieval performance of a model assumes that, for each query, there may be multiple positive examples among the top-k retrieval results. For instance, if there are three positive examples among the top-five retrieval results, $\mathrm{Precision}@5$ is 0.6. However, we required a single exact retrieval result conforming to the annotated guideline. Therefore, the traditional $\mathrm{Precision}@k$ was not applicable to the current scenario. In contrast, the variable $\mathrm{Hit}@k$ we propose was specifically designed to assess whether the retrieval results for each query contained the unique correct match to the annotated guideline. Its calculation formula is as follows:
$$\mathrm{Hit}@k = \frac{N_{hit}}{N_{pos}}$$
where $N_{pos}$ represents the total number of queries in the dataset with a classification label of one, that is, the number of positive samples, and $N_{hit}$ indicates the number of those queries for which the top-k retrieval results after reranking contain the unique correct guideline, regardless of its position within the top-k results.
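The unique-match criterion described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' evaluation script: the function name `unique_hit_rate`, the dictionary layout, and the guideline codes such as `T-11` are all hypothetical.

```python
from typing import Dict, List


def unique_hit_rate(retrieved: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of positive queries whose top-k reranked codes contain the gold code."""
    if not gold:
        raise ValueError("need at least one positive query")
    hits = sum(1 for q, code in gold.items() if code in retrieved.get(q, [])[:k])
    return hits / len(gold)


# Hypothetical reranked guideline codes per query (best first).
retrieved = {
    "q1": ["T-03", "T-11", "T-07"],
    "q2": ["T-20", "T-05", "T-02"],
    "q3": ["T-09", "T-01", "T-04"],
}
# Exactly one annotated guideline code per positive query.
gold = {"q1": "T-11", "q2": "T-02", "q3": "T-30"}

print(unique_hit_rate(retrieved, gold, k=2))  # only q1 is a hit at k = 2
print(unique_hit_rate(retrieved, gold, k=3))  # q1 and q2 are hits at k = 3
```

Because each query has a single gold code, the position of that code inside the top-k list does not affect the result; only its presence does.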
The variable $\mathrm{MRR}@k$ was used to measure the ranking of the correct chunk among the top-k retrieval results for a query. The formula for its calculation is as follows:
$$\mathrm{MRR}@k = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank_i}$$
where $N$ represents the total number of queries, and $rank_i$ indicates the rank position of the first correct retrieval result for the i-th query.
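As a small worked illustration of the reciprocal-rank measure, the sketch below averages 1/rank over a toy set of queries. One convention here is an assumption not stated in the text: a query whose correct chunk is absent from the top-k list contributes zero. The function names and the sample codes are hypothetical.

```python
from typing import List, Tuple


def reciprocal_rank(codes: List[str], gold: str, k: int) -> float:
    """1/rank of the gold code within the top-k list; 0.0 if absent (assumed convention)."""
    for pos, code in enumerate(codes[:k], start=1):
        if code == gold:
            return 1.0 / pos
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[List[str], str]], k: int) -> float:
    """Average reciprocal rank over (ranked codes, gold code) pairs."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(codes, gold, k) for codes, gold in runs) / len(runs)


runs = [
    (["a", "b", "c"], "a"),  # rank 1 -> 1.0
    (["a", "b", "c"], "b"),  # rank 2 -> 0.5
    (["a", "b", "c"], "z"),  # not retrieved -> 0.0
]
print(mean_reciprocal_rank(runs, k=3))
```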
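The variable $\mathrm{NDCG}@k$ was used to measure the ranking position of the correct retrieval result among all the retrieval results for each query. Since only matching the unique correct guideline was of practical significance in this evaluation task, the setting logic of the relevance score $rel_i$ differed significantly from that of the traditional $\mathrm{NDCG}@k$ variable. The traditional $\mathrm{NDCG}@k$ can be expressed as:
$$\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad \mathrm{IDCG}@k = \sum_{i=1}^{|REL_k|} \frac{rel_i}{\log_2(i+1)}, \qquad \mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}$$
where $rel_i$ represents the relevance score of the i-th retrieval result, usually an integer ranging from zero to nine, and $REL_k$ denotes the top-k retrieval results sorted in descending order of their true relevance, so that the $\mathrm{IDCG}@k$ sum runs over this ideal ordering. Specifically, in our dataset, if the i-th retrieval result of the q-th query matches the labeled guideline, the relevance score of this retrieval result is set to one; otherwise, it is set to zero. The formula for $rel_{q,i}$ in our benchmark can be expressed as:
$$rel_{q,i} = \begin{cases} 1, & \text{if the } i\text{-th retrieval result of the } q\text{-th query matches the labeled guideline} \\ 0, & \text{otherwise} \end{cases}$$
Since at most one retrieval result per query satisfies $rel_{q,i} = 1$, the formulas of $\mathrm{DCG}_q@k$ and $\mathrm{IDCG}_q@k$ can be expressed as:
$$\mathrm{DCG}_q@k = \begin{cases} \dfrac{1}{\log_2(rank_q + 1)}, & rank_q \le k \\ 0, & rank_q > k \end{cases}, \qquad \mathrm{IDCG}_q@k = 1$$
where $rank_q$ is the rank position of the matching retrieval result for the q-th query. When the chunk that matches the correct guideline is retrieved but does not appear in the top-k retrieval results ($rank_q > k$), the value of $\mathrm{NDCG}_q@k$ is zero. $\mathrm{NDCG}@k$ represents the average $\mathrm{NDCG}_q@k$ over all positive samples. The larger its value, the higher the unique correct retrieval result is ranked for most positive input samples. The formula for $\mathrm{NDCG}@k$ can be expressed as:
$$\mathrm{NDCG}@k = \frac{1}{N_{pos}} \sum_{q=1}^{N_{pos}} \mathrm{NDCG}_q@k$$
where $N_{pos}$ indicates the total number of positive samples.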
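With a single relevant item per query and binary relevance, the normalized gain of a query reduces to one logarithmic discount term, since the ideal ordering places the only relevant item at rank one. The sketch below illustrates that simplification under the standard log2 discount; the function names and sample data are hypothetical, not taken from the authors' code.

```python
import math
from typing import List, Tuple


def binary_ndcg(codes: List[str], gold: str, k: int) -> float:
    """Normalized gain for one query with a single relevant item.

    The ideal gain is 1 (gold item at rank 1), so the score is
    1 / log2(rank + 1) when the gold code is inside the top-k, else 0.
    """
    for pos, code in enumerate(codes[:k], start=1):
        if code == gold:
            return 1.0 / math.log2(pos + 1)
    return 0.0


def mean_binary_ndcg(runs: List[Tuple[List[str], str]], k: int) -> float:
    """Average of the per-query scores over all positive samples."""
    if not runs:
        raise ValueError("need at least one positive sample")
    return sum(binary_ndcg(codes, gold, k) for codes, gold in runs) / len(runs)


positives = [
    (["a", "b", "c"], "a"),  # rank 1 -> 1 / log2(2) = 1.0
    (["a", "b", "c"], "c"),  # rank 3 -> 1 / log2(4) = 0.5
    (["a", "b", "c"], "z"),  # outside the top-k -> 0.0
]
print(mean_binary_ndcg(positives, k=3))
```

Compared with the reciprocal rank, the logarithmic discount penalizes lower positions more gently: rank 3 scores 0.5 here but 1/3 under reciprocal rank.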
In the text generation stage, we propose adopting Negative Rejection (NR), Generated Content Accuracy (GCA), and Generated Scores’ Accuracy (GSA) as the evaluation metrics, which are denoted by the variables $NR$, $GCA$, and $GSA$, respectively.
As mentioned earlier, an LLM may generate the complete content of a guideline or a negative response of “Please transfer to manual processing”. For query samples classified as label 2, since they have been manually labeled as having no correct associated guideline, we expect the LLM to output a negative response. NR was employed to evaluate the LLM’s ability to “correctly reject”, and its formula is as follows:
$$NR = \frac{N_{rej}}{N_{neg}}$$
where $N_{neg}$ denotes the number of queries in the dataset with a classification label of 2, that is, the total number of negative samples, and $N_{rej}$ indicates the number of those queries for which the LLM generates the negative response, i.e., correctly rejects the query.
The content generated by the LLM may not be entirely accurate. $GCA$ was utilized to assess the correctness of exact matches between the output deduction content and the annotated deduction content in the dataset. Its calculation formula is as follows:
$$GCA = \frac{N_{gc}}{N_{pos} + N_{neg}}$$
where $N_{gc}$ represents the number of queries for which the LLM correctly generates the deduction content, and $N_{pos}$ and $N_{neg}$ are the numbers of positive and negative samples, respectively.
In addition, some operation and maintenance personnel may focus only on the deduction values in the equipment condition assessment and use them as the basis for determining the current condition level of the equipment, without necessarily paying attention to the specific deduction content. In such cases, even if the deduction content generated by the LLM is incorrect, the deduction values it outputs may still be consistent with the correct deduction values associated with the guideline. Therefore, it is common for $GSA$ to be higher than $GCA$. The formula for $GSA$ is as follows:
$$GSA = \frac{N_{gs} + N_{zero}}{N_{pos} + N_{neg}}$$
where $N_{gs}$ represents the number of queries for which the LLM correctly generates the deduction value among positive samples, $N_{zero}$ represents the number of queries for which the LLM correctly generates the zero deduction value among negative samples, which is numerically consistent with the number of correctly rejected queries $N_{rej}$, and $N_{pos}$ and $N_{neg}$ are the numbers of positive and negative samples, respectively.
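The three generation-stage metrics can be illustrated together on a toy set of samples. The sketch below rests on assumptions that should be read as such: the `Sample` fields are hypothetical, a rejection is represented by a missing predicted code, a correctly rejected negative counts as correct content, and both accuracies are normalized by the total number of samples.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Sample:
    label: int                # 1 = positive, 2 = negative (as in the dataset)
    gold_code: Optional[str]  # annotated guideline code; None for negatives
    gold_score: int           # annotated deduction value; 0 for negatives
    pred_code: Optional[str]  # code parsed from the LLM output; None if it rejected
    pred_score: int           # deduction value parsed from the LLM output


def generation_metrics(samples: List[Sample]) -> Dict[str, float]:
    """Compute NR, GCA, and GSA under the assumptions stated above."""
    pos = [s for s in samples if s.label == 1]
    neg = [s for s in samples if s.label == 2]
    if not pos or not neg:
        raise ValueError("need both positive and negative samples")
    rejected = sum(1 for s in neg if s.pred_code is None)
    content_ok = sum(1 for s in samples if s.pred_code == s.gold_code)
    score_ok = sum(1 for s in samples if s.pred_score == s.gold_score)
    total = len(samples)
    return {"NR": rejected / len(neg), "GCA": content_ok / total, "GSA": score_ok / total}


samples = [
    Sample(1, "T-01", 4, "T-01", 4),   # content and score both correct
    Sample(1, "T-02", 8, "T-05", 8),   # wrong content, but the score happens to match
    Sample(1, "T-03", 12, None, 0),    # positive sample wrongly rejected
    Sample(2, None, 0, None, 0),       # negative sample correctly rejected
    Sample(2, None, 0, "T-09", 2),     # negative sample wrongly matched
]
print(generation_metrics(samples))
```

The second sample shows why the score accuracy can exceed the content accuracy: the wrong guideline was chosen, yet its deduction value coincides with the annotated one.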
4.3. Experimental Results
We first performed comparative experiments to validate the necessity of the IHGR-RAG model and to assess its potential performance benefits. The experiments specifically evaluated three distinct retrieval strategies: (1) individual global retrieval approach (Global-RAG), which is functionally equivalent to the conventional RAG; (2) individual hierarchical retrieval approach (Hierarchical-RAG); and (3) integrated hierarchical and global retrieval approach (IHGR-RAG). To ensure experimental fairness, the text generation and text alignment stages remained identical across all three strategies. The sole variable was the specific methodology employed in the retrieval stage. Furthermore, the retrieval process for each strategy utilized reranked text chunks as inputs.
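To make the difference between the three strategies concrete, the sketch below merges hierarchical and global candidates, removes duplicates, and reranks the union against the query. It is only a toy model of the integrated strategy: the token-overlap scorer stands in for a real reranking model, and the function names, example chunks, and cut-off value are hypothetical.

```python
from typing import Callable, List


def token_overlap(query: str, chunk: str) -> float:
    """Toy relevance score (Jaccard overlap of lowercase tokens); a stand-in for a reranker."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    union = q | c
    return len(q & c) / len(union) if union else 0.0


def integrate_and_rerank(
    query: str,
    hierarchical: List[str],
    global_: List[str],
    score: Callable[[str, str], float],
    top_n: int = 5,
) -> List[str]:
    """Union both candidate lists, drop duplicates, and keep the top_n by score."""
    merged: List[str] = []
    seen = set()
    for chunk in hierarchical + global_:
        if chunk not in seen:
            seen.add(chunk)
            merged.append(chunk)
    return sorted(merged, key=lambda ch: score(query, ch), reverse=True)[:top_n]


query = "oil leakage found at bushing of main transformer"
hier = ["oil leakage at transformer bushing", "abnormal sound in cooling fan"]
glob = ["transformer bushing oil level low", "oil leakage at transformer bushing"]

top = integrate_and_rerank(query, hier, glob, token_overlap, top_n=2)
print(top)
```

Passing only `hier` or only `glob` (with the other list empty) reproduces the single-mechanism baselines, so the retrieval stage is the only thing that varies between the three strategies.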
The results of the retrieval stage experiments on the TCAD dataset are presented in Table 2. The IHGR-RAG model proposed in this paper outperformed the individual global and hierarchical retrieval models on the key indicators of the retrieval stage. A larger pool of candidate chunks increases the likelihood of including the uniquely correct guideline item. Although IHGR-RAG did not achieve the highest performance in terms of 
 and 
, as well as 
 and 
, only five candidate chunks were passed from the retrieval stage to the generation stage. Therefore, our primary focus was on the performance of these metrics at $k = 5$. Specifically, the corresponding value for IHGR-RAG was 0.9612, indicating that 96.12% of the positive query samples successfully identified the correct guideline among the top-five candidate guidelines. This represents an increase of 
 and 
 over the two single-mechanism retrieval models, respectively. When considering the ranking, for the values of 
 and 
, IHGR-RAG had slightly improved performance over the other two models. The reason why IHGR-RAG outperformed the global approach in terms of 
 is that it incorporates hierarchical results of candidate chunks, thereby avoiding overfitting of the global mechanism and enhancing result diversity. This diversity proves to be highly beneficial for improving the performance of metrics at the generation stage.
At the generation stage, the GCA value of IHGR-RAG was 0.8276, indicating an 82.76% probability of accurately matching the sole correct guideline item across all positive and negative samples, as shown in Table 3. That probability was 
 and 
 higher than that of Global-RAG and Hierarchical-RAG, respectively. For GSA, which only considers the score, the value for IHGR-RAG was also 
 and 
 higher than that for the other two models, respectively. In terms of NR, Hierarchical-RAG significantly outperformed both Global-RAG and IHGR-RAG. This indicates that the hierarchical retrieval mechanism is more effective at recognizing queries that have no matching guideline, without erroneously matching guidelines or imposing deductions where none should be applied. The candidate chunks input to the generation stage are reranked by integrating the results of both the hierarchical and global retrieval mechanisms, ensuring that the most relevant candidates are selected. The reranking principle involves comparing each candidate chunk with the input query, which results in a higher overall similarity between the candidate chunks and the query. Although this increases the likelihood of misclassifying negative samples during the generation stage, thereby compromising NR performance, this strategy significantly benefits the ultimate objective of accurately identifying the unique correct guideline item.
Second, the comparative experimental results of the three retrieval mechanisms on the PECAD dataset are presented in Table 4 and Table 5. The execution times of the global, hierarchical, and hybrid retrieval methods were 30,788 s, 29,954 s, and 31,088 s, respectively. The inference latency averaged approximately 20 s, with no notable differences among the methods. In general, the global retrieval mechanism achieved superior results in terms of metrics 
 and 
 but performed suboptimally in the generation stage. Since the retrieval stage metrics do not account for negative samples, although the global retrieval mechanism slightly outperformed the hybrid retrieval mechanism in 
 for positive samples, it was generally inferior to the hybrid mechanism when negative samples were considered in the generation stage metrics.
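As a rough consistency check on the reported latency, and assuming that each total execution time covers all 1514 PECAD queries (an assumption; the text does not state this explicitly), the per-query averages work out as follows:

```latex
\begin{aligned}
\text{Global:}       &\quad \frac{30{,}788\ \text{s}}{1514} \approx 20.3\ \text{s per query} \\
\text{Hierarchical:} &\quad \frac{29{,}954\ \text{s}}{1514} \approx 19.8\ \text{s per query} \\
\text{Hybrid:}       &\quad \frac{31{,}088\ \text{s}}{1514} \approx 20.5\ \text{s per query}
\end{aligned}
```

Under this assumption, the three values differ by less than one second, which agrees with the stated average of approximately 20 s and with the observation that the methods show no notable latency differences.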
The hierarchical retrieval mechanism demonstrated the poorest performance at the retrieval stage on this dataset. However, it improved the  metric by  and  compared to the other two mechanisms, respectively. The underlying reason for this phenomenon lies in the similarity between the component information described in the GIS guideline and the types of various power equipment. For instance, GIS components include circuit breakers, disconnectors, grounding disconnectors, current transformers, voltage transformers, and surge arresters. These components were easily matched with the guidelines of other equipment types during the hierarchical retrieval stage. Nevertheless, the hierarchical retrieval mechanism still exhibited strong performance for other equipment guidelines, particularly when functional elements or failure modes were missing in certain items of the SF6 circuit breaker guideline.
In the face of complex variations in guidelines, the hybrid retrieval mechanism remained the most effective in terms of , which contributed to improved performance at the generation stage. The hybrid mechanism improved  performance by  and  compared to the global and hierarchical mechanisms, respectively, and  performance by  and . Regarding , the hybrid retrieval mechanism benefited from the hierarchical mechanism’s contribution to the diversity of candidate texts and ultimately outperformed the global retrieval mechanism on the PECAD dataset.
Third, we examined the impact of the number of candidate chunks on the objective of matching the sole correct guideline. We selected two variables on the TCAD dataset, 
 and 
, with various combinations of values ranging from four to six, and analyzed the performance of the IHGR-RAG model with respect to 
 and 
. As can be seen from 
Figure 3, when the values of 
 and 
 were close to four, the performance of 
 and 
 was superior to that of other value ranges. Compared with 
, when the value of 
 was smaller, the model achieved better performance. This indicates that a larger number of candidate global retrieval chunks may lead to greater confusion for the model.
Fourth, we investigated the impact of various LLMs on the retrieval and generation outcomes of the IHGR-RAG framework on the TCAD dataset. As demonstrated in 
Figure 4, the IHGR-RAG model that employed Qwen2.5-14B-Instruct as its LLM marginally outperformed its counterparts employing the other two LLMs across the three evaluated retrieval metrics: 
, 
, and 
. The IHGR-RAG model that utilized Qwen2.5-14B-Instruct significantly outperformed the other two scenarios in terms of both GCA and GSA values, achieving an improvement of at least 
. In contrast, the model employing Baichuan2-13B-Chat as the LLM demonstrated the best NR performance but exhibited the poorest performance in terms of GCA and GSA. We attribute this discrepancy to the more cautious judgment strategy adopted by this LLM. In addition, based on the comparative efficiency data presented in Table 6, Qwen2.5-14B-Instruct was selected primarily for its balance between inference latency and model capability.
Finally, we designed ablation experiments on the TCAD dataset to verify separately whether each of our proposed improvements had a positive effect on model performance. Each model variant was run three times, and the results passed Student’s t-test (p < 0.05). Due to the implementation of greedy decoding in the LLM, the output remained consistent across all inference rounds. In 
Table 7, the IHGR-RAG model slightly outperformed the other models in terms of the two retrieval metrics: 
 and 
. Regarding the MRR metric, the performance of our proposed model was also above average. We observed that the reranker method significantly enhanced the model’s performance. For the more critical 
 and 
 metrics in 
Table 8, our proposed model outperformed those in ablation experiment scenarios, thereby validating the effectiveness of our proposed improvements. Specifically, for the 
 metric, the performance improvement of IHGR-RAG over other models ranged from 
 to 
. For the 
 metric, the improvement of IHGR-RAG over other models ranged from 
 to 
.