4.1. Model Parameter Settings and Dataset
The IHGR-RAG framework achieves multi-model compatibility through a modular design: its core architecture supports the flexible combination of mainstream text embedding models with an LLM, which may be either closed-source or open-source and either fine-tuned or not. Considering the security characteristics of data in the power sector, this study built an offline deployment environment using open-source models for performance evaluation. The vector index was built on the FAISS library [44] to achieve efficient similarity retrieval. The computing environment was a 4×NVIDIA RTX 3090 (24 GB VRAM) GPU cluster running Python 3.9, and the complete chain was built by integrating the PyTorch deep learning framework with the LangChain 0.1.9 development framework to ensure the reproducibility of the experiments.
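As an illustrative sketch of this offline indexing setup (the example chunks, paths, and variable names below are hypothetical, not the exact implementation), the coded guideline items can be embedded with bge-m3 and stored in a FAISS index through LangChain:

```python
# Illustrative sketch: bge-m3 embeddings served locally through LangChain,
# with FAISS as the vector store. Example chunks and paths are hypothetical.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# One chunk per coded guideline item (hypothetical examples).
guideline_chunks = [
    "1.1.1 Transformer - winding - insulation aging - deduct 10 points",
    "1.1.2 Transformer - bushing - oil leakage - deduct 8 points",
]

embedder = HuggingFaceEmbeddings(model_name="BAAI/bge-m3")  # runs offline once downloaded
index = FAISS.from_texts(guideline_chunks, embedding=embedder)
index.save_local("guideline_faiss_index")

# Efficient similarity retrieval for a failure-feature query.
candidates = index.similarity_search("No.1 main transformer oil chromatography abnormal", k=5)
```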
To optimize resource utilization, the Qwen2.5-14B-Instruct model was employed as the baseline generative LLM in both stages of the processing pipeline, and greedy decoding was used to minimize output uncertainty. The baseline embedding model and reranking model were bge-m3 and bce-reranker-base-v1, respectively. In the initial experiments, the top-k retrieval parameters were configured such that one of them was set to seven, while the remaining three were all set to five.
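A minimal sketch of the greedy-decoding configuration, assuming the LLM is loaded with the Hugging Face transformers library (the prompt content below is a placeholder, not the actual template used in the two pipeline stages):

```python
# Sketch of greedy decoding with Qwen2.5-14B-Instruct; do_sample=False removes
# sampling randomness so repeated runs yield identical outputs.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-14B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

messages = [
    {"role": "system", "content": "You assess power equipment condition against guideline items."},
    {"role": "user", "content": "<query and top-k candidate guideline chunks go here>"},
]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)  # greedy decoding
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```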
The experimental dataset architecture consisted of three core components: (1) a guideline knowledge base serving as the index source; (2) a collection of equipment failure feature texts serving as queries; and (3) manually annotated gold-standard labels for model performance evaluation.
Among them, the knowledge base was constructed around a coding system that classifies the guidelines, where each code number corresponds to one guideline item. The equipment failure feature descriptions typically come from defect reports, maintenance reports, test reports, and system alarm information. Their format is not fixed, but they generally include key information such as the substation name, the equipment name, and its dispatch number. We initially selected 280 transformer failure feature descriptions from daily reports and alarm records and manually annotated them to construct the Transformer Condition Assessment Dataset (TCAD), which consists of 280 input queries and their corresponding guideline items, ensuring high annotation accuracy and data diversity. In addition, classification labels were used to differentiate positive samples (label 1), which correspond to specific guideline items, from negative samples (label 2), which lack such a correspondence. Given the limited total sample size, we fixed the positive-to-negative sample ratio so as to ensure an adequate representation of negative instances.
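For illustration, one annotated TCAD record might be organized as follows (the field names and values are hypothetical; label 1 indicates that a matching guideline item exists, label 2 that none does):

```python
# Hypothetical structure of a single annotated TCAD record.
tcad_record = {
    "query": "No.2 main transformer at XX substation: winding hot-spot temperature alarm",
    "label": 1,                          # 1 = positive sample, 2 = negative sample
    "guideline_code": "2.3.1",           # code number of the unique correct guideline item
    "deduction_content": "Winding temperature exceeds the alarm threshold",
    "deduction_value": 10,
}
```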
To enhance the statistical significance of the experiments, we selected four types of equipment, namely transformers, SF6 circuit breakers, Gas-Insulated Switchgear (GIS), and Capacitive Voltage Transformers (CVT), to construct the Power Equipment Condition Assessment Dataset (PECAD). This dataset comprises 1514 input queries and their corresponding guideline items, with the sample proportions for each equipment type detailed in Table 1. The dataset was constructed using LLM-based data augmentation, grounded in the established guidelines and a small set of real-world daily reports and alarm records, to generate semantically similar input queries. Half of the dataset was randomly selected for manual verification, which confirmed the consistency between the automatically generated query–guideline pairs and the manually annotated references. The dataset encompasses a total of 542 distinct guideline items across the four equipment types. To evaluate the proposed method's sensitivity to guideline formatting, we modified a subset of SF6 circuit breaker guideline items by removing the Functional Element or Failure Mode information, in alignment with the actual operational characteristics of the corresponding guideline items; for example, "SF6 circuit breaker—operating mechanism—opening and closing circuit/(failure mode absent)—opening and closing coil burned out". Given the sufficient size of the PECAD, the positive-to-negative sample ratio was set to ensure a balanced representation of both classes. An illustration of the model input and output is provided in Appendix A, as shown in Figure A3.
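The augmentation step can be pictured with the following sketch, in which an LLM rewrites a seed failure description associated with a guideline item into several semantically similar queries; the prompt wording and function name are assumptions rather than the exact prompt used:

```python
# Hedged sketch of the LLM-based data augmentation used to expand PECAD.
def build_augmentation_prompt(guideline_item: str, seed_report: str, n: int = 3) -> str:
    """Ask the generative LLM for n paraphrased failure descriptions tied to one guideline item."""
    return (
        f"Guideline item: {guideline_item}\n"
        f"Real failure description: {seed_report}\n"
        f"Generate {n} new failure descriptions that correspond to the same guideline item, "
        f"keeping the substation name, equipment name, and dispatch number style, "
        f"but varying the wording and the observed symptoms."
    )

prompt = build_augmentation_prompt(
    "SF6 circuit breaker - operating mechanism - low SF6 pressure alarm",
    "110 kV breaker 512 at XX substation reported a low SF6 gas pressure alarm.",
)
# The prompt is sent to the generative LLM; half of the generated pairs were manually verified.
```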
4.2. Evaluation Benchmark
To accurately assess whether the model could provide the unique correct guideline item associated with the query, we improved the RAG end-to-end evaluation benchmark and proposed an independent evaluation benchmark specifically for the dynamic assessment of power equipment condition. This benchmark comprehensively evaluates the model’s performance by separately assessing the retrieval stage and the generation stage. It employs an exact match mechanism based on guideline code numbers. A correct match is determined only when the code number, derived from the output deduction content, exactly matches the guideline code number.
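The exact-match check can be illustrated as follows; the regular expression and the assumption that the code number can be parsed from the generated deduction content are simplifications for illustration only:

```python
import re

# Illustrative exact-match mechanism based on guideline code numbers: a response is
# counted as correct only if the code number extracted from the generated deduction
# content equals the annotated guideline code number.
CODE_PATTERN = re.compile(r"\b\d+(?:\.\d+)+\b")  # e.g., "2.3.1"

def is_correct_match(generated_text: str, gold_code: str) -> bool:
    codes = CODE_PATTERN.findall(generated_text)
    return bool(codes) and codes[0] == gold_code

print(is_correct_match("Matched guideline 2.3.1: winding temperature alarm, deduct 10 points", "2.3.1"))  # True
```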
When evaluating the performance of the model retrieval stage, we selected three indicators: the top-k hit rate (HR@k), the mean reciprocal rank (MRR), and the normalized discounted cumulative gain (NDCG). The metric conventionally used to evaluate retrieval performance assumes that, for each query, there may be multiple positive examples among the top-k retrieval results; for instance, if three of the top-five retrieval results are positive examples, the metric is calculated as 0.6. However, our task requires a single exact retrieval result conforming to the annotated guideline, so this conventional metric is not applicable to the current scenario. In contrast, the HR@k metric we propose is specifically designed to assess whether the retrieval results for each query contain the unique correct match to the annotated guideline. Its calculation formula is as follows:
$$\mathrm{HR}@k=\frac{N_{\mathrm{hit}}}{N_{\mathrm{pos}}},$$
where $N_{\mathrm{pos}}$ represents the total number of queries in the dataset with a classification label of one, which equals the number of positive samples, and $N_{\mathrm{hit}}$ indicates the number of these queries for which the top-k retrieval results after reranking contain the unique correct guideline, regardless of the specific position of this guideline within the top-k results.
The MRR metric measures the ranking of the correct chunk among the top-k retrieval results of each query. Its calculation formula is as follows:
$$\mathrm{MRR}=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{\mathrm{rank}_i},$$
where $N$ represents the total number of queries and $\mathrm{rank}_i$ indicates the rank position of the first correct retrieval result for the $i$-th query.
The NDCG metric measures the ranking position of the correct retrieval result among all retrieval results for each query. Since only matching the unique correct guideline is of practical significance in our evaluation task, the setting logic of the relevance score differs significantly from that of the traditional NDCG. The traditional formulation can be expressed as:
$$\mathrm{DCG}@k=\sum_{i=1}^{k}\frac{2^{rel_i}-1}{\log_2(i+1)},\qquad \mathrm{IDCG}@k=\sum_{i=1}^{|\mathrm{REL}_k|}\frac{2^{rel_i}-1}{\log_2(i+1)},\qquad \mathrm{NDCG}@k=\frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k},$$
where $rel_i$ represents the relevance score of the $i$-th retrieval result, usually an integer ranging from zero to nine, and $|\mathrm{REL}_k|$ denotes the number of results obtained when the top-k retrieval results are sorted in descending order of their true relevance. In our benchmark, if the $i$-th retrieval result of the $q$-th query matches the labeled guideline, its relevance score is set to one, and the relevance scores of all other retrieval results are set to zero:
$$rel_i=\begin{cases}1, & \text{if the } i\text{-th result matches the labeled guideline,}\\ 0, & \text{otherwise.}\end{cases}$$
Since at most one retrieval result is relevant, $\mathrm{IDCG}@k=1$, and DCG@k and NDCG@k reduce to
$$\mathrm{DCG}@k=\mathrm{NDCG}@k=\frac{1}{\log_2\!\left(r_q+1\right)},$$
where $r_q$ denotes the position of the correct retrieval result within the top-k results of the $q$-th query. When the chunk that matches the correct guideline is retrieved but does not appear in the top-k retrieval results ($r_q>k$), the value of NDCG@k is zero. We report the average NDCG@k over all positive samples; the larger its value, the higher the ranking of the unique correct retrieval result for most positive input samples. The averaged metric can be expressed as:
$$\overline{\mathrm{NDCG}}@k=\frac{1}{N_{\mathrm{pos}}}\sum_{q=1}^{N_{\mathrm{pos}}}\mathrm{NDCG}@k_{(q)},$$
where $N_{\mathrm{pos}}$ indicates the total number of positive samples.
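Under the binary-relevance setting above, the three retrieval metrics can be computed as in the following sketch (the data structures and names are illustrative; misses outside the top-k contribute zero to MRR and NDCG):

```python
import math

# Illustrative computation of HR@k, MRR, and averaged NDCG@k for positive samples.
# `results` maps a query id to its reranked list of guideline codes; `gold` maps the
# query id to the unique correct code.
def retrieval_metrics(results: dict, gold: dict, k: int = 5) -> dict:
    hits, rr_sum, ndcg_sum = 0, 0.0, 0.0
    for qid, ranked_codes in results.items():
        top_k = ranked_codes[:k]
        if gold[qid] in top_k:
            rank = top_k.index(gold[qid]) + 1         # 1-based position of the correct chunk
            hits += 1
            rr_sum += 1.0 / rank
            ndcg_sum += 1.0 / math.log2(rank + 1)     # IDCG = 1 when only one result is relevant
    n = len(results)
    return {"HR@k": hits / n, "MRR": rr_sum / n, "NDCG@k": ndcg_sum / n}

print(retrieval_metrics({"q1": ["2.3.1", "1.1.2"], "q2": ["9.9.9", "4.4.4"]},
                        {"q1": "2.3.1", "q2": "4.4.4"}, k=5))
```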
On the other hand, for the text generation stage, we propose adopting Negative Rejection (NR), Generated Content Accuracy (GCA), and Generated Scores' Accuracy (GSA) as the evaluation metrics.
As mentioned earlier, the LLM may generate either the complete content of a guideline or the negative response "Please transfer to manual processing". For query samples classified as label 2, which have been manually labeled as having no correct associated guideline, we expect the LLM to output the negative response. NR evaluates the LLM's ability to "correctly reject", and its formula is as follows:
$$\mathrm{NR}=\frac{N_{\mathrm{rej}}}{N_{\mathrm{neg}}},$$
where $N_{\mathrm{neg}}$ denotes the number of queries in the dataset with a classification label of 2, which equals the total number of negative samples, and $N_{\mathrm{rej}}$ indicates the number of these queries for which the LLM generates the correct rejection response.
The content generated by the LLM may not be entirely accurate. GCA assesses the exact-match correctness between the output deduction content and the annotated deduction content in the dataset. Its calculation formula is as follows:
$$\mathrm{GCA}=\frac{N_{\mathrm{con}}}{N_{\mathrm{pos}}+N_{\mathrm{neg}}},$$
where $N_{\mathrm{con}}$ represents the number of queries for which the LLM correctly generates the deduction content.
In addition, some operation and maintenance personnel may only focus on the deduction values in the equipment condition assessment and use them as the basis for determining the current condition level of the equipment, without necessarily paying attention to the specific deduction content. In such cases, even if the deduction content generated by the LLM is incorrect, the deduction value it outputs may still be consistent with the correct deduction value associated with the guideline. Therefore, GSA is commonly higher than GCA. The formula for GSA is as follows:
$$\mathrm{GSA}=\frac{N_{\mathrm{val}}^{+}+N_{\mathrm{val}}^{-}}{N_{\mathrm{pos}}+N_{\mathrm{neg}}},$$
where $N_{\mathrm{val}}^{+}$ represents the number of positive-sample queries for which the LLM correctly generates the deduction value, and $N_{\mathrm{val}}^{-}$ represents the number of negative-sample queries for which the LLM correctly generates a zero deduction value, which is numerically consistent with $N_{\mathrm{rej}}$.
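The three generation-stage metrics can then be computed as in the sketch below, under the assumption (consistent with the definitions above) that, for a negative sample, a correct rejection simultaneously counts as correct deduction content and as a zero deduction value; the record fields are illustrative:

```python
# Illustrative computation of NR, GCA, and GSA from per-query records.
REJECT = "Please transfer to manual processing"

def generation_metrics(records: list) -> dict:
    pos = [r for r in records if r["label"] == 1]
    neg = [r for r in records if r["label"] == 2]
    n_rej = sum(1 for r in neg if r["generated_content"].strip() == REJECT)
    n_con = sum(1 for r in pos if r["generated_content"] == r["gold_content"]) + n_rej
    n_val = sum(1 for r in pos if r["generated_value"] == r["gold_value"]) + n_rej
    total = len(pos) + len(neg)
    return {"NR": n_rej / len(neg), "GCA": n_con / total, "GSA": n_val / total}
```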
4.3. Experimental Results
We first performed comparative experiments to validate the necessity of the IHGR-RAG model and to assess its potential performance benefits. The experiments evaluated three distinct retrieval strategies: (1) the individual global retrieval approach (Global-RAG), which is functionally equivalent to conventional RAG; (2) the individual hierarchical retrieval approach (Hierarchical-RAG); and (3) the integrated hierarchical and global retrieval approach (IHGR-RAG). To ensure experimental fairness, the text generation and text alignment stages remained identical across all three strategies; the sole variable was the methodology employed in the retrieval stage. Furthermore, in each strategy, the retrieved text chunks were reranked before being passed to the generation stage.
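A heavily simplified sketch of the three compared configurations is given below; it assumes the hybrid strategy simply unions the global and hierarchical candidate pools before the shared reranking step, which is an illustration rather than the exact IHGR-RAG merging logic:

```python
# Simplified sketch of the three retrieval strategies; the retrievers, reranker, and
# the union-based merge for the hybrid case are illustrative assumptions.
def retrieve(query, strategy, global_retriever, hierarchical_retriever, reranker, k=5):
    if strategy == "global":
        pool = global_retriever(query)
    elif strategy == "hierarchical":
        pool = hierarchical_retriever(query)
    else:  # hybrid: merge both candidate pools, removing duplicates while keeping order
        pool = list(dict.fromkeys(global_retriever(query) + hierarchical_retriever(query)))
    # The same reranker scores each candidate chunk against the query in all strategies.
    return sorted(pool, key=lambda chunk: reranker(query, chunk), reverse=True)[:k]
```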
The results of the retrieval-stage experiments on the TCAD dataset are presented in Table 2. The IHGR-RAG model proposed in this paper demonstrated superior performance compared with the individual global and hierarchical retrieval models across the key indicators of the retrieval stage. A larger pool of candidate chunks increases the likelihood of including the uniquely correct guideline item. Although IHGR-RAG did not achieve the highest performance on every metric at every cutoff, we passed five candidate chunks from the retrieval stage to the generation stage; therefore, our primary focus was on evaluating these metrics at k = 5. Specifically, the HR@5 of IHGR-RAG was 0.9612, indicating that 96.12% of the positive query samples had the correct guideline among the top-five candidate guidelines, an improvement over both single-mechanism retrieval models. When considering ranking quality, as reflected by MRR and NDCG, IHGR-RAG performed slightly better than the other two models. The reason why IHGR-RAG outperformed the global approach is that it incorporates the hierarchical mechanism's candidate chunks, thereby avoiding overfitting of the global mechanism and enhancing result diversity. This diversity proves to be highly beneficial for improving the metrics at the generation stage.
At the generation stage, the GCA value of IHGR-RAG was 0.8276, indicating an 82.76% probability of accurately matching the sole correct guideline item across all positive and negative samples, as shown in Table 3. This probability was higher than that of both Global-RAG and Hierarchical-RAG. For GSA, which only considers the deduction score, the value for IHGR-RAG was likewise higher than that of the other two models. In terms of NR, Hierarchical-RAG significantly outperformed both Global-RAG and IHGR-RAG, indicating that the hierarchical retrieval mechanism is more effective at identifying results that should be rejected, without erroneously matching guidelines or imposing deductions on values that should not be deducted. The candidate chunks input to the generation stage are reranked by integrating the results of both the hierarchical and global retrieval mechanisms, ensuring that the most relevant candidates are selected. Because the reranking principle compares each candidate chunk with the input query, the overall similarity between the candidate chunks and the query is higher. Although this increases the likelihood of misclassifying negative samples during the generation stage, thereby compromising NR performance, this strategy significantly benefits the ultimate objective of accurately identifying the unique correct guideline item.
Second, the comparative experimental results of the three retrieval mechanisms on the PECAD dataset are presented in Table 4 and Table 5. The total execution times of the global, hierarchical, and hybrid retrieval methods were 30,788 s, 29,954 s, and 31,088 s, respectively, and the average inference latency was approximately 20 s, with no notable differences among the methods. In general, the global retrieval mechanism achieved superior results on the retrieval-stage metrics but performed suboptimally at the generation stage. Since the retrieval-stage metrics do not account for negative samples, although the global retrieval mechanism slightly outperformed the hybrid retrieval mechanism on these metrics for positive samples, it was generally inferior to the hybrid mechanism once negative samples were considered in the generation-stage metrics.
The hierarchical retrieval mechanism demonstrated the poorest performance at the retrieval stage on this dataset, although it still outperformed the other two mechanisms on one of the evaluation metrics. The main reason for its poor retrieval performance lies in the similarity between the component information described in the GIS guideline and the names of various power equipment types. For instance, GIS components include circuit breakers, disconnectors, grounding disconnectors, current transformers, voltage transformers, and surge arresters, and these components were easily matched with the guidelines of other equipment types during hierarchical retrieval. Nevertheless, the hierarchical retrieval mechanism still exhibited strong performance for the other equipment guidelines, particularly when functional elements or failure modes were missing in certain items of the SF6 circuit breaker guideline.
In the face of complex variations in the guidelines, the hybrid retrieval mechanism remained the most effective at matching the unique correct guideline, which contributed to improved performance at the generation stage: it improved the generation-stage accuracy metrics relative to both the global and hierarchical mechanisms. Regarding the ranking-oriented metrics, the hybrid retrieval mechanism benefited from the hierarchical mechanism's contribution to the diversity of candidate texts and ultimately outperformed the global retrieval mechanism on the PECAD dataset.
Third, we examined the impact of the number of candidate chunks on the objective of matching the sole correct guideline. On the TCAD dataset, we varied two of the top-k retrieval parameters over combinations of values ranging from four to six and analyzed the performance of the IHGR-RAG model. As can be seen from Figure 3, when both values were close to four, the performance was superior to that obtained in other value ranges. In particular, the model achieved better performance when the number of global retrieval candidates was smaller, which indicates that a larger number of candidate global retrieval chunks may lead to greater confusion for the model.
Fourth, we investigated the impact of different LLMs on the retrieval and generation outcomes of the IHGR-RAG framework on the TCAD dataset. As demonstrated in Figure 4, the IHGR-RAG model that employed Qwen2.5-14B-Instruct as its LLM marginally outperformed its counterparts employing the other two LLMs across the three retrieval metrics, and it significantly outperformed the other two configurations in terms of both GCA and GSA. In contrast, the model employing Baichuan2-13B-Chat as the LLM achieved the best NR performance but exhibited the poorest GCA and GSA, a discrepancy we attribute to the more cautious judgment strategy adopted by that LLM. In addition, based on the comparative efficiency data presented in Table 6, Qwen2.5-14B-Instruct was selected primarily because of its optimal balance between inference latency and model capability.
Finally, we designed ablation experiments on the TCAD dataset to verify whether each of the proposed improvements had a positive effect on model performance. Each configuration was run three times, and the results passed Student's t-test (p < 0.05); owing to the use of greedy decoding in the LLM, the output remained consistent across all inference rounds. As shown in Table 7, the IHGR-RAG model slightly outperformed the other models on two of the retrieval metrics, and its MRR was also above average. We observed that the reranking step significantly enhanced the model's performance. For the more critical GCA and GSA metrics in Table 8, our proposed model outperformed all ablation variants, achieving consistent gains on both metrics and thereby validating the effectiveness of the proposed improvements.