1. Introduction
Traditional methods for power grid design lack dynamic knowledge-updating mechanisms, making it difficult to adapt to business scenarios that require multi-energy integration and high resilience. Meanwhile, the deep integration of big data in power systems with artificial intelligence technologies offers new perspectives for industry innovation. However, the effective utilization of power domain knowledge still faces critical bottlenecks: existing industry terminology standards struggle to cover all domain-specific vocabulary, including colloquial terms, alternative names, and abbreviations, and suffer from ambiguous definitions and delayed updates. These challenges significantly restrict the semantic parsing capabilities of intelligent design systems [1]. In this context, as the semantic cornerstone of knowledge-intensive systems, the quality of domain-specific lexicons directly determines how deeply intelligent systems can understand professional knowledge. Particularly in the power sector, a critical national infrastructure domain, constructing highly accurate domain-specific lexicons has become a core requirement for supporting the intelligent transformation of power grids.
Domain-specific lexicons serve as semantic analysis tools for specific domains, compiling and distilling vocabulary from domain-related corpora to convey the characteristic information of the field. The development of domain-specific lexicon construction techniques has undergone three primary stages [2]: manual methods, statistical methods, and deep learning-based methods. Early-stage lexicon construction relied mainly on expert knowledge and the manual compilation of entries, which typically produced high-quality lexicons but at the cost of significant time and human resources. Statistical methods leveraged large-scale corpora, automatically extracting terminology by calculating co-occurrence frequencies and other statistical features between words. However, these methods depend heavily on high-quality corpora and struggle to handle complex linguistic phenomena such as polysemy and collocations. Deep learning methods use neural networks to automatically learn word representations and incorporate contextual information for term recognition and classification. Nonetheless, this approach requires vast amounts of annotated data for model training, demands substantial computational resources, and suffers from poor model interpretability. Despite these challenges, deep learning methods have provided new ideas and technical means for domain-specific lexicon construction, driving further advancements in the field.
Domain-specific lexicons are widely used in data mining across various fields, such as the accident disaster domain [3], the social safety event domain [4], and the judicial domain [5]. The construction of domain-specific lexicons primarily involves three stages [6]: (1) Preprocessing: converting file formats and denoising texts for Chinese corpora. (2) Seed Vocabulary Extraction: after segmentation, the results are evaluated, and word strings that do not meet the criteria are filtered out to obtain the seed vocabulary. (3) Synonym Mining: by calculating word vector similarities, synonyms of the seed vocabulary are discovered to expand the domain-specific lexicon.
As the foundational component of a domain-specific lexicon, the quality of seed vocabulary directly impacts the effectiveness and application scope of the final lexicon. Its construction methods have evolved through three paradigms:
(1) Rule-based Methods: Early studies utilized manually defined regular expressions, grammar rules, or statistical thresholds to screen candidate terms, such as the TF-IDF weighted model [7], which selects vocabulary based on the distribution characteristics of terms within document sets. While these methods offer interpretability advantages, they rely heavily on manually designed feature rules, making it difficult to adapt to vast and complex semantic scenarios.
(2) Statistical Learning Methods: Represented by N-gram models [8] and mutual information combined with adjacency entropy methods [9], these approaches capture co-occurrence patterns of vocabulary using probabilistic models. For instance, Shen et al. [10] achieved a new-term discovery accuracy of 76% by combining Jieba segmentation with statistical filtering for seed vocabulary extraction.
(3) Deep Learning Methods: Utilizing Word2Vec [11], BERT [12], and other embedding techniques, these methods capture semantic associations through distributed representations, enabling the identification of semantically similar words. However, deep models require large-scale annotated data, which conflicts with the scarcity of power domain corpora.
Synonym expansion not only enriches the content of domain-specific lexicons but also enhances the model’s understanding of complex semantic relationships within the domain. Synonym mining can be divided into two directions: synonym differentiation and synonym set extraction.
Synonym differentiation aims to determine whether two given words are synonyms. Early research, such as Ramage et al.'s work [13], often relied on manually constructed lexical resources, querying these resources to judge synonymy between words. Mikolov et al.'s Word2Vec model [14] maps words into a continuous vector space, where semantically similar words lie closer together. This method not only improves efficiency but also identifies synonyms without pre-defined vocabularies. Additionally, some studies combine knowledge graphs with deep learning methods; for example, Guo et al. [15] proposed a neural network model incorporating knowledge graph information, significantly improving the accuracy of synonym differentiation.
Synonym set extraction involves automatically discovering new synonym combinations from large corpora. Traditional approaches utilize co-occurrence analysis [16], calculating the frequency at which words appear together in the same context as a measure of synonymy. Hamilton et al. [17] developed a Graph Neural Network (GNN)-based algorithm to extract synonym sets from text data, effectively capturing deeper semantic connections between words by modeling complex relationships. Some studies also explore unsupervised or semi-supervised learning strategies for synonym set mining [18] to reduce reliance on annotated data.
In summary, although existing methods have achieved certain successes in constructing lexicons for general or some vertical domains, the power design domain exhibits significant uniqueness:
(1) The terminology system in this domain is complex, encompassing both specialized compound terms like “tower load” and “insulation coordination”, as well as cross-domain high-frequency terms like “grounding” and “protection”. Traditional statistical models are susceptible to issues like corpus sparsity and semantic ambiguity.
(2) Current synonym mining techniques lack targeted solutions for mapping synonyms among named entities such as power equipment models and standard codes, which are core elements of the power design knowledge system.
Therefore, the primary problems addressed in this study include the following:
(1) How can one construct a high-quality seed vocabulary set for the power design domain?
(2) How can one effectively handle complex linguistic phenomena, such as polysemy and collocations, in the power design domain based on the vocabulary set, thereby expanding it into a high-quality domain-specific lexicon?
This research aims to enhance the accuracy and efficiency of constructing domain-specific lexicons for urban power grid design using Large Language Models (LLMs), as illustrated in Figure 1.
The novelty of this research is reflected in the following:
Unlike existing studies that use statistical methods or LLMs independently, this study proposes a three-level collaborative architecture of “statistics–extension–correction”. In the statistical stage, the traditional mutual information algorithm is improved, and domain coefficients defined by power industry standards are introduced to ensure the recall of core vocabulary. In the expansion stage, a power-specific prompt template is designed to constrain the LLM to generate synonyms suitable for engineering scenarios. In the correction stage, an innovative self-correction mechanism is integrated with the LLM to reduce the synonym error rate through secondary verification. Meanwhile, static lexicon construction methods raise the cost of updating and maintaining a dictionary; this study supports the real-time expansion of new terms through structured corpora and incremental prompt templates to meet the rapid iteration requirements of power grid design standards.
The core hypothesis of this research is that LLM-driven term extraction and synonym expansion can significantly improve both the accuracy and coverage of power design lexicons, overcoming limitations of traditional methods in handling domain-specific semantic ambiguity and terminology sparsity.
2. Materials and Methods
The framework of this study is divided into three stages: data preprocessing, seed vocabulary extraction, and synonym discovery, as illustrated in Figure 2.
In the data preprocessing stage, we first define the scope of our corpus and perform the slicing and recombination of standard entries to create a structured corpus suitable for subsequent processing needs. In the seed vocabulary extraction stage, we utilize the Jieba segmentation tool for word segmentation and combine mutual information and entropy-based methods to filter out non-specialized vocabulary, thereby establishing a multi-level candidate term filtering system that significantly enhances the accuracy of seed vocabulary extraction. Finally, in the synonym discovery stage, we leverage the semantic understanding capabilities of Large Language Models (LLMs) to conduct synonym mining within the power domain, further enriching the lexicon content and expanding its scale.
2.1. Corpus Determination and Data Preprocessing
Corpus construction strictly adheres to the principles of scientific rigor, representativeness, and practicality. The primary sources are national and industry standards, specifically including the following:
In addition, to enhance the practicality and coverage of the corpus, various types of design documents, such as preliminary design reports and construction drawing design explanations, as well as actual engineering cases, have been included.
During the data preprocessing stage, we employ an automated text processing method based on regular expressions to split the original text into entries, extracting clause-level semantic units. The matching expression targets the Chinese standard clause format (e.g., “第1条”, “第1.2条”): it captures the clause number and its content up to the next clause heading or the end of the document.
This process converts the original PDF documents into a structured JSON format, where each data unit contains four fields: standard number, clause number, technical content, and associated figure/table index. This provides a solid data foundation for subsequent text mining and analysis.
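Since the exact expression is not reproduced here, the following Python sketch should be read as a plausible reconstruction: it assumes a clause pattern of the shape described above and a simplified record schema for the four JSON fields; the helper name split_clauses and the sample standard number are illustrative only.

```python
import json
import re

# A plausible reconstruction of the clause-matching expression: it anchors on
# "第<number>条" headings (e.g., "第1条", "第1.2条") and captures the clause
# number plus everything up to the next heading or the end of the document.
CLAUSE_PATTERN = re.compile(r"第(\d+(?:\.\d+)*)条(.*?)(?=第\d+(?:\.\d+)*条|\Z)", re.S)

def split_clauses(standard_no: str, text: str) -> list[dict]:
    """Split a standard's full text into structured clause-level records."""
    records = []
    for number, content in CLAUSE_PATTERN.findall(text):
        records.append({
            "standard_no": standard_no,   # standard number
            "clause_no": number,          # clause number, e.g., "1.2"
            "content": content.strip(),   # technical content of the clause
            "figure_table_index": [],     # associated figure/table index (filled later)
        })
    return records

# Example: two clauses become two JSON records.
sample = "第1条 总则内容。第1.2条 术语定义内容。"
print(json.dumps(split_clauses("DL/T XXXX", sample), ensure_ascii=False, indent=2))
```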
To ensure the quality of the corpus, we have established a comprehensive set of cleaning rules specific to the power domain, which include the following:
Removal of Non-technical Descriptions: Phrases like “This article explains as follows:” are deleted.
Standardization of Symbols and Units: For example, converting “≥” to “>=” and correcting “KVar” to “kvar”.
Normalization of Equipment Models: For example, standardizing “KYN28A-12” to “KYN28-12”.
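As a minimal illustration of these rules, the sketch below applies one example from each category; the phrase pattern, symbol map, and model map are small stand-ins for the full rule set (the “≤” mapping is our own assumed addition).

```python
import re

# Illustrative cleaning pass covering the three rule categories above.
NON_TECHNICAL = [r"本条文?说明如下[:：]?"]            # e.g., "This article explains as follows:"
SYMBOL_MAP = {"≥": ">=", "≤": "<=", "KVar": "kvar"}   # symbol/unit standardization
MODEL_MAP = {"KYN28A-12": "KYN28-12"}                  # equipment-model normalization

def clean_entry(text: str) -> str:
    for pattern in NON_TECHNICAL:
        text = re.sub(pattern, "", text)               # drop non-technical phrases
    for src, dst in {**SYMBOL_MAP, **MODEL_MAP}.items():
        text = text.replace(src, dst)                  # normalize symbols and models
    return text.strip()
```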
2.2. Candidate Term Extraction
In the seed vocabulary extraction stage, this study constructs a three-level stop word list tailored to the characteristics of the corpus, as shown in Table 1.
Based on this list, we employ a bidirectional filtering strategy to optimize the vocabulary extraction process. The forward elimination strategy directly matches terms against the stop word lists to remove irrelevant entries, while the reverse retention strategy uses a whitelist to protect engineering-specific symbols such as “N-1” and “kA”, ensuring that critical terminology is not mistakenly deleted.
The frequency of a term’s appearance in the corpus is indicative of its likelihood of being a high-quality professional term. To extract seed vocabulary, we use the Jieba segmentation tool for word segmentation and combine it with an improved TF-IDF algorithm [22] for term frequency feature analysis. The weight calculation model is given by Formula (1):

W(t) = (f(t)/N) × log(D/d(t)) × Cdomain(t)  (1)

where
f(t): The number of occurrences of term t in the corpus.
N: Total number of words in the corpus.
D: Total number of standard documents.
d(t): Number of documents containing term t.
Cdomain(t): Domain coefficient. Terms appearing in DL/T 1033-2016 “Power Industry Vocabulary” are assigned a coefficient of 1.5, while marginal terms receive a coefficient of 1.0.
By setting tiered screening thresholds, the extracted terms are classified into three categories based on their weights: high-frequency terms (W(t) ≥ 0.85 × max(W)), mid-frequency terms (0.6 × max(W) ≤ W(t) < 0.85 × max(W)), and low-frequency terms (W(t) < 0.6 × max(W)). Only high-frequency terms are selected for the candidate term filtering stage.
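A minimal sketch of Formula (1) and the tiered screening follows, assuming a corpus represented as tokenized documents and a domain_terms set standing in for the DL/T 1033-2016 vocabulary.

```python
import math
from collections import Counter

def weighted_tfidf(corpus: list[list[str]], domain_terms: set[str]) -> dict[str, float]:
    tf = Counter(t for doc in corpus for t in doc)        # f(t)
    n_words = sum(tf.values())                            # N
    n_docs = len(corpus)                                  # D
    df = Counter(t for doc in corpus for t in set(doc))   # d(t)
    weights = {}
    for term, freq in tf.items():
        c_domain = 1.5 if term in domain_terms else 1.0   # Cdomain(t)
        weights[term] = (freq / n_words) * math.log(n_docs / df[term]) * c_domain
    return weights

def high_frequency_terms(weights: dict[str, float]) -> list[str]:
    # Keep only the high-frequency tier: W(t) >= 0.85 * max(W).
    w_max = max(weights.values())
    return [t for t, w in weights.items() if w >= 0.85 * w_max]
```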
2.3. Candidate Term Filtering
To further enhance the quality of the seed vocabulary, this study proposes a filtering method based on mutual information and entropy for domain-specific professional vocabulary. This approach effectively improves the accuracy of domain term recognition by considering both the internal cohesion and external independence of vocabulary. The process involves three key steps:
1. Preprocessing and Candidate Vocabulary Generation: Perform sentence segmentation and word segmentation on the corpus text to generate high-frequency word segments as the candidate vocabulary set.
2. Mutual Information and Entropy Calculation: Calculate the mutual information and entropy for all candidate terms. The weighted sum of these two statistical metrics is used as the final scoring criterion.
3. Sorting and Manual Selection: Rank the terms by their scores and apply manual selection to obtain a high-quality set of domain-specific seed words.
Mutual information (MI) [23] is a crucial statistic for measuring the internal cohesion of word pairs, which is particularly advantageous in Chinese word segmentation problems. It is defined by Formula (2):

MI(x, y) = log₂ [ p(x, y) / (p(x) · p(y)) ]  (2)

where
p(x, y): Joint probability of co-occurrence of the two words.
p(x) and p(y): Marginal probabilities of each word occurring independently.
In practical applications, a higher MI value indicates a stronger bond between two characters, suggesting a higher likelihood of forming a valid word. This method is especially effective for identifying domain-specific terms and fixed expressions, as professional terminology often exhibits high internal cohesion.
Adjacent entropy [24] measures the external independence of a word string by calculating the uncertainty of the adjacent characters on either side of the candidate term. Left adjacent entropy (HL) and right adjacent entropy (HR) are calculated using Formulas (3) and (4), respectively:

HL(W) = −Σ_{WL∈SL} P(WL∣W) · log₂ P(WL∣W)  (3)
HR(W) = −Σ_{WR∈SR} P(WR∣W) · log₂ P(WR∣W)  (4)

where
SL and SR: Sets of left and right adjacent characters of the candidate term W.
P(WL∣W) and P(WR∣W): Conditional probabilities of specific characters being the left or right adjacent characters of W, estimated as P(WL∣W) = N(WL, W)/N(W) and P(WR∣W) = N(WR, W)/N(W),
where
N(WL, W) and N(WR, W): Counts of left and right adjacent characters co-occurring with the candidate term W.
N(W): Total number of occurrences of the candidate term W.
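The following sketch computes both statistics as defined above for a candidate string in a raw corpus; the bigram-level MI and the simple substring scan are simplifying assumptions, not the paper's exact implementation.

```python
import math
from collections import Counter

# MI is shown for a two-part candidate (x + y); adjacent entropy scans every
# occurrence of the candidate W and tallies its left/right neighbor characters.
def mutual_information(corpus: str, x: str, y: str) -> float:
    n = len(corpus)
    p_xy = corpus.count(x + y) / n                        # joint probability p(x, y)
    p_x, p_y = corpus.count(x) / n, corpus.count(y) / n   # marginals p(x), p(y)
    return math.log2(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

def adjacent_entropy(corpus: str, w: str) -> tuple[float, float]:
    left, right = Counter(), Counter()                    # N(WL, W) and N(WR, W)
    start = corpus.find(w)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1
        if start + len(w) < len(corpus):
            right[corpus[start + len(w)]] += 1
        start = corpus.find(w, start + 1)
    def entropy(counts: Counter) -> float:
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log2(c / total) for c in counts.values())
    return entropy(left), entropy(right)                  # (HL, HR)
```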
By integrating mutual information and adjacent entropy, we establish a dual-filtering mechanism to extract core seed vocabulary relevant to urban power grid design. This method not only evaluates the internal cohesion strength of vocabulary but also assesses its independence within contextual environments, significantly enhancing the accuracy of term recognition.
Finally, the candidate terms are ranked by their combined mutual information and adjacent entropy scores, and high-scoring terms are manually reviewed for relevance and accuracy before inclusion in the final seed vocabulary set. This refined set is both precise and contextually appropriate, providing a robust basis for the synonym discovery stage, which further enriches the lexicon and enhances its utility for intelligent systems in urban power grid design.
2.4. Synonym Mining Algorithm
Large Language Models (LLMs) are a significant achievement in the field of natural language processing, capable of understanding and generating high-quality natural language text. However, directly using LLMs to obtain domain-specific synonyms carries certain risks, such as model hallucinations, where the model may generate seemingly reasonable but actually inaccurate or irrelevant terms. To address these issues, prompt engineering has been proposed as an effective strategy [25]. We propose an algorithm (see Algorithm 1) that improves the quality of the output and enhances the controllability and accuracy of the results.
For each seed vocabulary, we set up a synonym mining prompt to guide the LLM to output new terms that are semantically equivalent to the seed vocabulary. By designing multi-level prompt templates, we can guide the LLM to perform semantic reasoning for terminology. To avoid errors caused by model hallucinations, we also design a self-correction mechanism to ensure semantic consistency through secondary verification. The structured prompt design in this study includes the following components:
Prompt = Prefix + Instruct + Examples + Input

where
Prefix: A fixed preamble that sets the model’s role and context.
Instruct: Detailed instructions on the specific task requirements, including task descriptions and desired output formats.
Examples: Specific examples of the task to control the LLM’s responses.
Input: The text to be extracted.
The extraction process primarily involves two prompts:
(1) Synonym Mining Prompt
This template is designed to guide the LLM to find synonyms for the given seed vocabulary. Considering that some terms may not have synonyms, the Input allows the model to output “无” (none) when no synonyms are found. This module uses examples such as “多元件绝缘子” (multi-element insulator) and “复合绝缘元件” (composite insulation component) to guide the model in recognizing synonym relationships between vocabulary.
(2) Self-Correction Mechanism Prompt
This module is motivated by the potential for errors or inconsistencies in the generated outputs from the LLM. We propose a self-correction strategy to encourage the model to self-verify its generated synonyms, thereby reducing errors. The Instruct in the self-correction mechanism prompt guides the LLM to read the generated vocabulary and compare it semantically with the seed vocabulary. If the generated term is consistent with the seed vocabulary, the model outputs “Yes”; otherwise, it deletes the inconsistent term and outputs the remaining valid terms.
The processing flow of the algorithm is shown in Figure 3. The various parts of the prompt are shown in Table 2.
The pseudo-code for the algorithm is as follows:

Algorithm 1 Synonym mining enhanced by LLM.
Input: Seed vocabulary
Output: Collection of synonyms for this vocabulary

1.  function synonym_mining(seed_word)
2.      Prompt = Prompt: Synonym mining
3.      output_word_list = llm(Prompt)
4.      if output_word_list is empty then
5.          output_word_list = [“None”]
6.      end if
7.      return output_word_list
8.  end function
9.  function self_correction(word_list, seed_word)
10.     verified_words = []
11.     foreach word in word_list do
12.         Prompt = Prompt: Self correction
13.         verification_result = llm(Prompt)
14.         if verification_result == “Yes” then
15.             verified_words.append(word)
16.         end if
17.     end foreach
18.     return verified_words
19. end function
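A runnable Python sketch of Algorithm 1 follows; the call_llm wrapper is hypothetical (one concrete form is sketched in Section 3.2), and the Chinese prompt strings are compressed stand-ins for the full templates in Table 2.

```python
# Sketch of Algorithm 1. `call_llm` is a hypothetical wrapper around the chosen
# LLM API; replace it with a real client before running.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wrap your LLM API here")

def synonym_mining(seed_word: str) -> list[str]:
    # Synonym mining prompt: ask for semantically equivalent power-design
    # terms, or "无" (none) when no synonym exists.
    prompt = f"列出电力设计领域中与“{seed_word}”语义等价的术语，没有则输出“无”。"
    answer = call_llm(prompt).strip()
    if answer in ("", "无"):
        return ["None"]
    return [w.strip() for w in answer.split("、") if w.strip()]

def self_correction(word_list: list[str], seed_word: str) -> list[str]:
    # Self-correction prompt: keep a candidate only if the model re-verifies it.
    verified_words = []
    for word in word_list:
        prompt = (f"判断“{word}”与“{seed_word}”在电力设计语境下是否语义一致，"
                  f"一致输出Yes，否则输出No。")
        if call_llm(prompt).strip() == "Yes":
            verified_words.append(word)
    return verified_words
```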
3. Results
3.1. Seed Vocabulary Mining
In this section, the experimental data were derived from national and industry design codes and standards. Given that the raw corpus contains a large number of stop words, conjunctions, and adjectives, we used the Chinese Jieba segmentation tool to tokenize the text and obtain the corresponding part-of-speech tags. During seed vocabulary mining, we extracted a total of 5623 domain-specific phrases. After filtering with the mutual information and entropy-based methods, 2196 invalid phrases such as “验收情况” (acceptance situation), “按照国家” (according to the country), and “值及以上” (value and above) were removed. The remaining 3426 phrases were used as seed vocabulary for synonym mining, forming an initial domain lexicon of 3426 entries.
3.2. Parameter Settings
We utilized the Qwen-Max Large Language Model (LLM) API for our experiments. The relevant parameters are summarized in Table 3.
The parameters were selected to balance the accuracy and efficiency of synonym mining. We chose a top_p value of 0.8 because it reduces irrelevant or low-quality output while preserving the diversity of the generated text, allowing the model to explore multiple possible terminology expressions while maintaining accuracy. To further improve the accuracy of synonym matching, a low temperature value (0.1) was selected to reduce the uncertainty of the model’s generation and bias the output toward high-probability words, thereby enhancing the relevance and accuracy of the results. Since the responses requested in this study are relatively simple, involving only synonymous entities or self-correction results, there are no special requirements for max_token, and the model default value of 2000 was adopted to ensure sufficient generation space.
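As a concrete (but assumed) form of the call_llm wrapper used in the Algorithm 1 sketch, the snippet below passes the Table 3 parameters through Qwen-Max's OpenAI-compatible endpoint; the base URL and environment-variable name reflect a common DashScope deployment rather than anything stated in the paper.

```python
import os
from openai import OpenAI

# Assumed deployment: Qwen-Max behind DashScope's OpenAI-compatible endpoint.
client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

def call_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="qwen-max",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,   # low temperature: favor high-probability terms
        top_p=0.8,         # nucleus sampling: prune low-quality continuations
        max_tokens=2000,   # default generation budget; outputs here are short
    )
    return response.choices[0].message.content
```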
3.3. Results of Synonym Mining and Self-Correction
The synonym mining process began with the 3425 seed terms. After applying the LLM-enhanced synonym mining algorithm, the vocabulary was expanded to 11,265 terms, with Table 4 showcasing some typical examples from the synonym mining. Following the synonym mining phase, the self-correction stage commenced, during which unsuitable synonyms were removed, reducing the total from 11,265 to 10,745. The corrected synonym results are displayed in Figure 3, with each seed term yielding 2–3 mined synonyms on average. A frequency distribution histogram is provided in Figure 4.
A typical example is “直接接触” (direct contact). During synonym mining, the model suggested synonyms like “密切接触” (close contact), “表面接触” (surface contact), and “直接接触传播” (direct contact transmission). However, in the context of power design, “直接接触” specifically refers to direct contact with live parts leading to electric shock. Thus, during the self-correction process, semantically irrelevant synonyms were removed.
3.4. Semantic Accuracy Evaluation
To ensure the semantic accuracy of the synonym mining results, this study employs a combination of cosine similarity calculations based on word embeddings and expert validation. This dual approach evaluates the generated synonyms from both automated assessment and domain knowledge perspectives.
(1) Cosine Similarity Analysis Based on Word Embeddings
We use a domain-fine-tuned pre-trained BERT model to vectorize the 10,745 pairs of synonyms (seed vocabulary and their synonyms) obtained after self-correction. The cosine similarity between each pair is calculated using the following formula:

cos(A, B) = (A ⋅ B) / (‖A‖ ‖B‖)

where
A⋅B: Dot product of the vectors for terms A and B.
‖A‖ and ‖B‖: Magnitudes (norms) of the vectors for terms A and B.
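A minimal sketch of this evaluation follows, substituting the public bert-base-chinese checkpoint for the domain-fine-tuned model used in the paper and mean-pooling the last hidden states into term vectors.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Term vectors via mean-pooled BERT hidden states; the public checkpoint below
# is a stand-in for the paper's domain-fine-tuned model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

def embed(term: str) -> np.ndarray:
    inputs = tokenizer(term, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # [1, seq_len, 768]
    return hidden.mean(dim=1).squeeze(0).numpy()     # mean-pooled term vector

def cosine_similarity(a: str, b: str) -> float:
    va, vb = embed(a), embed(b)
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

print(cosine_similarity("绝缘子", "复合绝缘元件"))    # expected to be high
```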
The resulting cosine similarity histogram for the synonym pairs is shown in Figure 5.
The results show that the average cosine similarity of all word pairs is 0.86 (standard deviation = 0.12), and the proportion of similarity ≥ 0.75 is 84.3%, indicating that synonyms are highly consistent in the semantic space. For example, “绝缘子” (insulator) and “复合绝缘元件” (composite insulation component) have a similarity score of 0.91. In contrast, the incorrect pairing of “直接接触” (direct contact) and “表面接触” (surface contact) has a similarity score of only 0.32, consistent with the manual correction results.
(2) Expert Validation
Three experts with over 10 years of experience in power design were invited to validate a random sample of 100 synonym pairs. The criterion for acceptance was that the synonyms should be interchangeable in engineering contexts without introducing semantic ambiguity.
The results are shown in Table 4. Across the random sample of 100 synonym pairs, the three experts made 300 judgments in total and accepted 268 of them, an expert validation accuracy rate of 89.3%. This indicates the effectiveness of our method in generating accurate and contextually appropriate synonyms.
3.5. Analysis of Dictionary Effectiveness and Sensitivity
To validate the effectiveness of the domain-specific lexicon built in this study, we evaluated the quality of the dictionary based on the accuracy of domain document recognition. The experimental dataset used in this section consists of 1000 corpus entries related to power design (independent of the corpus described in Section 2.1). The evaluation compares the performance of a Support Vector Machine (SVM) model with and without the domain dictionary loaded into the custom dictionary of the Jieba tokenizer. The specific experimental steps are as follows:
(1) The domain dictionary was loaded into the custom dictionary of Jieba for the experimental group, while the control group did not include the domain dictionary. Both groups underwent word segmentation and stop word removal, following the same procedures outlined in Section 2.1.
(2) The dataset was split into a training set (700 entries) and a test set (300 entries) at a ratio of 7:3.
(3) The TF-IDF algorithm was applied to construct a text feature matrix for the corpus, which was then input into the SVM model for training to generate a text classifier.
(4) The test set was used to evaluate the performance of the classifier, with accuracy, recall, and F1-score serving as the evaluation metrics.
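Steps (1)–(4) can be sketched as follows, assuming docs and labels hold the 1000 annotated entries and domain_dict.txt is the lexicon exported in Jieba's user-dictionary format; the file name and the SVC defaults are assumptions.

```python
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Experimental group: load the constructed lexicon as a Jieba user dictionary
# (one term per line); omit this call for the control group.
jieba.load_userdict("domain_dict.txt")

def tokenize(text: str) -> str:
    return " ".join(jieba.cut(text))   # space-joined tokens for TfidfVectorizer

def evaluate(docs: list[str], labels: list[int]) -> None:
    x_train, x_test, y_train, y_test = train_test_split(
        [tokenize(d) for d in docs], labels, test_size=0.3, random_state=42)
    vectorizer = TfidfVectorizer()                       # step (3): feature matrix
    clf = SVC().fit(vectorizer.fit_transform(x_train), y_train)
    y_pred = clf.predict(vectorizer.transform(x_test))   # step (4): evaluation
    print(classification_report(y_test, y_pred))         # precision / recall / F1
```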
The experimental results demonstrate that the performance of the text classifier improved significantly after loading the domain dictionary. The detailed results are presented in Table 5.
It can be observed that after loading the domain-specific dictionary, the classifier’s accuracy improved by 8.6%, recall by 9.8%, and F1-score by 9.2%. This confirms the effectiveness of the domain dictionary in enhancing the accuracy of text classification.
To further validate the rationality of the seed vocabulary size, we adjusted the number of seed terms and monitored changes in the F1-score. The specific experimental design is as follows:
We adjusted the high-frequency threshold (W(t) ≥ 0.85 × max(W)) applied to the TF-IDF results described in Section 2.2. For the control groups, we experimented with thresholds of 0.8, 0.825, 0.875, and 0.9, which resulted in seed vocabularies of 4567, 3996, 2854, and 2284 terms, respectively. At each seed vocabulary size, we repeated the text classification experiments described above, recording Precision (P), Recall (R), and F1-score. The experimental results are illustrated in Figure 6.
The F1-score peaked at 93.1% when the seed vocabulary totaled 3425 terms. Raising the threshold to reduce the seed vocabulary led to a significant decline in the F1-score (dropping to a minimum of 91.1%), indicating inadequate vocabulary coverage. Conversely, lowering the threshold to enlarge the seed vocabulary left the F1-score essentially stable. The current seed vocabulary total (3425) therefore achieves an optimal balance in the F1-score, ensuring adequate dictionary coverage while avoiding excessive consumption of computational resources.
Through text classification experiments and sensitivity analysis, we have validated the effectiveness of the domain-specific dictionary constructed in this study for improving the accuracy of document recognition in the field of power design, providing strong support for the practical application of the lexicon in engineering contexts.
3.6. Summary of Results
Combining both automated evaluation and expert validation, we can conclude that our framework achieves a high level of semantic accuracy suitable for practical engineering applications. Specifically,
Automated Evaluation: The average cosine similarity of 0.86 and the high proportion (84.3%) of pairs with similarity scores ≥ 0.75 demonstrate strong semantic consistency among the generated synonyms.
Expert Validation: An accuracy rate of 89.3% confirms the reliability of the synonym generation process, validating the effectiveness of the LLM-guided iterative optimization mechanism.
Dictionary Effectiveness and Sensitivity Analysis: Loading the domain dictionary significantly improved the classifier’s performance, with accuracy increasing by 8.6%, recall by 9.8%, and F1-score by 9.2%. The sensitivity analysis further confirmed that the current seed vocabulary total (3425) achieves an optimal balance in the F1-score.
These findings support the robustness and applicability of the constructed lexicon in the field of urban power grid design. The integration of advanced NLP techniques with domain expertise ensures that the final lexicon is both comprehensive and precise, providing a solid foundation for intelligent systems in this specialized domain.
4. Conclusions
4.1. Research Summary
This study addresses the challenges of low term coverage, semantic ambiguity, and outdated updates in constructing a lexicon for urban power grid design. We propose a multi-level terminology extraction and synonym expansion framework based on Large Language Models (LLMs). By building a structured corpus tailored to the power domain and integrating an improved mutual information and adjacency entropy filtering algorithm, we achieve precise extraction of high-quality seed vocabulary. Innovatively, we leverage the semantic understanding capabilities of LLMs and introduce a self-correction mechanism to significantly enhance the accuracy and efficiency of synonym mining. Experimental results demonstrate that this method effectively constructs a comprehensive and semantically consistent lexicon for power grid design, providing robust support for the semantic parsing capabilities of intelligent design systems.
The novelty of this study lies in the following:
In terms of methodology, this study proposes a three-level collaborative architecture of “statistical filtering + LLM semantic extension + self-correction”, which realizes the complementary advantages of statistical methods and LLM in the field of power design.
In terms of achievements, this study has constructed the first terminology dictionary in the field of power design that integrates compound words and named entities, forming the foundation of a smart grid design system, and its engineering practicality has been validated.
4.2. Main Conclusions
1. Enhanced Effectiveness of Domain Lexicon Construction with LLMs:
By designing structured prompt templates and a self-correction mechanism, LLMs exhibit strong contextual reasoning capabilities in synonym mining. The average cosine similarity of generated term pairs is 0.86, and the human validation accuracy exceeds 89%, indicating high reliability and precision.
2. Improved Engineering Practicality:
The constructed lexicon covers 3426 core seed terms and 10,745 synonyms, substantially addressing the existing gaps in power-related dictionaries. This supports practical applications such as building domain knowledge graphs for better information retrieval, improving smart Q&A systems with more accurate query interpretation, and enabling semantic parsing in automated design systems for precise and efficient design generation, thereby providing a solid foundation for the practical application of intelligent design systems in engineering scenarios.
4.3. Limitations and Future Directions
1. Corpus Limitations:
The current corpus primarily relies on national and industry standard texts, lacking actual engineering case studies and dynamically updated design documents. Future work should extend the corpus to include multi-source heterogeneous data to enhance the real-time relevance and adaptability of the lexicon to various scenarios.
2. Model Generalization Challenges:
The current approach has a strong dependency on specific LLM architectures, requiring further validation of its transferability across different LLM frameworks. Exploring lightweight model deployment solutions can help improve scalability and efficiency.
3. Insufficiencies in Semantic Depth Understanding:
Current synonym verification primarily relies on word vector similarity and manual checks. Future work could integrate knowledge graphs specific to the power domain to build a multimodal semantic association network, thereby enhancing the fine-grained parsing of term relationships.