Article

Entropy-Optimized Dynamic Text Segmentation and RAG-Enhanced LLMs for Construction Engineering Knowledge Base

Haiyuan Wang, Deli Zhang, Jianmin Li, Zelong Feng and Feng Zhang
1 CABR Testing Center Co., Ltd., Beijing 100013, China
2 China Academy of Building Research Co., Ltd., Beijing 100013, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(6), 3134; https://doi.org/10.3390/app15063134
Submission received: 13 February 2025 / Revised: 11 March 2025 / Accepted: 11 March 2025 / Published: 13 March 2025
(This article belongs to the Special Issue Natural Language Processing in the Era of Artificial Intelligence)

Abstract

In the field of construction engineering, extensive technical standards and specifications (e.g., the GB/T and ISO series) evolve dynamically and permeate the entire lifecycle of design, construction, and operation–maintenance. These standards require continuous version iteration to adapt to technological innovations, and engineers require specialized knowledge bases to assist in understanding and updating them. The advancement of large language models (LLMs) and Retrieval-Augmented Generation (RAG) technologies provides robust technical support for constructing domain-specific knowledge bases. This study developed and tested a vertical domain knowledge base construction scheme based on RAG architecture and LLMs, comprising three critical components: entropy-optimized dynamic text segmentation (EDTS), vector correlation-based chunk ranking, and iterative optimization of prompt engineering. The EDTS method ensures information clarity and predictability within limited chunk lengths; the 10 most relevant chunks are then selected to form prompts for input into LLMs, enabling efficient retrieval of vertical domain knowledge. Experimental validation using Qwen-series LLMs with a test set of 101 expert-verified questions from a Chinese construction industry standard demonstrates an overall test accuracy of 76%. Comparative experiments across model scales (1.5B, 3B, 7B, 14B, 32B, and 72B) quantitatively reveal the relationship between model size, answer accuracy, and execution time, providing decision-making guidance for computational resource-accuracy tradeoffs in engineering practice.

1. Introduction

In the realm of construction engineering, the established standards and specifications epitomize the industry’s extensive research findings and accumulated practical experience. These norms serve as a fundamental basis for ensuring the meticulous design, execution, and final inspection of building projects [1]. Furthermore, they constitute a critical reference point for subsequent operation and maintenance activities. Adherence to these guidelines not only safeguards the structural integrity and safety of constructions but also optimizes their operational efficiency and longevity. Hence, such standards are indispensable for fostering excellence within the construction engineering domain [2,3]. Because construction engineering spans a wide range of knowledge, the sheer number of relevant norms and their frequent updates make it challenging for engineers to master all existing standards and specifications. Against this background, the advent of the era of artificial intelligence (AI) offers a way to address this problem. As one of the signature achievements of AI technology, large language models (LLMs) are an ideal vehicle for knowledge services in the field of construction engineering. The applications of LLMs in construction engineering primarily encompass standard interpretation [4], design optimization [5], intelligent Q&A support [6], result prediction [7], risk control [8], and construction management [9], significantly enhancing efficiency, reducing costs, and driving industry innovation. Although some progress has been made in these applications, shortcomings remain in specific application scenarios. The field of construction engineering is notably characterized by a high degree of professionalism and an intricate standard framework, requiring practitioners to possess profound professional knowledge and strictly adhere to numerous standards. In addition, specific applications frequently require supplementary contextual information, which often leads to poor performance of general LLMs in those contexts.
To address these issues, researchers have proposed various technical solutions, among which Retrieval-Augmented Generation (RAG) stands out as an effective approach. By integrating a retrieval mechanism with generative models, RAG enables LLMs not only to respond based on learned knowledge but also to retrieve the latest information from external databases in real time, thereby providing more accurate and up-to-date answers. However, within the RAG framework, traditional static text segmentation (such as dividing texts into fixed lengths based on paragraphs or word counts) can lead to semantic fragmentation, adversely affecting retrieval accuracy. To address this issue, the entropy-optimized dynamic text segmentation (EDTS) method is proposed, which optimizes the segmentation process to maintain semantic integrity while improving retrieval efficiency. This is particularly beneficial for ensuring high-quality information retrieval and enhancing overall system performance. In response to the need for knowledge bases that incorporate the latest standards and regulations in fields such as construction engineering, this paper introduces a methodology for constructing a domain-specific knowledge base using LLMs and RAG technology. The approach employs the EDTS method, which adaptively determines chunk boundaries based on information entropy.
The main contributions include:
  • Proposing the use of RAG technology to enhance the ability of LLMs to handle construction engineering standards and to facilitate the cost-effective and efficient establishment of a specialized knowledge base from standard texts.
  • Presenting the entropy-optimized dynamic text segmentation (EDTS) method for standard texts to ensure the clarity and predictability of information within chunks of limited length.
  • Testing the accuracy of LLMs enhanced with domain-specific knowledge texts and revealing the relationship between model size, response accuracy, and execution time, providing a decision-making basis for the trade-off between computational power and precision in engineering practices.

2. Related Work

2.1. Large Language Model

The large language model (LLM), a significant breakthrough in the field of natural language processing (NLP), is a high-performance model trained on massive text data using deep learning technologies, particularly the Transformer architecture [10]. Compared with traditional NLP models, LLMs exhibit enhanced capabilities in understanding and generating natural text, demonstrate a certain level of logical thinking and reasoning, and achieve or approach human-level text generation. The development of LLMs began with early models such as Recurrent Neural Networks (RNNs) and their variants, and took a significant leap forward in 2017 with the introduction of the Transformer architecture, especially its self-attention mechanism [11]. Subsequently, models such as BERT and GPT achieved remarkable performance improvements across various NLP tasks through pre-training on large datasets [12]. In recent years, model scale has continued to grow, with models like GPT-3 showcasing breakthrough capabilities in few-shot learning and generation [13].
However, constructing and deploying LLMs requires substantial amounts of data, and the high costs of model training and deployment have become significant barriers to their widespread adoption [14]. To lower the barriers that LLM training poses for individual researchers and small teams, some leading technology companies and research institutions have begun to open-source their model architectures, pre-trained weights, and related tools and libraries. This initiative significantly lowers the entry barrier, allowing more researchers and developers to access and utilize these advanced resources for their own projects. Table 1 gives an overview of major LLMs, including developer, release date, maximum parameter count, and other key features.
For specialized fields, general LLMs do not take specific domain knowledge into account during training, so their answers to professional questions are relatively weak. Through well-crafted prompts, LLMs can be guided to prioritize domain-specific knowledge in their responses, thereby enhancing their applicability in professional contexts. This approach enables users to tailor the model’s output by inputting specific contextual information or problem frameworks, ensuring alignment with industry-specific requirements and terminology, which ultimately improves the professionalism and accuracy of the generated answers.

2.2. Prompt Engineering

Prompt engineering is a technique that guides LLMs to generate more accurate and relevant outputs by designing specific input instructions (prompts), with its core focus on optimizing the interaction between models and tasks [21]. In the early stage, it evolved from simple directives (e.g., “Summarize the following text”) to advanced strategies like chain-of-thought reasoning and few-shot learning, later integrating domain-specific knowledge enhancement and multimodal fusion [22]. Current research in prompt engineering focuses on automating prompt generation [23], enhancing robustness via adversarial optimization [24], and aligning outputs with ethical constraints [25], while leveraging frameworks like LangChain [26] for domain-specific adaptation.
The composition of a generic prompt template can be delineated into four principal components: prefix, instruct, examples and input [27]. The prefix is fixed during usage and is used to set the model’s role. The instruct includes specific task-related instructions. The examples are illustrations of the specific task; if no examples are provided, it indicates that the entire task is zero-shot. The input refers to the text to be processed.
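As a minimal illustration, the four components can be assembled as follows; the build_prompt helper and its wording are illustrative, not a fixed API:

```python
# A minimal sketch of assembling the four prompt components from [27].
# The helper name and field wording are assumptions for illustration.

def build_prompt(prefix: str, instruct: str, examples: list[str], input_text: str) -> str:
    """Concatenate the four components into a single prompt string.

    An empty examples list corresponds to a zero-shot task.
    """
    parts = [prefix, instruct]
    parts.extend(examples)          # omitted entirely for zero-shot tasks
    parts.append(input_text)
    return "\n\n".join(p for p in parts if p)

prompt = build_prompt(
    prefix="You are a construction engineering standards expert.",
    instruct="Summarize the following clause in one sentence.",
    examples=[],                    # zero-shot: no worked examples provided
    input_text="<clause text to be processed>",
)
```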
Furthermore, the following three aspects should be paid attention to in the application of prompt engineering:
  • Clarify task objectives. To enhance model performance in specialized tasks, it is crucial to explicitly assign roles and define granular actions, thereby narrowing the model’s focus and reducing ambiguity in open-ended tasks. Additionally, breaking complex objectives into sub-steps through task decomposition aligns outputs with domain-specific workflows, enhancing coherence and relevance in the execution of complex tasks.
  • Provide sufficient context. To ensure responses are grounded in verified knowledge and minimize hallucination, it is essential to employ domain anchoring by embedding authoritative references or curated data snippets. Furthermore, dynamic context enrichment can be achieved by incorporating real-time updates or scenario-specific examples, such as few-shot learning, allowing the outputs to adapt effectively to evolving professional requirements.
  • Set output requirements. To ensure usability in downstream applications such as reports or code generation, it is important to enforce structural constraints through the use of specific formats and validation rules.
In the application of LLMs, prompt engineering has significantly improved the controllability and professionalism of model outputs. However, it still has several main shortcomings:
  • Designing efficient prompts requires users to have a deep understanding of task objectives, domain knowledge, and model characteristics, which makes it difficult for ordinary users to quickly master.
  • Current prompt design relies on empirical trial and error, lacking a unified methodology, which results in poor reusability across different domains.

2.3. Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) integrates external knowledge retrieval with generative language models, enabling dynamic access to up-to-date or domain-specific information during text generation [28]. Within the RAG framework, a given query or task input, along with its associated contextual knowledge, is first processed by a retrieval component to extract the most relevant information fragments from a predefined knowledge base or document collection. Subsequently, these retrieved fragments are integrated as supplementary contextual inputs into the prompt to guide the final response generation process. Key features of RAG include:
  • Reduces hallucinations by grounding responses in retrieved evidence. RAG mitigates hallucinations, i.e., incorrect or fabricated outputs, by directly incorporating verified external data (e.g., documents, databases) into the generation process, ensuring responses align with factual sources.
  • Supports knowledge updates without retraining. Unlike traditional LLMs that require costly retraining to integrate new information, RAG dynamically accesses updated knowledge bases (e.g., latest research papers or news archives), enabling real-time adaptation without modifying core model parameters. This mechanism allows users to seamlessly incorporate their private data or regulatory updates, ensuring compliance and relevance [29].
  • Enhances multi-domain adaptability via modular retrieval systems. RAG’s architecture decouples retrieval and generation modules, allowing domain-specific retrievers (e.g., legal databases, engineering standards) to be swapped based on task requirements, improving cross-domain versatility [30].
RAG’s output quality is heavily dependent on the performance of its retrieval module, and some irrelevant or noisy retrieved documents (e.g., due to semantic mismatches) may lead to inaccurate or misleading responses generated from erroneous contexts. Moreover, to maintain the timeliness and relevance of information, it is necessary to continuously update the external knowledge base, which also incurs additional maintenance costs. These factors collectively pose certain challenges for the practical application of RAG technology [31,32].

3. Methodology

3.1. Domain Knowledge Base Framework

Although general-purpose LLMs are trained on vast textual data covering diverse topics, they lack specialized expertise in any specific technical domain. To address the need for in-depth comprehension and practical application knowledge in professional domains such as construction engineering, this study establishes a RAG-based domain-specific knowledge repository, the framework of which is illustrated in Figure 1. By compiling industry-specific standards and specifications within the construction engineering domain into an external knowledge resource, this approach mitigates the limitations of general-purpose LLMs in delivering precise, contextually grounded outputs for specialized technical tasks.
The entire framework comprises two stages: the construction process and the application process. The construction process involves establishing a professional knowledge retrieval database, enabling the dynamic updating of domain knowledge. The application process refers to the user’s utilization phase, where specific tasks are accomplished by integrating professional knowledge with LLMs. These two processes operate independently and can run asynchronously. The specific steps for the construction process are as follows:
  • Data Collection. For a knowledge base in the field of construction engineering, it is necessary to systematically gather and organize relevant standards, regulations, technical guidelines, and best practice cases. This includes, but is not limited to, national and local building codes, industry standards, and internationally recognized best practice guidelines, ensuring comprehensiveness and authority of the collected materials.
  • Document Segmentation. Preprocess the gathered materials, including steps such as file parsing and document segmentation, to facilitate subsequent information retrieval. During this process, original documents are divided into multiple smaller text chunks, enhancing the efficiency and accuracy of information retrieval.
  • Vectorization. Convert each text chunk into vector representations, enabling similarity matching with user queries by calculating the similarity between vectors. This process facilitates the provision of highly relevant information to users based on their queries.
  • Vector Storage. Store these external knowledge text chunks as vectors within a specially designed vector database to enable rapid retrieval and efficient management. This approach effectively supports real-time information queries based on content similarity, thereby enhancing the professionalism and precision of system responses.
The application process involves the following steps:
  • Question Vectorization. When a user poses a question, the first step is to convert this question into a vector. This transformation enables similarity matching within the vector space, preparing the query for subsequent retrieval processes.
  • Semantically Related Chunks. Using semantic search, the system identifies and retrieves the chunk vectors closest to the query vector and then converts these vectors back into their original text.
  • Context Integration into Prompt. By integrating the user’s question and the retrieved contextual text into the prompt, the system ensures that the generated response is grounded in precise and relevant professional knowledge.
  • Professional Response. The question, along with its contextual text, is fed into an LLM. Leveraging the model’s inherent language generation capabilities, it produces a response that meets professional requirements.

3.2. Entropy-Optimized Dynamic Text Segmentation

In RAG systems, when external knowledge, particularly documents, is vectorized and stored, it is typically segmented into knowledge chunks. These chunks serve as the fundamental units for retrieving contextual knowledge from the vector database in subsequent stages. Consequently, the size of these knowledge chunks significantly impacts the quality of retrieval and response generation. Smaller chunks capture more specific semantic meanings but carry less contextual information, potentially causing critical information to be absent from the top retrieved chunks. Larger chunks preserve more complete contextual coherence but may introduce semantic dilution during vectorization, where fine-grained details are obscured by the broader context [33]. Thus, there is no universally optimal strategy for text segmentation. The choice depends on balancing trade-offs between specificity and context retention, requiring analysis and experimentation tailored to the specific application scenario. Table 2 provides examples of several common text segmentation methods.
This study proposes a text segmentation method named entropy-optimized dynamic text segmentation (EDTS), which dynamically partitions a text into semantically coherent chunks by minimizing local information entropy. This method ensures that each segmented unit exhibits self-contained informational completeness while maintaining contextual independence, thereby optimizing downstream tasks such as retrieval and semantic analysis.
Entropy is a measure in thermodynamics that expresses the degree of disorder in a system. Shannon introduced entropy into information theory to represent the uncertainty of an information source, where it serves as a measure of the uncertainty associated with the occurrence of a random event [39]. The definition of entropy is as shown in Equation (1).
H(X) = -\sum_{x \in X} p(x) \log_2 p(x),   (1)
where H(X) denotes the Shannon entropy of the random variable X, X is the set of possible values of a random variable, and p(x) is the probability that the random variable X takes the value x.
In text analysis, entropy can reflect the complexity, information redundancy, or regularity of linguistic structures within a text. While there is a relationship between text length (i.e., the number of symbols in the text) and entropy, it is not a simple linear correlation; instead, it depends on the text’s statistical properties, linguistic structure, and contextual dependencies [40]. Generally, higher entropy corresponds to greater uncertainty and richer information content in the text, whereas lower entropy indicates higher predictability and increased redundancy [41].
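As a quick illustration, Equation (1) can be computed directly from a word list; the whitespace tokenizer below is an assumption for an English example, whereas the paper segments Chinese text with jieba:

```python
# A direct implementation of Equation (1): Shannon entropy of a text's
# empirical word distribution.
import math
from collections import Counter

def shannon_entropy(words: list[str]) -> float:
    counts = Counter(words)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Repeated words raise predictability and lower entropy.
print(shannon_entropy("the test area shall not exceed the limit".split()))
```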
The core methodology in this study involves:
  • Sentence as core unit: utilizing sentences as foundational semantic units to preserve intrinsic coherence.
  • Dynamic context expansion: bidirectionally extending contextual windows (forward/backward) from each sentence boundary through a controlled growth mechanism.
  • Entropy-driven boundary detection: Iteratively identifying optimal segmentation points by locating lexical positions with minimal local entropy within the expanded window. Equation (2), which calculates the local information entropy, is defined as follows:
H(W \mid c_i) = -\sum_{w \in W} p(w \mid c_i) \log_2 p(w \mid c_i),   (2)
where W represents the set of words within a core unit, and c_i represents the set of expanded words extended to the boundary position i.
  • Entropy reduction: Systematically minimizing the entropy of the entire core statement by dynamically adjusting chunk boundaries, ensuring maximally self-contained information units. As shown in Equation (3), the candidate position i_b that minimizes the conditional entropy is selected as the segmentation boundary.
i_b = \arg\min_i H(W \mid c_i),   (3)
Using the chain rule of entropy, the conditional entropy can be decomposed into:
H(W \mid c_i) = H(W, c_i) - H(c_i),   (4)
Equation (4) establishes the computational relationship between the conditional entropy and the joint entropy.
The specific implementation algorithm of EDTS is shown in Algorithm 1. The input parameter content represents the entire read-in text, and the output parameter chunks represents the segments of the text after splitting.
Algorithm 1. EDTS Algorithm
inputs: content
outputs: chunks
1. Read in the entire text content;
2. Count the frequency of each word to form a word probability dictionary, where the key is the word and the value is the probability;
3. Use periods as delimiters to split the text into sentences, forming the set S = [s1, s2, s3, …, sm];
4. Assume the current calculation is for sentence sj, the preceding sentence is sj−1, and the following sentence is sj+1;
5. Split sj into a word list W = [w1, w2, w3, …, wn], using jieba segmentation;
6. Calculate the right boundary, use jieba segmentation to split sj+1 into a word list C = [c1, c2, c3, …, ck];
7. Calculate the joint entropy H(W, c1);
8. Calculate the entropy H(c1);
9. Calculate the conditional entropy H(W|c1) according to Equation (4);
10. Repeat steps (7) to (9), expanding word by word to the right, to obtain the list of conditional entropies for the right boundary, Hr;
11. According to Equation (3), the boundary position on the right is determined by ir = arg min(Hr);
12. Repeat steps (5) to (11) to obtain the left boundary position il by expanding word by word to the left;
13. Form a word list for a chunk with the starting position at the left boundary il and the ending position at the right boundary ir;
14. Repeat steps (4) to (13) to traverse the sentences in S sequentially, wherein il in the first sentence is the first word of the entire text, and ir in the last sentence is the last word of the entire text;
15. Form a chunk by taking the word at position il as the starting word and the word at position ir as the ending word;
16. Traverse all statements in S and form the final output chunks.
EDTS aims to divide external knowledge texts into independent chunks. Using sentences as the central units, it dynamically expands the context window to the left and right. The optimal segmentation boundaries are determined by calculating the conditional entropy; the schematic diagram is shown in Figure 2. Specifically, the position with the minimum conditional entropy within the window is chosen as the segmentation point, ensuring that each text chunk has semantic completeness and high predictability. Finally, the text chunks are encoded as dense vectors and stored in a vector database.
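To make Algorithm 1 concrete, the following is a compact Python sketch under stated assumptions: entropies are taken over empirical word frequencies within each span (one plausible reading of steps 7–9), expansion is capped at the neighbouring sentence, and adjacent chunks may overlap under this reading. jieba performs word segmentation, as in the paper:

```python
# A compact sketch of Algorithm 1 (EDTS); details noted above are
# assumptions, not the authors' exact implementation.
import math
from collections import Counter
import jieba  # pip install jieba

def H(bag: list[str]) -> float:
    """Shannon entropy of a word multiset, Equation (1)."""
    counts, total = Counter(bag), len(bag)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def best_boundary(core: list[str], context: list[str]) -> int:
    """Expansion length i minimising H(W|c_i) = H(W,c_i) - H(c_i), Eqs. (3)-(4)."""
    best_i, best_h = 0, H(core)                      # i = 0: no expansion
    for i in range(1, len(context) + 1):
        c = context[:i]
        h = H(core + c) - H(c)                       # conditional entropy
        if h < best_h:
            best_i, best_h = i, h
    return best_i

def edts(content: str) -> list[str]:
    sentences = [s for s in content.split("。") if s.strip()]   # step 3
    tokenized = [list(jieba.cut(s)) for s in sentences]
    chunks = []
    for j, words in enumerate(tokenized):            # steps 4-13 per sentence
        left = tokenized[j - 1] if j > 0 else []
        right = tokenized[j + 1] if j + 1 < len(tokenized) else []
        il = best_boundary(words, left[::-1])        # expand leftwards
        ir = best_boundary(words, right)             # expand rightwards
        chunks.append("".join(left[len(left) - il:] + words + right[:ir]))
    return chunks
```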

3.3. Chunk Rank Based on Vector Correlation

Most semantic retrieval systems leverage pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers) [42] or GPT (Generative Pre-trained Transformer) [43,44] for text encoding. Trained on large-scale corpora, these models excel at capturing intricate semantic relationships between words and sentences, significantly enhancing retrieval accuracy and relevance. In the RAG system proposed in this study, both user queries and knowledge base text chunks are mapped into a shared high-dimensional vector space. This alignment enables chunk rank based on vector correlation, where the most relevant top-k chunks are filtered through semantic similarity metrics such as cosine similarity, ensuring precise contextual grounding for downstream generative tasks [45].
First, both the user query q and knowledge-associated text chunks C must be mapped into a unified vector space, ensuring that semantically similar texts reside in proximate regions within this space. Specifically, the query q and the text chunks C are encoded into vectors v_q and [v_{c_1}, v_{c_2}, \ldots, v_{c_i}] through independent pre-trained language models:
v_q = f_{\mathrm{query}}(q; \theta_q), \quad v_{c_i} = f_{\mathrm{chunk}}(c_i; \theta_c),   (5)
where f_{\mathrm{query}} and f_{\mathrm{chunk}} represent the query encoder and document encoder, while \theta_q and \theta_c are their respective parameters.
Subsequently, the most relevant text chunks are efficiently retrieved from the large-scale knowledge base based on vector similarity. This can be measured using cosine similarity, as defined in Equation (6):
\mathrm{sim}(q, c_i) = \cos(v_q, v_{c_i}) = \frac{v_q \cdot v_{c_i}}{\|v_q\| \, \|v_{c_i}\|},   (6)
This metric ensures scale-invariant comparison of vector orientations, effectively capturing semantic relevance between queries and chunks.
Finally, for all candidate text chunks c_i \in C, the system sorts them in descending order of similarity and returns the IDs of the k vectors with the highest similarity, as shown in Equation (7).
\mathrm{IDs}_{\mathrm{top}\text{-}k} = \underset{c_i \in C}{\operatorname{arg\,max}^{(k)}} \; \mathrm{sim}(q, c_i),   (7)
This process ensures that the retrieved chunks are both semantically relevant and contextually aligned with the user query, providing a robust foundation for downstream tasks such as knowledge-augmented generation.
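Equations (6) and (7) reduce to a few vectorised operations; the sketch below, using randomly generated vectors purely as stand-ins, is illustrative rather than the system's actual retrieval code:

```python
# Vectorised cosine similarity (Equation (6)) followed by top-k selection
# (Equation (7)). chunk_matrix has one chunk vector per row.
import numpy as np

def top_k_ids(v_q: np.ndarray, chunk_matrix: np.ndarray, k: int = 10) -> np.ndarray:
    sims = chunk_matrix @ v_q / (
        np.linalg.norm(chunk_matrix, axis=1) * np.linalg.norm(v_q)
    )                                            # Equation (6), per row
    return np.argsort(sims)[::-1][:k]            # Equation (7): k best row IDs

rng = np.random.default_rng(0)                   # random stand-in vectors
ids = top_k_ids(rng.normal(size=384), rng.normal(size=(100, 384)), k=10)
```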
After completing the top-k retrieval based on vector similarity, the system needs to map the high-dimensional vector indices back to the original text units to achieve semantic reconstruction of the knowledge context. This process is implemented through a bidirectional metadata mapping mechanism, the core of which lies in maintaining a strict correspondence between the vector space and the text space [46]. Each text chunk c_i is assigned a globally unique identifier \mathrm{id}_i when it is converted into v_{c_i}, and \mathrm{id}_i is stored in a structured database along with the following metadata:
\mathrm{Metadata}(c_i) = \{\mathrm{id}_i,\; c_i,\; \mathrm{source\_doc}\},   (8)
where source_doc provides information about the knowledge document.
Based on the returned vector indices \mathrm{IDs}_{\mathrm{top}\text{-}k} = [\mathrm{id}_1, \mathrm{id}_2, \ldots, \mathrm{id}_k] from the retrieval, the text reconstruction is achieved by looking up these indices in the database. This process involves retrieving the corresponding text chunks associated with each of these unique identifiers [47]. At this point, the top-k text chunks most relevant to the query q are obtained.
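A minimal sketch of this mapping, with field names taken from Equation (8) (the database layer itself is an assumption, reduced here to a dictionary):

```python
# Bidirectional metadata mapping: chunk id -> text and source document,
# so retrieved vector IDs can be converted back into readable context.
metadata_db: dict[int, dict] = {}

def register_chunk(chunk_id: int, text: str, source_doc: str) -> None:
    metadata_db[chunk_id] = {"id": chunk_id, "text": text, "source_doc": source_doc}

def reconstruct(top_k_ids: list[int]) -> list[str]:
    """Map retrieved vector IDs back to their original text chunks."""
    return [metadata_db[i]["text"] for i in top_k_ids]
```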

3.4. Iterative Optimization of Prompt Engineering

In RAG systems, prompt engineering serves as a critical component for optimizing generation quality. Its objective is to design efficient prompt templates that guide the model to synthesize retrieved contextual information and produce accurate, coherent responses. The design principles for prompts emphasize task specificity, descriptive granularity, and unambiguous instruction. A recommended methodology involves adopting a stepwise-prompting framework with iterative refinement. This approach begins with analyzing fundamental corpus units and initial testing using minimalistic prompts. Through continuous evaluation of LLM outputs, the prompts are dynamically adjusted to enhance expressiveness and operational efficacy [48,49,50]. In this study, the optimization of prompts follows a three-step progressive approach.
Step 1: Construct prompt templates that clearly define task requirements and effectively utilize retrieval context. Typical RAG prompts include the following elements:
  • Task Instruction: Clearly specify the type of generation task (e.g., question answering, summarization).
  • Retrieval Context: Insert the top-k relevant text chunks as references.
  • Query Statement: The user’s original input question or command.
  • Format Constraints: Specify the output format (e.g., JSON, bullet points).
A typical basic prompt template structure is shown in Figure 3. In the figure, the left part is the template, and the right part is an application case.
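For illustration, such a template can be rendered as a simple format string; the wording below is an assumption, and Figure 3 shows the version used in this study:

```python
# An illustrative rendering of the basic RAG prompt template; the exact
# wording used in the paper may differ (see Figure 3).
RAG_TEMPLATE = """[Task Instruction]
Answer the question using only the referenced standard excerpts below.

[Context]
{context}

[Question]
{query}

[Format Constraints]
Answer in one short paragraph and cite the relevant clause number."""

prompt = RAG_TEMPLATE.format(
    context="<top-k retrieved standard excerpts>",
    query="What should be the maximum area of a test area in square meters?",
)
```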
Step 2: Optimize the integration of retrieval context and query to enhance generation relevance. In the prompt, employing interpretable or suggestive markers in the [Context] to differentiate chunks enhances the model’s focus on the primary relevant chunk, as illustrated in Figure 4. Additionally, weights can be assigned to each context chunk based on retrieval similarity, guiding the model to pay more attention to highly relevant segments [51]. In the example shown in Figure 4, the [Context] includes the chapter in which the chunk text is located and a weight reflecting its relevance to the question, thereby guiding the LLM to focus on the relevant content.
Step 3: Adapt the generative model to task requirements through instruction tuning. This involves making instructions clear and specific, avoiding ambiguous commands (such as “Please answer”) in favor of concrete actions (such as “List three options”). Control the length of the output by specifying constraints in the instructions (e.g., “Summarize in one sentence”). Achieve domain adaptation by incorporating terminology constraints relevant to specialized fields (e.g., “Explain using construction engineering terms”). When necessary, include example input–output pairs within prompts to guide the model in mimicking format and logic, as shown in Figure 5.
Due to the variations in researchers’ habits and modes of expression, prompts in practical applications are not uniform, and the final prompt typically necessitates iterative refinement throughout the testing phase. This flexibility offers extensive scope for creative and exploratory tasks. However, it also mandates that researchers possess a deep comprehension of the specific tasks at hand and exhibit proficiency in meticulously calibrating prompts.
Through systematic prompt engineering strategies, the RAG systems can precisely leverage the retrieved contexts to generate high-quality and highly relevant textual outputs, effectively bridging the gap between knowledge retrieval and generative coherence.

4. Test Results and Discussion

4.1. Test Condition

The experimental evaluation of the RAG system in this study was conducted on an Inspur server with the following hardware configuration:
  • CPU: Intel Xeon Gold 6226R processor (Cascade Lake architecture, 2.9 GHz base frequency);
  • GPU: NVIDIA RTX A5000 with 24 GB GDDR6 memory;
  • System memory: 512 GB DDR4-2933 ECC RAM;
  • Windows 10 Professional Edition operating system.
The software environment was configured with Ollama v0.3.9 framework, implementing the Qwen 2.5 series large language model for retrieval-augmented generation tasks.
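For reference, a local Ollama deployment of this kind can be queried over its REST endpoint; the sketch below is a minimal illustration, and the endpoint and model tag should be checked against the installed version:

```python
# A minimal sketch of querying a local Ollama instance over its documented
# /api/generate REST route; verify against the installed Ollama version.
import json
import urllib.request

def ask_ollama(prompt: str, model: str = "qwen2.5:7b") -> str:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]  # generated answer text
```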
In this test, the standard “Technical specification for inspecting of concrete compressive strength by rebound method” (JGJ/T 23-2011) [52] was selected as the external reference material. This standard consists of 7 chapters, appendices A to F, and explanatory notes. Multiple experts, based on common situations encountered in their work, have proposed 101 questions and provided answers according to the corresponding sections of the aforementioned standard.
These 101 test questions can be divided into three categories:
  • Accurately answerable questions that allow direct judgment of correctness, such as: “What should be the maximum area of a test area in square meters?”. These questions have standardized numerical answers and are scored as either correct or incorrect.
  • Descriptive questions, such as: “What is a test area?”. The answers to these questions are evaluated through similarity assessments to assign a quantitative score.
  • Hybrid questions that combine descriptive elements with standardized numerical values. These require evaluating both the numerical accuracy and textual description.
The proportion distribution of the three question types is shown in Figure 6.
This study employs the nomic-embed-text model for embedded vector representation. Nomic-embed-text is an open-source, high-performing text embedding model that achieves state-of-the-art performance on the MTEB benchmark while supporting context lengths of up to 8192 tokens, offering transparency and reproducibility [53,54].

4.2. Test Result Evaluation Method

In the tests, corresponding rules are established to quantify the responses provided by the RAG system. The scoring ranges from 0 to 10, with 0 indicating an incorrect response and 10 indicating a fully correct response. Three different evaluation methods are developed for the three types of questions.
Type I: Directly compare the numeric values extracted from the generated response with those in the standard answer. A perfect match earns 10 points, while any discrepancy results in 0 points.
Type II: Use similarity calculation methods to assign a quantitative score based on how closely the response aligns with the standard answer.
Type III: This involves a two-step process. (1) Extract and compare numeric values, contributing 50% to the final score. (2) Calculate the similarity between the response and the standard answer, accounting for the other 50% of the score.
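The three rules can be expressed as a single scoring function; the numeric extraction and text-similarity measures below are simplified placeholders, since the paper does not specify its exact similarity method:

```python
# A sketch of the Type I/II/III scoring rules in Section 4.2; the
# similarity metric is a placeholder assumption.
import re
from difflib import SequenceMatcher

def numbers(text: str) -> list[str]:
    return re.findall(r"\d+(?:\.\d+)?", text)

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()        # placeholder metric

def score(response: str, answer: str, qtype: int) -> float:
    """Return a 0-10 score following the three evaluation rules."""
    num_ok = numbers(response) == numbers(answer)
    if qtype == 1:                                    # Type I: exact numeric match
        return 10.0 if num_ok else 0.0
    if qtype == 2:                                    # Type II: descriptive similarity
        return 10.0 * similarity(response, answer)
    # Type III: 50% numeric accuracy, 50% textual similarity
    return 5.0 * num_ok + 5.0 * similarity(response, answer)
```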

4.3. Test Results

To evaluate the impact of different text segmentation methods on our RAG system, we tested fixed-size chunking (FSC), sliding window chunking (SWC), paragraph-based chunking (PBC), sentence-based chunking (SBC), and entropy-optimized dynamic text segmentation (EDTS) on the expert question dataset mentioned in Section 4.1. The scenario where LLMs are directly queried without providing domain-specific data was also tested and included as a comparison baseline.
In this comparative test, we used the Qwen2.5:7B model, selecting the 10 chunks with the closest semantic relevance for each question. The test results, which compare the effectiveness of these methods within our RAG system, are summarized in Figure 7. The EDTS method demonstrates relatively high accuracy across the three problem categories, achieving an accuracy rate of approximately 77% on Type I and Type II problems while maintaining competitive performance on Type III problems; the overall test accuracy reaches 76%. Standards and specifications documents, characterized by standardized textual structures and precise professional descriptions, are term-dense exemplars of unstructured knowledge. Their requirement for unambiguous expression typically yields professional terminology and high determinacy at the sentence level. Although longer texts can potentially carry more information, comparative testing reveals that the FSC-512 configuration shows no accuracy advantage over FSC-128.
The choice of LLM plays a crucial role in determining the outcomes. To understand the differences in performance across model sizes, we performed the tests using six model scales: 1.5B, 3B, 7B, 14B, 32B, and 72B, while keeping the text segmentation method constant. This approach allows us to analyze the effect of model size on performance metrics such as accuracy and runtime. The detailed results are shown in Figure 8, providing insights into how each model performs under identical conditions. The general trend indicates improved overall accuracy with increasing LLM parameters, though Type III problems exhibit a local maximum at the 7B scale followed by a slight reduction in accuracy.
The expansion of LLM scale demonstrates a progressive improvement in question-answering accuracy, while simultaneously incurring a significant increase in execution runtime, as shown in Figure 9. This observation necessitates a balanced consideration of application scenarios, target accuracy thresholds, and real-time performance requirements when selecting model scales for practical deployment. Under current operational conditions, the 7B model therefore represents an optimal balance between accuracy and execution latency.
In the practical application of RAG-enhanced specialized LLMs, a critical consideration is the trade-off between model scale and deployment costs. This study’s performance comparison reveals that while a 72B parameter model offers higher accuracy, its computational resource requirements may exceed the practical constraints of most engineering sites. Conversely, a 7B parameter model achieves acceptable precision levels for common queries, providing an economically viable solution for the majority of application scenarios. For more complex problems, an on-demand invocation strategy of LLMs can be employed to achieve an optimal balance between performance and cost. This approach not only minimizes unnecessary computational resource expenditure but also ensures efficient operation across tasks of varying complexities in real-world applications.
Moreover, the knowledge base developed in this research extends beyond mere knowledge querying; it can be integrated into design or management software to conduct real-time compliance checks of construction plans against the latest industry standards, flagging potential discrepancies. This functionality significantly enhances project management efficiency and mitigates the risk that human error leads to non-compliance. To address specific needs at construction sites, a lightweight retrieval-augmented architecture could be further developed, in which vector databases and certain model parameters are optimized for deployment on end-user devices. This would enable rapid retrieval and generation even in offline environments, such as on handheld devices used by construction and supervision personnel in the field. For instance, workers could access acceptance criteria without internet connectivity, greatly enhancing the convenience and timeliness of information acquisition. By carefully selecting model scales, exploring new application scenarios, and developing application programming interfaces tailored to specific needs, these advanced technologies can be moved effectively from the laboratory to actual engineering sites, fostering the intelligent development of the entire industry.

4.4. Update Strategy

In contemporary engineering and technology fields, the revision of technical standards follows a broadly periodic pattern, and new standards are issued as technology advances. Even so, compared with other types of knowledge available on the web, standard documents are relatively small in scale and their content remains comparatively stable. Updates to the RAG-enhanced LLM knowledge base therefore primarily involve newly added segmented text chunks and their corresponding vector representations, derived from domain-specific standards.
As standards constitute the authoritative corpus of domain knowledge, overlapping validity periods often exist between successive versions, during which both remain applicable. Furthermore, outdated standards embody the historical trajectory of technological development, providing invaluable references for research. Consequently, the update strategy of this knowledge base focuses on appending new documents and their respective text blocks, ensuring that each newly added text block contains explicit version information of the standard. When fed into the large model, the most recent standard text blocks are assigned higher priority so that output results are based as much as possible on the latest available information. Specifically, for the vertical field of engineering technical specifications, the update mechanism of this knowledge base encompasses several key steps:
  • Coexistence of Multiple Versions: Whenever a new version is released or an existing standard is revised, the system stores it as an independent entity rather than simply overwriting the previous version. This approach takes into account both practical application timeliness and compatibility, such as the concurrent execution phases between different versions, while facilitating the tracking of technological evolution. Each document is annotated with metadata recording its effective date, obsolescence status, and revision summary for subsequent querying and tracing purposes.
  • Granular Mapping: Every text block will be accompanied by a version identifier, establishing a detailed mapping relationship from version to clause, which aids in the precise identification and citation of specific content under particular versions.
  • Priority Setting: Upon feeding updated text blocks into the large model, a prioritization mechanism can be implemented to give higher weight to the most recent standard text blocks. This method not only ensures responses are grounded in the latest knowledge but also enhances the system’s sensitivity and responsiveness to industry dynamic changes.
The aforementioned update strategy aims to construct a dynamic knowledge base that reflects current industry requirements while preserving historical perspectives, thereby supporting users in acquiring the most accurate and cutting-edge technical information. Moreover, this strategy provides a framework for future research aimed at exploring more effective ways of managing and utilizing standard literature resources.
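As an illustration of this strategy, each stored chunk can carry version metadata and the newest edition can be boosted at retrieval time; the field names and boost factor below are assumptions:

```python
# A sketch of version-aware chunk storage and priority setting; field
# names, example values, and the boost factor are illustrative.
from dataclasses import dataclass

@dataclass
class VersionedChunk:
    text: str
    standard_id: str      # e.g., "JGJ/T 23"
    version: str          # e.g., "2011"
    effective_date: str
    obsolete: bool = False

def rank_weight(chunk: VersionedChunk, latest_version: str,
                base_similarity: float, boost: float = 1.2) -> float:
    """Multiply retrieval similarity by a boost for the newest edition."""
    return base_similarity * (boost if chunk.version == latest_version else 1.0)
```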

5. Conclusions

This study proposes a domain-specific knowledge base framework for construction engineering based on LLMs and RAG, addressing the critical need for specialized knowledge management. By integrating entropy-optimized dynamic text segmentation, chunk rank based on vector correlation, and iterative optimization prompt engineering, we demonstrated an effective methodology for structuring and embedding engineering standards into LLMs, significantly enhancing the accuracy of domain-specific problem-solving while ensuring scalability and flexibility for industrial applications.
Firstly, by introducing LLMs and reviewing related works, the primary challenges and technological trends in knowledge management within the construction engineering domain are identified. Secondly, in Section 3, the design concept of the knowledge base framework is described in detail, with a particular focus on the entropy-optimized dynamic text segmentation method. This approach adaptively adjusts segmentation strategies based on the entropy of the text, aiming to make the information contained within text chunks as clear as possible. Furthermore, the chunk ranking algorithm based on vector correlation aids in filtering out the most pertinent chunks from a multitude of text fragments, while iteratively optimized prompting techniques further enhance the quality and accuracy of the LLM’s output.
In Section 4, a technical specification within the domain of construction engineering inspection was selected as the external knowledge text to validate the proposed methods. The experimental results demonstrated that the entropy-optimized dynamic text segmentation method improves the accuracy of answering specialized questions compared with traditional methods. Additionally, performance was compared across different scales of LLMs, revealing that while an increase in model parameters leads to enhanced performance, it also comes with a rise in computational costs.
In summary, the knowledge base proposed in this study is capable of integrating domain-specific knowledge into existing general LLMs, effectively addressing the issue of inaccurate retrieval for specialized domain questions, and it provides robust support for subsequent engineering applications. Despite the progress made in this research, several challenges remain to be addressed in its application:
  • The integration of domain knowledge texts is not yet comprehensive and requires the gradual incorporation of numerous standards and specifications within the construction engineering field to form a complete domain knowledge base.
  • There is a need to integrate dynamic domain knowledge, such as construction logs, inspection technology reports, design notes, and so on.
  • Utilizing technologies like knowledge graph [55] or GraphRAG [56] to achieve improved information retrieval, contextual understanding, and personalized responses in query-focused tasks.
Exploring these research directions will propel the advancement of knowledge management in construction engineering towards being more intelligent, lightweight, and real-time, providing reusable technical paradigms for vertical fields such as architecture and civil engineering.

Author Contributions

Conceptualization, H.W. and J.L.; methodology, H.W.; software, H.W.; validation, D.Z., Z.F. and F.Z.; formal analysis, H.W.; investigation, D.Z.; resources, D.Z.; data curation, Z.F.; writing—original draft preparation, H.W.; writing—review and editing, D.Z. and Z.F.; visualization, H.W. and J.L.; supervision, H.W. and J.L.; project administration, J.L. and H.W.; funding acquisition, D.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by “National Key R&D Program of China” (2023YFC3804300).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

All authors were employed by CABR Testing Center Co., Ltd. and China Academy of Building Research Co., Ltd. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Zheng, Z.; Zhou, Y.C.; Chen, K.Y.; Lu, X.Z.; She, Z.T.; Lin, J.R. A text classification-based approach for evaluating and enhancing the machine interpretability of building codes. Eng. Appl. Artif. Intell. 2024, 127, 107207. [Google Scholar] [CrossRef]
  2. Lin, J.R.; Chen, K.Y.; Pan, P. Digital and intelligent standards for building and construction engineering: Current status and future. J. Southeast Univ. (Nat. Sci. Ed.) 2024, 55, 16–29. [Google Scholar]
  3. Liu, Z.S.; Liu, J.J.; Ji, W.Y.; Liu, L. Research on the establishment and application of digital twin-based construction project delivery models. J. Build. Struct. 2024, 45, 97–106. [Google Scholar] [CrossRef]
  4. Lin, J.R.; Chen, K.Y.; Zheng, Z.; Zhou, Y.; Lu, X. Key technologies and applications of intelligent interpretation of building engineering standards interpretation. Eng. Mech. 2025, 42, 1–14. [Google Scholar] [CrossRef]
  5. Jiang, C.; Zheng, Z.; Liang, X.; Lin, J.; Ma, Z.; Lu, X. A new interaction paradigm for architectural design driven by large language model: Proof of concept with Rhino7. J. Graph. 2024, 45, 594–600. [Google Scholar] [CrossRef]
  6. Qin, S.Z.; Zheng, Z.; Gu, Y.; Lu, X. Exploring and Discussion on the Application of Large Language Models in Construction Engineering. Ind. Constr. 2023, 53, 162–169. [Google Scholar] [CrossRef]
  7. Guo, M.Z.; Zhang, X.X.; Zhao, L.L.; Zhang, Q.Y. Seismic Response Prediction of Structures using Large Language Models. Comput. Eng. Appl. 2024, 1–17. [Google Scholar] [CrossRef]
  8. Florence, G.; Kikuchi, M.; Ozono, T. Delay risk detection in road construction projects utilizing large language models. In International Conference on Intelligent Systems Design and Applications; Springer: Cham, Switzerland, 2024. [Google Scholar] [CrossRef]
  9. Jin, X.X.; Lin, X.; Yu, X.R.; Guo, H. Construction progress updating method based on BIM and large language models. Tsinghua Sci. Technol. 2025, 65, 35–44. [Google Scholar] [CrossRef]
  10. Liang, J.; Zhang, L.P.; Yan, S.; Zhao, Y.; Zhang, Y. Research Progress of Named Entity Recognition Based on Large Language Model. J. Comput. Sci. Explor. 2024, 18, 2594–2615. [Google Scholar] [CrossRef]
  11. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
  12. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar]
  13. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar] [CrossRef]
  14. Strubell, E.; Ganesh, A.; McCallum, A. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July 2019. [Google Scholar] [CrossRef]
  15. OpenAI. GPT-4 Technical Report. 2023. Available online: https://cdn.openai.com/papers/gpt-4.pdf (accessed on 1 December 2024).
  16. Gemma Team; Google DeepMind. Gemma: Open Models-Based on Gemini Research and Technology. Available online: https://arxiv.org/pdf/2403.08295 (accessed on 1 December 2024).
  17. Meta. Llama 3: Foundation for the Next Generation of AI. 2024. Available online: https://www.llama.com (accessed on 3 February 2025).
  18. Alibaba. Qwen technical report. arXiv 2023, arXiv:2309.16609. [Google Scholar] [CrossRef]
  19. Baidu. ERNIE 4.0 launch event. 2023. Available online: https://wenxin.baidu.com/ernie (accessed on 1 February 2025).
  20. iFLYTEK. iFlySpark. 2023. Available online: https://xinghuo.xfyun.cn (accessed on 1 February 2025).
  21. Vatsal, S.; Dubey, H. A survey of prompt engineering methods in large language models for different nlp tasks. arXiv 2024, arXiv:2407.12994. [Google Scholar] [CrossRef]
  22. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
  23. Shin, T.; Razeghi, Y.; Iv, R.L.L.; Wallace, E.; Singh, S. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing 2020, Online, 16–20 November 2020; pp. 4222–4235. [Google Scholar] [CrossRef]
  24. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 2023, 55, 1.1–1.35. [Google Scholar] [CrossRef]
  25. Agarwal, U.; Tanmay, K.; Khandelwal, A.; Choudhury, M. Ethical reasoning and moral value alignment of llms depend on the language we prompt them in. arXiv 2024, arXiv:2404.18460. [Google Scholar] [CrossRef]
  26. Jeong, J. Current research and future directions for off-site construction through LangChain with a large language model. Buildings 2024, 14, 2374. [Google Scholar] [CrossRef]
  27. Shi, Z.B.; Zhu, L.Y.; Yue, X.Q. Material Information Extraction based on Local Large Language Model and Prompt Engineering. Data Anal. Knowl. Discov. 2024, 8, 23–31. [Google Scholar]
  28. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.T.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv 2020, arXiv:2005.11401. [Google Scholar] [CrossRef]
  29. Li, X.; Wang, H.; Liu, Z.; Yu, S. Building a coding assistant via the retrieval-augmented language model. ACM Trans. Inf. Syst. 2025; arXiv:2410.16229. [Google Scholar] [CrossRef]
  30. Borgeaud, S.; Mensch, A.; Hoffmann, J.; Cai, T.; Rutherford, E.; Millican, K.; Van Den Driessche, G.B.; Lespiau, J.B.; Damoc, B.; Clark, A.; et al. Improving language models by retrieving from trillions of tokens. arXiv 2022, arXiv:2112.04426. [Google Scholar] [CrossRef]
  31. Zhong, Y.; Leng, Y.; Chen, S.; Li, P.; Zou, Z.; Liu, Y.; Wan, J. Accelerating Battery Research with Retrieval-Augmented Large Language Models: Present and Future. Energy Storage Sci. Technol. 2024, 13, 3214–3225. [Google Scholar] [CrossRef]
  32. Wang, H.Q.; Wei, J.; Jing, H.Y.; Song, H.; Xu, B. Meta-RAG: A Metadata-Driven Retrieval Augmented Generation Framework for the Power Industry. Comput. Eng. 2025, 1–11. [Google Scholar] [CrossRef]
  33. Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.-T. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020, Online, 16–20 November 2020; pp. 6769–6781. [Google Scholar] [CrossRef]
  34. Lee, K.; Chang, M.-W.; Toutanova, K. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL) 2019, Florence, Italy, 28 July–2 August 2019; pp. 6086–6096. [Google Scholar] [CrossRef]
  35. Dai, Z.; Callan, J. Context-aware document term weighting for ad-hoc search. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) 2020, Virtual Event, China, 25–30 July 2020; pp. 49–58. [Google Scholar] [CrossRef]
  36. Zhao, Y.; Jiang, F.; Li, P. A Bert-Based Hierarchical Adjacent Coherence Text Segmentation Method. Comput. Appl. Softw. 2024, 41, 262–268. [Google Scholar] [CrossRef]
  37. Liu, Y.; Lapata, M. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2019, Hong Kong, China, 3–7 November 2019; pp. 3730–3740. [Google Scholar] [CrossRef]
  38. Koshorek, O.; Cohen, A.; Mor, N.; Rotman, M.; Berant, J. Text segmentation as a supervised learning task. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; Volume 2. [Google Scholar] [CrossRef]
  39. Jin, Y.; Huang, J. Improved TFIDF algorithm based on information entropy and word length information. J. Zhejiang Univ. Technol. 2021, 49, 203–209. [Google Scholar] [CrossRef]
  40. Genzel, D.; Charniak, E. Entropy Rate Constancy in Text; Association for Computational Linguistics: Dublin, Ireland, 2002. [Google Scholar] [CrossRef]
  41. Tang, Y.; Deng, J.; Guo, Z. Candidate term boundary conflict reduction method for Chinese geological text segmentation. Appl. Sci. 2023, 13, 4516. [Google Scholar] [CrossRef]
  42. Zhou, X.; Gao, Y.Q.; Fan, J.Y. Research on patent retrieval strategy based on BERT word embedding. J. China Soc. Sci. Tech. Inf. 2023, 42, 1347–1357. [Google Scholar] [CrossRef]
  43. Sufi, F. Generative pre-trained transformer (GPT) in research: A systematic review on data augmentation. Information 2024, 15, 99. [Google Scholar] [CrossRef]
  44. Yang, Y.; Ye, F.; Xu, D.; Zhang, X.; Xue, J. Construction of digital twin water conservancy knowledge graph integrating large language models and prompt learning. J. Comput. Appl. 2024, 1–11. [Google Scholar] [CrossRef]
  45. Zhang, H.Y.; Wang, X.; Han, L.F.; Li, Z.; Chen, Z.; Chen, Z. Research on Question Answering System on the Joint of Knowledge Graph and Large Language Models. J. Front. Comput. Sci. Technol. 2023, 17, 2377–2388. [Google Scholar] [CrossRef]
  46. Johnson, J.; Douze, M.; Jégou, H. Billion-scale similarity search with GPUs. IEEE Trans. Big Data 2019, 7, 535–547. [Google Scholar] [CrossRef]
  47. Wang, S.; He, W.; Wang, F.; Zhao, X.; Zhou, Y. Research on question answering system based on large language model integrating knowledge graph and vector retrieval. Sci. Technol. Eng. 2024, 24, 13902–13910. [Google Scholar] [CrossRef]
  48. Wu, G.D.; Qin, H.; Hu, Q.X.; Wang, X.N.; Wu, Z.C. Research on large language models and personalized recommendation. CAAI Trans. Intell. Syst. 2024, 19, 1351–1365. [Google Scholar] [CrossRef]
  49. Zhao, J.F.; Chen, T.; Wang, X.M.; Feng, C. Information Extraction of Unlabeled Patent Based on Knowledge Self-Distillation of Large Language Model. Data Anal. Knowl. Discov. 2025, 8, 133–143. [Google Scholar] [CrossRef]
  50. Jun, F.E.; Yanghong, C.H.; Jiamin, L.U.; Hailin, T.A.; Zhipeng, L.Y.; Yuchun, Q.I. Construction and Application of Knowledge Graph for Water Engineering Scheduling Based on Large Language Model. J. Front. Comput. Sci. Technol. 2024, 18, 1637–1647. [Google Scholar] [CrossRef]
  51. Peng, W.; Wu, H.; Xu, L. Keyword weight optimization for short text multi-classification based on attention mechanism. J. Comput. Appl. 2021, 41, 19–24. [Google Scholar] [CrossRef]
  52. JGJ/T 23-2011; Technical Specification for Inspecting of Concrete Compressive Strength by Rebound Method. Ministry of Housing and Urban-Rural Development of the People’s Republic of China: Beijing, China, 2011.
  53. Nussbaum, Z.; Morris, J.X.; Duderstadt, B.; Mulyar, A. Nomic embed: Training a reproducible long context text embedder. arXiv 2024, arXiv:2402.01613. [Google Scholar] [CrossRef]
  54. Nomic AI. Nomic-Embed-Text: Reproducible and Transparent Open-Source Text Embeddings. 2024. Available online: https://www.nomic.ai/ (accessed on 1 February 2025).
  55. Li, H.; Yang, R.; Xu, S.; Xiao, Y.; Zhao, H. Intelligent checking method for construction schemes via fusion of knowledge graph and large language models. Buildings 2024, 14, 2502. [Google Scholar] [CrossRef]
  56. Edge, D.; Trinh, H.; Cheng, N.; Bradley, J.; Chao, A.; Mody, A.; Truitt, S.; Metropolitansky, D.; Ness, R.O.; Larson, J. From local to global: A Graph RAG approach to query-focused summarization. arXiv 2024, arXiv:2404.16130. [Google Scholar] [CrossRef]
Figure 1. Domain knowledge base framework.
Figure 2. Schematic diagram of EDTS.
Figure 3. Basic prompt.
Figure 4. Prompt with context integration.
Figure 5. Prompt with instruction optimization and example.
Figure 6. Type distribution of test questions.
Figure 7. Comparison of different segmentation methods (LLM = Qwen2.5:7b, top_k = 10).
Figure 8. Accuracy comparison under different scales of LLMs.
Figure 9. Comparison of execution runtime under different scales of LLMs.
Table 1. Some major LLMs.
No. | Model Name | Developer | Open Source | Release Date | Max Parameters | Key Features
1 | ChatGPT [15] | Open AI | No | 2022.11 | N/A | General-purpose dialogue, advanced reasoning, and long-text generation.
2 | Gemma [16] | Google DeepMind | Yes | 2024.2 | 7B | Lightweight open-source model with safety and efficiency focus.
3 | Llama [17] | Meta | Yes | 2023.7 | 400B | Community-driven open-source model, supports multilingual tasks.
4 | Qwen [18] | Alibaba Cloud | Yes | 2023.8 | 110B | Chinese-optimized, excels in math and coding tasks.
5 | ERNIE Bot [19] | Baidu | No | 2023.3 | N/A | Leading Chinese NLP performance, integrated search enhancement.
6 | iFlySpark [20] | iFLYTEK | No | 2023.5 | N/A | Multimodal interaction, specialized in education and healthcare scenarios.
Notes: “N/A” indicates parameter sizes not explicitly disclosed by developers.
Table 2. Some text segmentation methods.
No. | Method | Implementation Summary | Major Deficiency
1 | Fixed-size chunking | Define a fixed character or word limit (e.g., 512 or 128 words per chunk). Split the text into equal-sized chunks. | Semantic fragmentation, uneven information density [34]
2 | Sliding window chunking | Define a window size (e.g., 256 words) and a step size (e.g., 64 words). Slide the window across the text to create overlapping chunks. | Redundant information, parameter sensitive [35,36]
3 | Paragraph-based chunking | Split text at paragraph boundaries (e.g., line breaks). | Dependence on formatting quality, uneven paragraph length [37]
4 | Sentence-based chunking | Split text at sentence boundaries (e.g., periods or question marks). | Punctuation dependency, information fragmentation [38]


