An Optimized Semantic Matching Method and RAG Testing Framework for Regulatory Texts

Li, Bingjie; Wen, Haolin; Wang, Songyi; Hu, Tao; Liang, Xin; Luo, Xing

doi:10.3390/electronics14142856


by Bingjie Li ¹,†, Haolin Wen ¹,†, Songyi Wang ²,³,†, Tao Hu ¹,*, Xin Liang ¹,* and Xing Luo ²,*

¹ Department of Management Engineering and Equipment Economics, Naval University of Engineering, Wuhan 430033, China
² Peng Cheng Laboratory, Shenzhen 518055, China
³ Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen 518055, China
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.

Electronics 2025, 14(14), 2856; https://doi.org/10.3390/electronics14142856

Submission received: 31 May 2025 / Revised: 4 July 2025 / Accepted: 9 July 2025 / Published: 17 July 2025

(This article belongs to the Special Issue Intelligent Approaches for Solving Software Problems with AI Techniques)


Abstract

To enhance the accuracy and reliability of large language models (LLMs) in regulatory question-answering tasks, this study addresses the complexity and domain-specificity of regulatory texts by designing a retrieval-augmented generation (RAG) testing framework. It proposes a dimensionality reduction-based semantic similarity measurement method and a retrieval optimization approach leveraging information reasoning. By constructing the technical route of an intelligent knowledge management system, the semantic understanding capabilities of multiple mainstream embedding models in the text matching of financial regulations are systematically evaluated. The workflow encompasses data processing, knowledge base construction, embedding model selection, vectorization, recall parameter analysis, and retrieval performance benchmarking. Furthermore, the study introduces a multidimensional scaling (MDS)-based semantic similarity measurement method and a question-reasoning processing technique. Compared to the traditional cosine similarity (CS) metric, these methods significantly improve recall accuracy. Experimental results demonstrate that, under the RAG testing framework, the mxbai-embed-large embedding model combined with MDS similarity calculation, Top-k recall, and information reasoning effectively addresses core challenges such as the structuring of regulatory texts and the generalization of domain-specific terminology. This approach provides a reusable technical solution for optimizing semantic matching in vertical-domain RAG systems, particularly for domains such as law and finance.

Keywords: large language model; retrieval-augmented generation; multidimensional scaling; semantic matching

1. Introduction

In recent years, with the rapid development of LLMs, increasing attention has been paid by both academic and industrial communities to the accuracy of their generated content. Although LLM performance has improved through parameter scaling and training optimization, their outputs have still frequently exhibited factual inaccuracies and outdated domain knowledge. To address these issues, RAG [1,2,3] has emerged as a promising solution and has been increasingly integrated with LLMs. By incorporating dynamic retrieval mechanisms from external knowledge bases, RAG enables LLMs to access real-time, domain-relevant knowledge, thereby mitigating factual errors and significantly improving output reliability.

RAG has evolved from basic implementations (e.g., “Naive RAG”) to advanced architectures such as “Modular RAG” incorporating retrieval optimization techniques and multi-module fusion. These methodologies enhance response accuracy and timeliness within intelligent question-answering systems, including applications in customer service and document generation. However, the effectiveness of RAG is critically dependent on comprehensive knowledge base construction and optimized retrieval pipeline design. This requires multi-stage optimization, including high-quality data source curation, domain-adapted text chunking, and precise semantic retrieval algorithms. In high-stakes domains with strict regulatory requirements, such as law and finance, the accuracy, completeness, and logical rigor of generated outputs directly affect decision-making risks, placing greater performance demands on RAG systems. Compared to general domains, knowledge management in vertical domains has presented three core challenges: knowledge base construction [4], semantic retrieval [5,6], and generation control.

The knowledge base construction process is complex. Professional texts such as financial regulations [7] exhibit high clause interconnectivity and logical rigor. Conventional text segmentation methods [8] tend to fracture critical contextual relationships, resulting in fragmented retrieval outcomes. Concurrently, the high-frequency update characteristic of regulatory provisions requires the knowledge base to support real-time updates and version control capabilities. Recent improvements in knowledge base construction are primarily achieved through two approaches: structured data storage [4,9] and verification of clause-level logical consistency. Common knowledge processing methods include, but are not limited to, text chunking, information extraction [10,11,12], knowledge graphs [13], and Knowledge-Augmented Generation (KAG). Optimization methods for knowledge base construction have also been devised. For instance, dynamic semantic segmentation (such as dependency parsing in the txtai library [14]) has been adopted to address the strong logical coherence of regulatory provisions. This approach has progressively replaced fixed-length text chunking [15] with dynamic semantic segmentation techniques. This method applies dynamic segmentation to knowledge base texts, where clause boundaries are identified through dependency parsing using frameworks such as SpaCy 3.0.0, as proposed by Kumar et al. [16,17]. This ensures logical units maintain integrity as discrete chunks during storage. Regarding knowledge structure construction, multi-hop clause citation networks are developed by integrating information extraction with knowledge graphs [18]. Zuo et al. [19] demonstrated this approach by constructing datasets using the Scrapy framework and storing knowledge graphs in Neo4j, thereby establishing traceable knowledge-augmented structures [20]. Liang et al. [21] proposed KAG, which enhances the synergy between knowledge representation and retrieval through a logic-guided hybrid reasoning engine. Concurrently, Jiang et al. [22] introduced knowledge-augmented dialog generation, which employs a divergent knowledge selector to pre-optimize candidate knowledge and a knowledge-aware decoder to generate contextually coherent, knowledge-rich responses. To ensure the timeliness of regulatory knowledge bases, researchers have introduced a version control mechanism. This approach utilizes twin networks (e.g., MERIT) where the target network caches historical clause representations, achieving smooth version transitions through momentum updates [23] while preventing the inclusion of repealed provisions in the knowledge base. Under this mechanism, outdated provisions are tagged as “abrogated” and replaced by updated versions, enabling automated identification of content modifications during regulatory revisions [24,25].
Semantic retrieval accuracy in general-domain technical solutions is constrained. General-domain embedding models [26] exhibit limited representational capacity for vectorizing domain-specific terminology, resulting in insufficient retrieval accuracy. The implicit logical complexity of regulatory provisions poses significant challenges for both high-dimensional sparse vector representations [27] and traditional cosine similarity algorithms [28,29], leading to critical omissions during regulatory queries and potentially invalidating analytical outcomes. Current research primarily concentrates on domain-adapted embedding models and improvements in retrieval accuracy [30]. Different embedding models demonstrate varied capacities for representing domain-specific terms [31], which necessitates comparative testing of domain-adapted models [32] (e.g., bge-m3, mxbai-embed-large) to improve matching precision and enhance comprehension of context-specific semantics. Furthermore, Weller et al. [5,33,34,35,36] proposed a reranking model that enables multi-level semantic reranking, thereby improving the precision of retrieved passages. Simultaneously, Bai et al. [37] developed a multi-hop reasoning approach incorporating relative temporal encoding to enhance path retrieval capability for complex regulatory logic.
Stringent demands are imposed on response generation control. The generation phase must strictly adhere to inter-provision citation logic, and outputs must be traceable and interpretable to ensure credibility. Regarding generation control, current research efforts are focused on domain-constrained generation [38] and interpretability enhancement [39]. Prompt engineering optimization techniques, exemplified by the Collaborative Legal Expert Framework [40,41], ensure that generated content complies with regulatory formatting and citation logic through intent identification and legal foundation tracing. Additionally, a multi-tier posterior verification mechanism has been introduced [42] to validate provision validity through knowledge graph backtracking. This is further integrated with reasoning path visualization tools, such as ReasonGraph developed by Li et al. [43], which generated reasoning graphs that visualize the full trace: “problem keywords” → “retrieved provisions” → “generation basis”, thereby significantly enhancing the traceability of generated results [44] while markedly improving their credibility and auditability.

A comprehensive analysis of current research reveals a significant disparity between generic RAG frameworks and the temporal accuracy requirements of regulatory knowledge bases. Key limitations include insufficient semantic retrieval precision, inadequate traceability and reliability of generated results, and constrained flexibility in debugging and selecting embedding models. Consequently, research on specialized RAG testing frameworks and optimization methods for regulatory applications remains notably underdeveloped. To address the limited applicability of generic solutions in the regulatory domain, it is essential to fully consider the structural characteristics and domain-specific semantics of regulatory texts, thereby necessitating the development of collaborative retrieval-generation testing and optimization frameworks with enhanced domain adaptability.

Based on these challenges, this study proposes an optimized semantic matching method for regulatory texts and a dedicated RAG testing framework. Specifically, by focusing on RAG testing and semantic matching optimization for regulatory texts, this study investigates the principles of domain-specific adaptation for retrieval-augmented generation technology. A comparative analysis of multiple knowledge base processing strategies and embedding model performance is conducted, providing critical theoretical foundations and methodological guidance for the design of domain-specific RAG systems. Existing research indicates the absence of open-source RAG testing frameworks or software platforms specifically designed for the regulatory domain. The regulatory RAG testing framework developed in this study achieves comprehensive optimization throughout the entire workflow, from knowledge base construction to generation verification. This solution not only resolves technical constraints in regulatory retrieval but also establishes a reusable technical paradigm for developing domain-specific RAG systems.

The main contributions are summarized as follows:

An open-source regulatory RAG testing framework has been constructed. Multiple embedding models are deployed to preprocess knowledge fragments, with similarity computation strategies configured to facilitate testing of semantic matching for regulatory texts.
A semantic matching optimization method based on dimensionality reduction has been proposed. Within the RAG testing framework, semantic matching accuracy for regulatory texts has been enhanced through denoising high-dimensional sparse vectors.
A retrieval optimization method based on information reasoning has been developed. By inferring potential questions, this approach enhances the accuracy of knowledge fragment matching.
The applicability of different knowledge base processing methods and retrieval strategies for regulatory question-answering tasks has been validated using real-world regulatory datasets.

The remainder of this paper is organized as follows. Section 2 introduces the design and functionality of the RAG testing framework. Section 3 compares test data from multiple embedding models. Section 4 proposes a dimensionality reduction-based similarity computation method and evaluates the accuracy of different similarity metrics using various embedding models. Section 5 presents an information reasoning-based retrieval approach, and analyzes its applicability through real-world dataset testing. Section 6 concludes the paper and outlines potential applications and future research directions.

2. RAG Testing Framework Design

To support RAG testing and evaluate the effectiveness of different semantic matching approaches for regulatory texts, this section extends the AI-KM 5.4.18 [41] platform, which already integrates functionalities such as multi-view knowledge management and LLM invocation. In response to domain-specific requirements in the regulatory field, additional modules are developed, including knowledge preprocessing, RAG orchestration, cosine similarity computation, dimensionality reduction-based similarity calculation, and visualization components. The software architecture, centered on RAG testing, is illustrated in Figure 1.

Figure 1 demonstrates the latest functional design of the RAG testing framework. Addressing domain-specific requirements for regulatory text processing, core capabilities including LLM deployment, knowledge preprocessing, and similarity computation have been implemented through enhancements centered around RAG functionality.

LLM Deployment. The software implements localized deployment and invocation of the latest LLMs through the Ollama 0.9.5 platform, with support extended to general-purpose language models (e.g., DeepSeek-R1, LLaMA3.3), embedding models including bge-m3 and nomic-embed-text, and multimodal models such as Gemma3. This architecture enables rapid integration of newly released LLMs while reducing access barriers via standardized deployment interfaces, ensuring operational simplicity and flexibility in model deployment.
Knowledge Base Management. The software converts structured text into multiple interactive interfaces such as tree diagrams, tag-based classification views, knowledge graphs, and mind maps by parsing Markdown documents. Efficient browsing, editing, and maintenance of the knowledge base are enabled through these views. Additionally, the knowledge base can be subjected to various preprocessing operations, including intelligent segmentation of Markdown documents, vectorization using custom embedding models, and semantic enhancement through information reasoning.
RAG Module. The integration of embedding models with similarity computation enables high-precision knowledge retrieval, while domain-specific answers are generated by the LLM based on the knowledge base. Fundamental matching between queries and knowledge fragments is implemented through cosine similarity, and is further enhanced through information reasoning, which extracts latent semantic information. For a given knowledge segment, the software can automatically infer relevant information, or contextual meaning can be expanded through manual annotation, followed by similarity computation that combines the original text with the inferred information. Additionally, the integrated MDS-based semantic similarity measurement method enhances semantic matching capability by eliminating interference from irrelevant semantic dimensions.
Evaluation and Debugging Module. The software implements visual debugging, supporting the graphical display of both cosine similarity computation results and dimensionality reduction results. Configuration of slicing strategies, switching between embedding models, comparison of similarity algorithms, and hidden information editing capabilities are enabled to establish a debugging workflow covering parameter configuration, experiment execution, result visualization, and performance comparison, providing intuitive support for technical optimization decisions.

The corresponding knowledge base processing and RAG testing framework interfaces within the software are presented in Figure 2.

Figure 2 presents the data processing and debugging interface of the RAG testing framework. Users can input queries, view generated responses, and visually inspect the associations between knowledge base fragments and corresponding questions. Parameters such as slicing strategies, embedding models, and similarity computation methods can be adjusted, and the processes of cosine similarity computation and MDS-based semantic similarity measurement are visualized to support transparent evaluation and debugging.

Considering the structured nature of regulatory texts, which typically include nested clauses, domain-specific terminology, and complex logical relationships, a multi-stage knowledge base processing framework has been developed. This framework builds upon the functionalities outlined in Figure 1 and enhances retrieval precision by configuring slicing strategies, embedding models, information reasoning and editing modules, and similarity computation methods at distinct processing stages. It also improves software operability and enhances retrieval visualization capabilities. The complete workflow for knowledge base processing and RAG evaluation is illustrated in Figure 3.

Figure 3 illustrates the knowledge base processing and RAG workflow, comprising the following stages.

The directory containing the target knowledge base is selected. The knowledge base primarily consists of files in Markdown format. Documents in other formats, such as PDF, can be converted using open-source tools such as MinerU 0.4.1.0. Manual verification and editing of Markdown files may be conducted to improve textual accuracy and consistency.
The knowledge base is segmented using predefined delimiters, such as consecutive line breaks. If the segmentation results are determined to be suboptimal, manual refinement of the Markdown structure can be applied to enhance the segmentation quality.
Vectors corresponding to each segmented knowledge fragment are computed using the specified embedding model and stored in the database.
For each knowledge fragment, implicit semantic information is extracted using the configured large language model. By default, this process involves prompt-based question generation to derive potential questions associated with the corresponding content. Alternatively, customized prompt instructions can be applied to elicit semantically enriched outputs. The extracted information may also be manually refined to enhance the semantic granularity of key content. (This step is optional and may be selectively executed depending on task requirements.)
Vectors are computed for both the original knowledge fragments and the associated implicit information using the designated embedding model, and subsequently stored in the database.
For user-submitted queries, similarity computations are performed using either cosine similarity or dimensionality reduction based on MDS. Vectors corresponding to the original knowledge fragments, implicit information, and queries are calculated and ranked in descending order according to similarity scores.
Relevant knowledge fragments are retrieved according to one of three predefined recall strategies: similarity-based recall, top-k recall, or length-constrained recall. Prompt templates are constructed accordingly, and each knowledge fragment is annotated with metadata such as regulatory source and temporal information to improve interpretability and ensure that updated regulations maintain backward traceability.
The finalized prompts are processed by the LLM to generate responses aligned with the regulatory knowledge base, thereby enhancing domain relevance and answer precision.
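The segmentation and prompt-assembly stages of this workflow can be illustrated with a minimal Python sketch. It is not the framework's actual implementation: the function names `split_markdown` and `build_prompt`, the metadata keys `source` and `date`, and the wording of the prompt template are all assumptions introduced here for illustration.

```python
import re


def split_markdown(text):
    """Split a Markdown document into fragments at runs of blank lines.

    Consecutive line breaks are the delimiter named in the workflow; other
    delimiters would need a different pattern.
    """
    parts = re.split(r"\n\s*\n", text.strip())
    return [p.strip() for p in parts if p.strip()]


def build_prompt(question, fragments):
    """Assemble a prompt from recalled fragments annotated with metadata.

    `fragments` is a list of dicts with the hypothetical keys
    'text', 'source' (regulatory source) and 'date' (temporal information).
    """
    blocks = []
    for i, frag in enumerate(fragments, start=1):
        header = f"[{i}] Source: {frag['source']} | Date: {frag['date']}"
        blocks.append(f"{header}\n{frag['text']}")
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the regulatory excerpts below, "
        "and cite the excerpt numbers you rely on.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Keeping the source and date next to each excerpt is what lets a generated answer be traced back to a specific version of a regulation.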

3. Embedding Models Testing

To evaluate the performance of various embedding models within the RAG testing framework and determine the most suitable model for semantic matching of regulatory texts, several embedding models are selected from the Ollama model library for benchmarking. The results provide guidance for optimal embedding model selection and configuration in domain-specific RAG systems.

Embedding models convert textual content into semantic representations in high-dimensional vector space [45,46]. By computing the cosine similarity between user queries and knowledge fragments, the most semantically relevant fragments can be retrieved to support accurate domain-specific answer generation. However, different embedding models vary significantly in their ability to capture general versus domain-specific semantics. To evaluate the performance of different embedding models, the open-source platform Ollama is adopted, which supports efficient installation and deployment of diverse embedding models through standardized interfaces.

Based on model maturity and practical applicability, four embedding models are selected for performance benchmarking using the proposed RAG testing framework. The characteristics of these models are summarized in Table 1.

To evaluate the suitability of different embedding models for semantic matching in the regulatory texts, an experimental procedure is designed to test the models’ ability to interpret regulatory provisions. The experimental setup is defined as follows:

The knowledge base incorporates regulations such as the “Government Procurement Goods and Services Bidding Management Measures” for preliminary testing of LLM and embedding model performance. Multiple embedding models are selected for comparative evaluation.
The test question dataset comprises 50 questions from the financial regulation domain. This dataset is generated using DeepSeek-V3-0324 based on the knowledge base, followed by manual verification.
The selected embedding models are listed in Table 1. These models are evaluated using a streamlined regulatory corpus to assess their performance.
Cosine similarity is employed to calculate the similarity between user-submitted questions and knowledge base segments. The matching accuracy of embedding models is evaluated by setting recall thresholds. A question is considered correctly answered if its corresponding knowledge segment is matched under the specified recall threshold, and accuracy rates are calculated accordingly.
The framework implemented three distinct retrieval strategies: (1) similarity-based retrieval, where knowledge segments are recalled when their similarity scores exceed a predetermined threshold; (2) Top-k retrieval, which selects a fixed number of top-ranked knowledge segments sorted by descending similarity; and (3) length-constrained retrieval, where segments are similarly ranked by similarity but are dynamically aggregated through prompt engineering until reaching a predefined character count limit.
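The three recall strategies and the hit criterion described above can be sketched as follows. This is an illustrative reading of the setup, not the framework's code: the function names are hypothetical, and the length-constrained variant assumes that aggregation stops at the first fragment that would exceed the character limit.

```python
def _by_score(scored):
    """Sort (fragment, similarity) pairs by descending similarity."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def recall_by_similarity(scored, threshold):
    """Recall every fragment whose similarity exceeds the threshold."""
    return [frag for frag, score in _by_score(scored) if score > threshold]


def recall_top_k(scored, k):
    """Recall a fixed number of top-ranked fragments."""
    return [frag for frag, _ in _by_score(scored)[:k]]


def recall_by_length(scored, max_chars):
    """Aggregate top-ranked fragments until the character budget is reached."""
    recalled, used = [], 0
    for frag, _ in _by_score(scored):
        if used + len(frag) > max_chars:
            break  # assumption: stop at the first fragment that overflows
        recalled.append(frag)
        used += len(frag)
    return recalled


def accuracy(recalled_per_question, expected_per_question):
    """Share of questions whose expected fragment appears among the recalled ones."""
    hits = sum(
        expected in recalled
        for recalled, expected in zip(recalled_per_question, expected_per_question)
    )
    return hits / len(expected_per_question)
```

Because the similarity-based variant depends on the absolute score scale of each embedding model while Top-k depends only on the ranking, the same model can look very different under the two strategies.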

Following the aforementioned testing conditions, the collected data are systematically organized and subjected to one-way analysis of variance (ANOVA) across the three datasets. The F-value represents the ratio of between-group variance to within-group variance, while the p-value indicates the probability of observing differences at least this large by chance if the models did not truly differ. A p-value below 0.05 is considered statistically significant. The resulting retrieval accuracy rates are presented in Table 2.

Table 2 presents the test results of various embedding models on regulatory domain question-answering data. Figure 4 is generated from the data in Table 2.
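For readers who want to reproduce the one-way ANOVA, the F statistic can be computed directly from its definition, as in the sketch below. The function name `one_way_anova_f` is hypothetical. Each group would hold one model's accuracy values across the tested recall settings. The p-value additionally requires the F distribution with (k − 1, N − k) degrees of freedom, which a statistics library can supply.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups.

    F = (between-group sum of squares / (k - 1))
        / (within-group sum of squares / (N - k)),
    where k is the number of groups and N the total number of observations.
    """
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(group) / len(group) for group in groups]

    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2 for group, mean in zip(groups, means)
    )
    ss_within = sum(
        (x - mean) ** 2 for group, mean in zip(groups, means) for x in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)
```

A large F means the spread between model means dominates the spread within each model's own measurements, which is the situation reported for similarity-based recall.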

Under the similarity-based recall strategy, the accuracy differences among models demonstrate high statistical significance (F = 24.41, p < 0.001), indicating that model selection significantly impacts retrieval performance. The mxbai model achieves optimal performance, maintaining 100% accuracy across all thresholds, while the bge-m3 and all-minilm models exhibit poorer results, requiring threshold reduction to recall sufficient segments. Analysis reveals substantial variations in similarity score distributions among models (Figure 4: Similarity-based Recall): the mxbai model consistently generates higher similarity scores, enabling effective recall even under strict thresholds. In contrast, bge-m3 shows only 18% accuracy at high thresholds (e.g., 70%), requiring threshold reduction to 46% for performance improvement. This finding demonstrates that the evaluation results of similarity-based recall strategies are highly dependent on the inherent similarity distribution characteristics of each model, making direct cross-model performance comparisons impractical.

Under the Top-k recall strategy, there are no statistically significant differences in accuracy among the models (F = 0.57, p = 0.65), yet their performance rankings remain consistent: mxbai > all-minilm > nomic > bge-m3. The mxbai model demonstrates optimal overall performance when Top-k recall is fixed, though the performance gap with other models narrows significantly (Figure 4: Top-k Recall). By directly controlling the number of retrieved segments, this strategy eliminates interference from similarity distribution disparities, thereby more equitably reflecting models’ ranking capabilities. Consequently, for scenarios requiring a balance between recall and precision, the Top-k recall strategy is recommended as the primary evaluation approach.

Under the length-constrained recall strategy, the performance differences among the models further diminish (F = 1.53, p = 0.245), and all models exhibit relatively low correct-answer rates (Figure 4: Length-constrained Recall). This suggests that length-constrained recall may impair retrieval effectiveness, possibly due to truncating critical information or introducing noise—particularly impacting high-precision models like the mxbai model more significantly. Although the mxbai model maintained a slight advantage, the practical utility of this strategy remains limited and is only recommended for scenarios with strict text-length constraints.

The comprehensive comparison demonstrates that the mxbai model consistently achieves optimal performance across all three strategies, making it particularly suitable for high-precision regulatory retrieval tasks. When employing similarity-based recall, model-specific threshold optimization is recommended. In contrast, the Top-k recall strategy, owing to its superior stability and fairness, is better suited as a benchmark method for model performance evaluation. Length-constrained recall requires cautious implementation to avoid performance degradation caused by information loss. In practical applications, strategy selection should be demand-driven: for high-stakes scenarios, priority should be given to the mxbai model paired with similarity-based recall, while Top-k recall is recommended for general evaluation to enhance result comparability.

4. Semantic Matching Optimization Methodology

Within the RAG testing framework for embedding model configuration strategies, this section proposes a semantic matching optimization method based on dimensionality reduction results. The method enhances conventional cosine similarity computation through systematic testing of alternative similarity calculation strategies, with rigorous analysis conducted to validate the effectiveness of the proposed dimensionality-reduction-enhanced similarity metric for semantic matching tasks.

For high-dimensional sparse vector matching, dimensionality reduction techniques, such as Uniform Manifold Approximation and Projection (UMAP), t-Distributed Stochastic Neighbor Embedding (t-SNE), MDS, and Principal Component Analysis (PCA) [51,52,53,54,55], serve as an effective approach for vector visualization. These methods transform high-dimensional vectors into low-dimensional representations while preserving their primary distance relationships. By reducing noise from irrelevant dimensions in semantic understanding, they consequently enhance retrieval accuracy. Among these dimensionality reduction methods, t-SNE and UMAP prioritize local structure preservation at the expense of global distance comparability. For regulatory semantic matching tasks—where understanding the overall spatial relationships, relative positions, and distance metrics between queries and knowledge segments is crucial—classical MDS and PCA yield more reliable and interpretable distance representations. Furthermore, while t-SNE’s strong dependence on random initialization leads to inconsistent outputs across multiple runs, and UMAP (though more stable than t-SNE) still exhibits result variability, both MDS and PCA guarantee deterministic and reproducible similarity results. Consequently, dimensionality reduction methods based on MDS and PCA are selected for experimental validation.

4.1. Cosine Similarity

In conventional RAG frameworks, the standard approach involves transforming text into vector representations using embedding models. Specifically, the user’s query (denoted as vector $q$) and the knowledge segments (each represented as vector $d_{i}$ for the $i$-th segment) undergo cosine similarity computation. The semantic similarity measurement between two vectors is calculated as follows:

$CS(q, d_{i}) = \frac{q \cdot d_{i}}{\|q\| \cdot \|d_{i}\|}$

(1)
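Equation (1) translates directly into code. The following dependency-free sketch uses the hypothetical names `cosine_similarity` and `rank_fragments`. In practice the vectors would come from an embedding model rather than being written by hand.

```python
import math


def cosine_similarity(q, d):
    """CS(q, d) = (q . d) / (||q|| * ||d||), as in Equation (1)."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_q == 0.0 or norm_d == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_q * norm_d)


def rank_fragments(q, fragments):
    """Return (index, score) pairs sorted by descending CS(q, d_i)."""
    scores = [(i, cosine_similarity(q, d)) for i, d in enumerate(fragments)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Since the measure depends only on the angle between vectors, scaling a query or fragment vector leaves its score unchanged.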

4.2. MDS

The core principle of MDS-based semantic similarity measurement lies in preserving the distance relationships between high-dimensional data points within a low-dimensional space, reconstructing the data using a reduced-dimension distance matrix. This approach aligns with text semantic matching requirements, where the objective is to capture core semantic relationships while eliminating noise from peripheral semantics. Consequently, the optimization strategy for semantic matching involves projecting high-dimensional retrieval vectors into a lower-dimensional space while maintaining their original distance relationships, thereby reducing interference from irrelevant dimensions and improving both the accuracy and efficiency of semantic matching. Within the RAG testing framework, Figure 5 illustrates the computational workflow that employs the classical MDS algorithm for similarity measurement.
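As a rough companion to the steps detailed below, classical MDS on a cosine distance matrix can be sketched with NumPy. Several choices here are assumptions rather than the paper's stated procedure. The function names are hypothetical, and the coordinates are scaled by the square roots of the eigenvalues, which is the textbook convention. The query is also embedded jointly with the fragments for simplicity, instead of being centered and projected separately into an existing space.

```python
import numpy as np


def cosine_distance_matrix(vectors):
    """D_ij = 1 - CS(v_i, v_j) for all pairs of row vectors."""
    v = np.asarray(vectors, dtype=float)
    unit = v / np.linalg.norm(v, axis=1, keepdims=True)
    dist = 1.0 - unit @ unit.T
    np.fill_diagonal(dist, 0.0)
    return np.clip(dist, 0.0, None)  # remove tiny negative rounding errors


def classical_mds(dist, n_components=2):
    """Classical MDS: double-center the squared distances, then eigendecompose."""
    n = dist.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n       # centering matrix H
    b = -0.5 * h @ (dist ** 2) @ h            # inner product matrix B
    eigvals, eigvecs = np.linalg.eigh(b)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


def rank_by_mds(fragment_vectors, query_vector, n_components=2):
    """Rank fragments by Euclidean distance to the query in the reduced space.

    Assumption: the query is stacked with the fragments and embedded jointly.
    """
    stacked = np.vstack([fragment_vectors, query_vector])
    coords = classical_mds(cosine_distance_matrix(stacked), n_components)
    dists = np.linalg.norm(coords[:-1] - coords[-1], axis=1)
    return np.argsort(dists), dists
```

Smaller distances in the reduced space indicate closer semantic matches, so the first index returned by `rank_by_mds` is the best candidate fragment.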

The operational steps of the method using MDS for semantic similarity measurement are as follows:

During the knowledge base processing phase, structured Markdown documents are utilized as the primary data source. These documents are segmented into text fragments based on predefined rules, such as paragraph boundaries, section headings, or fixed-length chunking. Each text fragment is then transformed into a high-dimensional vector $d_{i}$ using designated embedding models (for example, mxbai-embed-large or bge-m3), which are designed to capture deep semantic features. The resulting vectors are stored in a vector database along with their corresponding text fragments to support efficient similarity-based retrieval. To preprocess n vectors of dimension m, a cosine distance matrix $D \in R^{n \times n}$ is computed, where each element is given by the following:

$D_{i j} = 1 - C S (v_{i}, v_{j}) = 1 - \frac{v_{i} \cdot v_{j}}{‖v_{i}‖ ‖v_{j}‖}$

(2)
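Formula (2) can be sketched as follows. The helper name `cosine_distance_matrix` and the toy input are assumptions made for illustration. The clipping step is an added numerical safeguard that the text does not describe.

```python
import numpy as np

def cosine_distance_matrix(V: np.ndarray) -> np.ndarray:
    """D[i, j] = 1 - CS(v_i, v_j) for the rows v_i of V (shape n x m)."""
    unit = V / np.linalg.norm(V, axis=1, keepdims=True)
    D = 1.0 - unit @ unit.T
    # Floating-point rounding can give tiny negative values on the diagonal.
    return np.clip(D, 0.0, 2.0)

# Three toy vectors of dimension 2 (hypothetical values).
V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
D = cosine_distance_matrix(V)
```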
The classical MDS method is then applied for dimensionality reduction. A centering matrix is constructed as follows:

$H = I_{n} - \frac{1}{n} 1 1^{T}$

(3)

and used to double-center the element-wise squared distance matrix $D^{(2)}$, yielding the inner product matrix:

$B = - \frac{1}{2} H D^{(2)} H$

(4)

This transformation ensures that both the row and column sums of the matrix $B$ are zero, which eliminates dependence on the coordinate origin and facilitates the subsequent eigendecomposition. In this context, $I_{n}$ denotes the identity matrix of dimension $n \times n$; $1$ is a column vector of ones with dimension $n \times 1$; $1 1^{T}$ is an $n \times n$ matrix with all entries equal to one; and $D^{(2)}$ represents the element-wise squared distance matrix. Matrix $B$ is then subjected to eigendecomposition:

$B = U Λ U^{T}$

(5)

where $Λ$ is the diagonal matrix of eigenvalues. The eigenvectors associated with the two largest eigenvalues (ranked in descending order) are selected to construct the projection matrix. The low-dimensional coordinate $d_{i}^{M D S}$ of each data point is obtained from the corresponding row of the projection matrix. This approach is justified by the fact that eigenvalues quantify the variance explained along each dimension. Larger eigenvalues correspond to directions with greater variance (thus carrying more significant semantic information), while smaller eigenvalues correspond to less informative directions, which can be interpreted as semantic noise. Selecting only the top eigenvectors enables preservation of the most meaningful geometric relationships while suppressing irrelevant components.
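The double centering and eigendecomposition of Formulas (3) to (5) can be sketched as below. One point is an assumption rather than something stated in the text: the selected eigenvectors are scaled by the square roots of their eigenvalues, which is the usual classical MDS convention for turning eigenvectors into coordinates. Negative eigenvalues, which can occur because cosine distance is not a Euclidean metric, are clipped to zero. The function name `classical_mds` is illustrative.

```python
import numpy as np

def classical_mds(D: np.ndarray, k: int = 2) -> np.ndarray:
    """Map n points to k dimensions from their n x n distance matrix D."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix, Formula (3)
    B = -0.5 * H @ (D ** 2) @ H                # double centering, Formula (4)
    eigvals, eigvecs = np.linalg.eigh(B)       # eigendecomposition, Formula (5)
    top = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
    lam = np.clip(eigvals[top], 0.0, None)     # guard against negative eigenvalues
    # Assumption: coordinates are eigenvectors scaled by sqrt(eigenvalue).
    return eigvecs[:, top] * np.sqrt(lam)

# Sanity check on points whose distances are exactly Euclidean (a 3-4-5 triangle).
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
X = classical_mds(D, k=2)
D_hat = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
```

For Euclidean input such as this triangle, the recovered coordinates reproduce the original pairwise distances up to rotation and translation; for cosine distances the two-dimensional layout is only an approximation.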
The semantic similarity between texts can be obtained by calculating the Euclidean distances between different coordinate points. Following the same methodology, the embedding vector corresponding to the user’s query (denoted as $q$) undergoes centering and projection to be mapped into the low-dimensional space as $q^{M D S}$. This process effectively preserves the original spatial distance relationships in the reduced-dimensional space. Finally, the semantic similarity between the retrieved knowledge segments and the user’s query is determined through distance computation, with detailed operations specified in Formula (6).

$M D S (q, d_{i}) = \frac{1}{1 + {‖q^{M D S} - d_{i}^{M D S}‖}_{2}}$

(6)
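Formula (6) turns a Euclidean distance in the reduced space into a similarity in the interval (0, 1]. A short sketch follows, assuming the low-dimensional coordinates of the query and the fragments have already been computed; the coordinates used here are invented for illustration.

```python
import numpy as np

def mds_similarity(q_mds: np.ndarray, d_mds: np.ndarray) -> np.ndarray:
    """Formula (6): 1 / (1 + ||q_mds - d_i_mds||_2) for every row of d_mds."""
    dist = np.linalg.norm(d_mds - q_mds, axis=1)
    return 1.0 / (1.0 + dist)

# Hypothetical 2-D coordinates: one query and three fragments.
q_mds = np.array([0.0, 0.0])
d_mds = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])
scores = mds_similarity(q_mds, d_mds)   # distances are 0, 5 and 1
```

A fragment that coincides with the query receives a score of 1, and the score decreases monotonically as the distance grows.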

Figure 6 illustrates a schematic representation of the similarity results derived from MDS-based calculations between retrieved knowledge fragments and user queries.

Figure 6 provides a two-dimensional visualization on the left, illustrating the semantic distances between the user query and each individual knowledge fragment. Notably, the clause directly relevant to the query yields a similarity score of 67.17%, ranking second among all retrieved segments. Under the configured Top-k recall threshold, the tender process query is therefore linked to the most relevant regulatory clause, and the system is able to extract and summarize the complete tendering procedure from the retrieved content.

4.3. Principal Component Analysis

In RAG systems, the principal objective of PCA is to project high-dimensional embedding vectors into a lower-dimensional space through linear transformation, preserving the most significant semantic features while reducing noise interference. Similarly to MDS-based semantic similarity measurement, PCA achieves dimension reduction by retaining the principal variance directions of the data. However, PCA directly decomposes the covariance matrix of the original vectors rather than a distance matrix, making it more suitable for processing linearly structured data. The implementation procedure consists of the following steps:

During the knowledge base processing phase, structured Markdown documents are segmented into text fragment collections $D = \{d_{1}, d_{2}, \dots, d_{n}\}$ according to predefined rules. Each text fragment $d_{i}$ is then transformed into a high-dimensional semantic vector $v_{i} = \mathrm{Embedding}(d_{i}) \in R^{m}$ using an embedding model.
The computational procedure involves the following steps: first, compute the mean vector and perform centering: $\bar{v} = \frac{1}{n} \sum_{i = 1}^{n} v_{i}$, $v_{i}' = v_{i} - \bar{v}$, constructing the centered matrix $V' \in R^{n \times m}$.
Next, calculate the covariance matrix: $\Sigma = \frac{1}{n - 1} {V'}^{T} V' \in R^{m \times m}$. Subsequently, decompose the covariance matrix: $\Sigma = P Λ P^{T}$, where $Λ = \mathrm{diag}(λ_{1}, λ_{2}, \dots, λ_{m})$ and $P$ is the eigenvector matrix. The target dimension $k$ is determined based on the cumulative explained variance ratio. Finally, generate the dimensionality reduction projection matrix $P_{k} \in R^{m \times k}$ by selecting the top $k$ eigenvectors, and project into the lower-dimensional space: $v_{i}^{P C A} = P_{k}^{T} v_{i}' \in R^{k}$.
For a user query $q$, its embedding vector is generated in the same way and reduced in dimension: $q^{P C A} = P_{k}^{T} (\mathrm{Embedding}(q) - \bar{v})$. In the low-dimensional space, retrieval is performed using cosine similarity: $P C A (q, d_{i}) = \frac{q^{P C A} \cdot v_{i}^{P C A}}{‖q^{P C A}‖ \cdot ‖v_{i}^{P C A}‖}$.
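The PCA steps above can be sketched as follows. The function names and the default variance ratio of 0.95 are assumptions chosen for illustration; the text does not state which ratio is used to select $k$. The sketch also assumes that no projected vector has zero norm, since the cosine similarity would otherwise be undefined.

```python
import numpy as np

def fit_pca(V: np.ndarray, var_ratio: float = 0.95):
    """Return the mean vector and the projection matrix P_k (m x k)."""
    v_bar = V.mean(axis=0)
    Vc = V - v_bar                                   # centered matrix V'
    cov = Vc.T @ Vc / (V.shape[0] - 1)               # covariance matrix (m x m)
    eigvals, P = np.linalg.eigh(cov)                 # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    eigvals, P = np.clip(eigvals[order], 0.0, None), P[:, order]
    cum = np.cumsum(eigvals) / eigvals.sum()         # cumulative explained variance
    # Smallest k whose cumulative ratio reaches var_ratio (hypothetical default).
    k = min(int(np.searchsorted(cum, var_ratio)) + 1, len(eigvals))
    return v_bar, P[:, :k]

def pca_similarity(q: np.ndarray, V: np.ndarray, v_bar: np.ndarray, P_k: np.ndarray) -> np.ndarray:
    """Cosine similarity between the projected query and each projected vector."""
    q_p = P_k.T @ (q - v_bar)
    V_p = (V - v_bar) @ P_k
    return V_p @ q_p / (np.linalg.norm(V_p, axis=1) * np.linalg.norm(q_p))

# Toy data whose variance lies almost entirely along the first axis.
V = np.array([[2.0, 0.1], [-2.0, -0.1], [4.0, 0.0], [-4.0, 0.0]])
v_bar, P_k = fit_pca(V)
sims = pca_similarity(np.array([1.0, 0.0]), V, v_bar, P_k)
```

With a single retained component, every similarity collapses to +1 or -1, which illustrates how aggressive reduction can discard information that distinguishes fragments.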

4.4. Experimental Analysis

The semantic matching accuracy of CS, MDS, and PCA methods is evaluated using identical query datasets and regulatory knowledge bases, with the resulting accuracy metrics presented in Figure 7.

As shown in Figure 7, with the cosine similarity-based results serving as the baseline, the performance of embedding models, the effectiveness of semantic similarity measurement, and computational efficiency are analyzed, leading to the following conclusions: different embedding models demonstrate varying capabilities in comprehending regulatory texts. The mxbai model exhibits optimal performance, achieving 80% accuracy under the specified experimental conditions (Top-k recall strategy, threshold = 6, MDS), highlighting its superior understanding of regulatory semantics. Within this RAG testing framework, the approach provides quantifiable citation requirements, offering empirical support for recall strategy threshold selection. This solution is recommended for regulatory RAG frameworks, supplemented with a dynamic threshold adjustment mechanism. When either recall volume or accuracy rates reach predetermined thresholds, a manual review process should be initiated to balance operational efficiency with risk control requirements.

Under the Top-k recall strategy, when employing the mxbai embedding model with recall thresholds between 2 and 6, the performance ranking of semantic matching methods follows: MDS > CS > PCA. However, with lower-performing embedding models (e.g., bge-m3 and all-minilm), CS outperforms MDS by 2–16%. This indicates that the MDS method requires higher-quality embedding models to be effective: when used with suboptimal embedding models, it may amplify semantic matching deficiencies. Across all tested embedding models, PCA consistently demonstrated inferior performance. This is primarily attributed to PCA’s inherent limitation in capturing nonlinear semantic relationships, resulting in comparatively poorer matching accuracy.
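To make the role of the Top-k threshold concrete, the sketch below selects the $k$ highest-scoring fragments for each query and counts a query as a hit when its annotated relevant fragment is among them. This hit criterion is an assumption introduced for illustration; the accuracy reported in Figure 7 may be measured differently (for example, on the final generated answer). All names and scores are hypothetical.

```python
import numpy as np

def top_k_recall(scores, k: int) -> np.ndarray:
    """Indices of the k fragments with the highest similarity scores."""
    return np.argsort(-np.asarray(scores))[:k]

def hit_rate(all_scores, relevant_idx, k: int) -> float:
    """Share of queries whose relevant fragment appears in the top k."""
    hits = [rel in top_k_recall(s, k) for s, rel in zip(all_scores, relevant_idx)]
    return sum(hits) / len(hits)

# Two toy queries scored against three fragments (made-up scores).
all_scores = [[0.9, 0.2, 0.5], [0.1, 0.8, 0.3]]
relevant_idx = [2, 0]            # index of the correct fragment for each query
rate_k2 = hit_rate(all_scores, relevant_idx, k=2)
rate_k3 = hit_rate(all_scores, relevant_idx, k=3)
```

Raising the threshold can only keep or increase this hit rate, which matches the monotonic trend of the Top-k rows in Table 2.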

The computational time complexity of these three methods is systematically evaluated. Cosine similarity demonstrates $O(nd)$ complexity. For MDS, the requirement to compute pairwise cosine similarities between all vectors results in $O(n^{2} d)$ complexity, with additional $O(n^{3})$ complexity for singular value decomposition (SVD) during matrix operations. PCA involves mean calculation and centering operations with $O(nd)$ complexity, followed by covariance matrix computation at $O(d^{2} n)$ complexity and its subsequent SVD at $O(d^{3})$ complexity, culminating in principal component projection with $O(2nd)$ complexity (where $n$ represents vector count and $d$ denotes vector dimensionality). Under large-scale knowledge base conditions, both MDS and PCA exhibit significantly lower computational efficiency compared to CS, which may adversely impact practical application performance, necessitating methodological optimizations for deployment scenarios.
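A rough sense of scale can be obtained by plugging numbers into the leading terms above. The sketch below simply sums those terms and ignores constant factors, so the results are order-of-magnitude operation counts, not run times. The value $n = 1000$ is hypothetical, while $d = 1024$ corresponds to the default dimension listed for mxbai-embed-large in Table 1.

```python
def op_counts(n: int, d: int) -> dict:
    """Sum of the leading complexity terms quoted in the text for each method."""
    return {
        "CS": n * d,                                      # O(nd)
        "MDS": n * n * d + n ** 3,                        # O(n^2 d) + O(n^3)
        "PCA": n * d + d * d * n + d ** 3 + 2 * n * d,    # O(nd) + O(d^2 n) + O(d^3) + O(2nd)
    }

counts = op_counts(n=1000, d=1024)
ratio_mds_to_cs = counts["MDS"] / counts["CS"]
```

Under these assumptions both MDS and PCA need on the order of $10^{9}$ elementary operations against roughly $10^{6}$ for cosine similarity, and the MDS terms grow quadratically and cubically with the number of fragments.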

5. Retrieval Method Based on Information Reasoning

To enhance retrieval accuracy for knowledge fragments, the RAG framework incorporates an information reasoning-based optimization method. Within this framework, segmented knowledge fragments undergo processing by a large language model to deduce their potentially implicit information. By default, this hidden information reasoning approach is configured to derive the probable user questions corresponding to each knowledge fragment. The system then computes a composite similarity score by evaluating the relationships between the original knowledge fragment, its inferred information, and the user’s query, ultimately generating a ranked list of knowledge fragments according to the specified recall strategy. The precise computational method for determining this comprehensive similarity metric is formally expressed in Formula (7).

$\mathrm{Merge}(q, v_{i}, u_{i}) = \begin{cases} \frac{\frac{q \cdot v_{i}}{‖q‖ \cdot ‖v_{i}‖} + \frac{q \cdot u_{i}}{‖q‖ \cdot ‖u_{i}‖}}{2}, & \mathrm{CS} \\ \frac{\frac{1}{1 + {‖q^{MDS} - v_{i}^{MDS}‖}_{2}} + \frac{1}{1 + {‖q^{MDS} - u_{i}^{MDS}‖}_{2}}}{2}, & \mathrm{MDS} \\ \frac{\frac{q^{PCA} \cdot v_{i}^{PCA}}{‖q^{PCA}‖ \cdot ‖v_{i}^{PCA}‖} + \frac{q^{PCA} \cdot u_{i}^{PCA}}{‖q^{PCA}‖ \cdot ‖u_{i}^{PCA}‖}}{2}, & \mathrm{PCA} \end{cases}$

(7)

In the formula, $q$ represents the vector corresponding to the user’s query, $v_{i}$ denotes the vector of the knowledge fragment, and $u_{i}$ indicates the vector of the inferred information.
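Formula (7) averages two similarities: query versus fragment, and query versus inferred information. A minimal sketch follows, assuming that for the MDS and PCA branches the three vectors passed in have already been projected into the corresponding low-dimensional space. The function names and toy vectors are illustrative.

```python
import numpy as np

def cs(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dist_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Distance-based similarity 1 / (1 + ||a - b||_2), as in Formula (6)."""
    return float(1.0 / (1.0 + np.linalg.norm(a - b)))

def merge(q: np.ndarray, v: np.ndarray, u: np.ndarray, method: str = "CS") -> float:
    """Formula (7): mean of the query-fragment and query-inferred similarities."""
    if method in ("CS", "PCA"):      # both branches use cosine similarity
        return (cs(q, v) + cs(q, u)) / 2.0
    if method == "MDS":              # inputs assumed to be MDS coordinates
        return (dist_sim(q, v) + dist_sim(q, u)) / 2.0
    raise ValueError(f"unknown method: {method}")

# Toy vectors: the fragment matches the query, the inferred question does not.
q = np.array([1.0, 0.0])
v = np.array([1.0, 0.0])
u = np.array([0.0, 1.0])
score_cs = merge(q, v, u, "CS")
score_mds = merge(q, v, u, "MDS")
```

Because the two terms are weighted equally, a poorly inferred question can pull down the score of an otherwise well-matched fragment.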

To evaluate the applicability of traditional cosine similarity versus MDS-based semantic similarity measurement for similarity computation, six financial regulation documents are selected to test different retrieval approaches. The testing conditions are as follows:

The mxbai embedding model is utilized, which demonstrates optimal performance for this domain according to the tests in Section 3.
The Top-k recall strategy is implemented, as it shows broader applicability based on the test results in Section 3.
The financial regulation knowledge base and query dataset specifications are detailed in Table 3.

Using the corresponding regulatory knowledge base, each query in the dataset is systematically tested to evaluate question-answering performance, yielding the accuracy distribution presented in Figure 8.

The mxbai model with Top-k recall and traditional cosine similarity serves as the baseline for comparison. Figure 8 demonstrates that hidden information reasoning can enhance RAG accuracy in small-scale knowledge bases. Additionally, the data reveals that MDS-based computation achieves higher accuracy than cosine similarity for most regulatory documents.

To investigate the applicability of four retrieval approaches to large-scale financial regulation datasets, a comprehensive study is conducted. The four approaches are traditional cosine similarity (Method Normal CS), cosine similarity with information reasoning (Method Reasoning of CS), MDS-based semantic similarity measurement (Method Normal MDS), and MDS-based semantic similarity measurement with information reasoning (Method Reasoning of MDS). Fifteen categories of financial regulations are collected to construct a domain-specific large-scale knowledge base. For this knowledge base, a corresponding query dataset is generated with DeepSeek and then manually verified. The resulting distribution is presented as a histogram in Figure 9, illustrating the ranking positions of correctly retrieved knowledge fragments across different retrieval methods and reasoning conditions.

The results demonstrate that for large-scale knowledge bases without information reasoning, MDS exhibits lower matching accuracy than CS, as evidenced by the backward shift of its distribution median (the correct fragments appear at later ranking positions). This indicates limited suitability of MDS for large knowledge repositories. However, test data from Section 4.4 and Section 5 confirm the method’s applicability to small-scale knowledge bases.

Semantic information reasoning significantly enhances the effectiveness of traditional CS matching, as evidenced by the highest and most forward-shifted distribution peak, which indicates that the correct knowledge fragments achieve higher similarity rankings under this approach. However, information reasoning does not improve the matching accuracy of MDS-based semantic similarity measurement and even has a detrimental effect. The main reason is that, while information reasoning combined with MDS enriches the semantics of the correct knowledge fragments, it also enriches the semantic representations of the other knowledge fragments and aligns them toward more generic information. Consequently, the correct knowledge fragments are shifted backward in the similarity ranking, which reduces matching accuracy.

Therefore, for large-scale knowledge bases, semantic information reasoning can be employed to process the knowledge base. Cosine similarity is then used to compute the similarity between each knowledge fragment and the user’s query, and the most relevant knowledge fragments are selected. Subsequently, MDS is applied to optimize the ranking of these top-ranked fragments, further improving the accuracy of the retrieval process. This approach is conceptually similar to re-rank models, in which critical information is reordered to enhance the accuracy of semantic matching, and additional computational time is allocated to key processing stages to achieve higher precision. Notably, this solution requires no additional training. For domain-specific RAG systems, implementation after thorough testing is recommended.
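The combination described above (cosine similarity to select candidates, then MDS to re-rank them) can be sketched as follows. This is an illustrative reading of the idea, not the implementation used in the study. The query is embedded jointly with the candidates as one extra point, which is a simplifying assumption, and the parameters `n_candidates` and `k` as well as the random test data are hypothetical.

```python
import numpy as np

def two_stage_retrieve(q: np.ndarray, V: np.ndarray, n_candidates: int = 20, k: int = 6):
    """Stage 1: cosine-similarity shortlist. Stage 2: MDS-based re-ranking."""
    unit_V = V / np.linalg.norm(V, axis=1, keepdims=True)
    unit_q = q / np.linalg.norm(q)
    cand = np.argsort(-(unit_V @ unit_q))[:n_candidates]   # stage 1 shortlist

    # Stage 2: classical MDS on the shortlist plus the query (last row).
    pts = np.vstack([unit_V[cand], unit_q])
    D = np.clip(1.0 - pts @ pts.T, 0.0, 2.0)                # cosine distances
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (D ** 2) @ H
    w, U = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:2]                           # two largest eigenvalues
    X = U[:, top] * np.sqrt(np.clip(w[top], 0.0, None))     # 2-D coordinates

    sim = 1.0 / (1.0 + np.linalg.norm(X[:-1] - X[-1], axis=1))
    order = np.argsort(-sim)[:k]
    return cand[order], sim[order]

# Synthetic data standing in for fragment and query embeddings.
rng = np.random.default_rng(0)
V = rng.normal(size=(50, 16))
q = V[7] + 0.01 * rng.normal(size=16)
idx, sims = two_stage_retrieve(q, V, n_candidates=10, k=3)
```

Because the $O(n^{3})$ eigendecomposition now runs only on the shortlist, its cost depends on `n_candidates` instead of on the size of the whole knowledge base.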

6. Conclusions

To enhance the accuracy of semantic matching for regulatory texts, it is essential to address three core challenges in RAG technology: knowledge base construction, semantic retrieval, and generation control. First, an open-source RAG testing framework specifically designed for regulatory domains is developed. Subsequently, within this framework, comprehensive performance evaluations are conducted on domain-specific embedding models. Furthermore, a dimensionality-reduction-based semantic matching optimization method is proposed, along with an information-reasoning-enhanced retrieval approach that improves matching accuracy by extracting latent semantic information from knowledge fragments. Finally, the effectiveness of this combined strategy in enhancing regulatory text semantic matching is rigorously validated across multiple real-world regulatory datasets. Specifically, for regulatory texts, chapter-based segmentation effectively addresses the issue of logical unit fragmentation caused by conventional text splitting methods. Regarding embedding model selection, the mxbai-embed-large model demonstrates superior semantic comprehension of regulatory terminology, significantly outperforming other embedding models. The proposed MDS-based semantic similarity measurement method enhances the identification accuracy of key regulatory provisions in small-scale knowledge bases by preserving high-dimensional semantic relationships while eliminating semantic noise. The quantity-based recall strategy proves effective in adapting to different embedding models while maintaining consistent retrieval quality. Additionally, the accuracy of semantic matching can be further improved through latent information inference processing of the knowledge base, which deduces potential user queries in advance.

This study adopts financial regulation question-answering as a representative scenario, implementing dynamic knowledge base management to achieve timely updates of regulatory provisions. The RAG testing framework systematically validated the implementation effectiveness of various combined approaches, with the optimal solution demonstrating significant practical value in real-world regulatory question-answering applications. Currently deployed in reimbursement and auditing scenarios, the framework has effectively addressed critical challenges including imprecise professional term matching and low accuracy in RAG systems. The open-source RAG testing framework achieves comprehensive optimization across the entire “knowledge processing-retrieval-generation” pipeline. It supports dynamic configuration of text segmentation strategies and embedding models, incorporates a visual debugging interface, and implements hidden information enhancement mechanisms—leveraging LLMs to inversely deduce implicit queries or supplement critical information, thereby expanding semantic coverage. This framework ensures interpretable retrieval processes while providing multi-dimensional evaluation metrics for comparative analysis, significantly lowering the barrier to RAG testing. By facilitating industry–academia–research collaboration, it accelerates the translation of regulatory intelligence applications from laboratory research to industrial practice, delivering reliable technical infrastructure and implementation pathways for sectoral digital transformation.

Future research will focus on developing multi-phase and hybrid recall strategies, along with multimodal regulatory text analysis, to further refine the lifecycle management mechanisms for regulatory knowledge and RAG methodologies. Additionally, during embedding model training, MDS-based semantic noise filtration could be implemented to preserve core semantic relationships and enhance model performance. This study establishes a replicable methodological framework for domain-specific RAG systems, with technical approaches offering significant reference value for processing highly specialized texts in regulatory, financial, and healthcare domains.

Author Contributions

Conceptualization, H.W., S.W. and B.L.; methodology, S.W.; software, H.W.; validation, H.W.; formal analysis, B.L.; resources, T.H., X.L. (Xin Liang) and X.L. (Xing Luo); data curation, B.L.; writing—original draft preparation, H.W., B.L. and S.W.; writing—review and editing, T.H., X.L. (Xin Liang) and X.L. (Xing Luo); supervision, T.H., X.L. (Xin Liang) and X.L. (Xing Luo); project administration, S.W.; funding acquisition, T.H., X.L. (Xin Liang) and X.L. (Xing Luo). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Theoretical research independently initiated project, grant number BHJ23C001; the Advanced Interdisciplinary Project of PCL, grant number 2025qyb011; the Major Key Project of PCL, grant number PCL2023A09; the Educational Commission of Guangdong Province, grant number 2021ZDZX1069; Southern Key Laboratory of Technology Finance (Guangdong).

Data Availability Statement

The data presented in this study are available from the GitHub repository at https://github.com/wh1207/Knowledge (accessed on 8 July 2025), and data will be made available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Ji, S.; Liu, L.; Xi, J.; Zhang, X.; Li, X. KLR-KGC: Knowledge-Guided LLM Reasoning for Knowledge Graph Completion. Electronics 2024, 13, 5037.
2. Fan, W.; Ding, Y.; Ning, L.; Wang, S.; Li, H.; Yin, D.; Chua, T.-S.; Li, Q. A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 6491–6501.
3. He, Y.; Zhu, X.; Li, D.; Wang, H. Enhancing Large Language Models for Specialized Domains: A Two-Stage Framework with Parameter-Sensitive LoRA Fine-Tuning and Chain-of-Thought RAG. Electronics 2025, 14, 1961.
4. Yang, W.; Some, L.; Bain, M.; Kang, B. A comprehensive survey on integrating large language models with knowledge-based methods. Knowl.-Based Syst. 2025, 318, 113503.
5. Silva, L.; Barbosa, L. Improving dense retrieval models with LLM augmented data for dataset search. Knowl.-Based Syst. 2024, 294, 111740.
6. Ghali, M.-K.; Farrag, A.; Won, D.; Jin, Y. Enhancing knowledge retrieval with in-context learning and semantic search through generative AI. Knowl.-Based Syst. 2025, 311, 113047.
7. Araci, D. FinBERT: Financial Sentiment Analysis with Pre-trained Language Models. arXiv 2019, arXiv:1908.10063.
8. Chen, Y.; Wang, W.; Liu, Z.; Lin, X. Keyword search on structured and semi-structured data. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, Providence, RI, USA, 29 June–2 July 2009; pp. 1005–1010.
9. Chang, F.; Dean, J.; Ghemawat, S.; Hsieh, W.C.; Wallach, D.A.; Burrows, M.; Chandra, T.; Fikes, A.; Gruber, R.E. Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst. 2008, 26, 4.
10. Grishman, R. Information Extraction. IEEE Intell. Syst. 2015, 30, 8–15.
11. Nasar, Z.; Jaffry, S.W.; Malik, M.K. Information extraction from scientific articles: A survey. Scientometrics 2018, 117, 1931–1990.
12. Singh, S. Natural Language Processing for Information Extraction. arXiv 2018, arXiv:1807.02383.
13. Ji, S.; Pan, S.; Cambria, E.; Marttinen, P.; Yu, P.S. A Survey on Knowledge Graphs: Representation, Acquisition, and Applications. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 494–514.
14. Bayer, O.; Ulu, E.N.; Sarkın, Y.; Sütçü, E.; Çelik, D.B.; Karamanlıoglu, A.; Karakaya, I.; Demirel, B. A REGNLP Framework: Developing Retrieval-Augmented Generation for Regulatory Document Analysis. In Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025), Abu Dhabi, United Arab Emirates, 19–24 January 2025; 97p.
15. Xia, W.; Zou, X.; Jiang, H.; Zhou, Y.; Liu, C.; Feng, D.; Hua, Y.; Hu, Y.; Zhang, Y. The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems. IEEE Trans. Parallel Distrib. Syst. 2020, 31, 2017–2031.
16. Jugran, S.; Kumar, A.; Tyagi, B.S.; Anand, V. Extractive Automatic Text Summarization using SpaCy in Python & NLP. In Proceedings of the 2021 International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), Greater Noida, India, 4–5 March 2021; pp. 582–585.
17. Kumar, M.; Chaturvedi, K.K.; Sharma, A.; Arora, A.; Farooqi, M.S.; Lal, S.B.; Lama, A.; Ranjan, R. An Algorithm for Automatic Text Annotation for Named Entity Recognition Using SpaCy Framework; Research Square: Durham, NC, USA, 2023.
18. Kau, A.; He, X.; Nambissan, A.; Astudillo, A.; Yin, H.; Aryani, A. Combining Knowledge Graphs and Large Language Models. arXiv 2024, arXiv:2407.06564.
19. Zuo, J.; Niu, J. Construction of Journal Knowledge Graph Based on Deep Learning and LLM. Electronics 2025, 14, 1728.
20. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, H.; Wang, H. Retrieval-augmented generation for large language models: A survey. arXiv 2023, arXiv:2312.10997.
21. Liang, L.; Bo, Z.; Gui, Z.; Zhu, Z.; Zhong, L.; Zhao, P.; Sun, M.; Zhang, Z.; Zhou, J.; Chen, W.; et al. KAG: Boosting LLMs in Professional Domains via Knowledge Augmented Generation. In Proceedings of the Companion Proceedings of the ACM on Web Conference 2025, Sydney, NSW, Australia, 28 April–2 May 2025; pp. 334–343.
22. Jiang, B.; Yang, J.; Yang, C.; Zhou, W.; Pang, L.; Zhou, X. Knowledge Augmented Dialogue Generation with Divergent Facts Selection. Knowl.-Based Syst. 2020, 210, 106479.
23. Jin, M.; Zheng, Y.; Li, Y.-F.; Gong, C.; Zhou, C.; Pan, S. Multi-Scale Contrastive Siamese Networks for Self-Supervised Graph Representation Learning. arXiv 2021, arXiv:2105.05682.
24. Aumiller, D.; Almasian, S.; Lackner, S.; Gertz, M. Structural text segmentation of legal documents. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, São Paulo, Brazil, 21–25 June 2021; pp. 2–11.
25. Balinsky, A.; Balinsky, H.; Simske, S. Rapid change detection and text mining. In Proceedings of the 2nd Conference on Mathematics in Defence (IMA), Defence Academy, Swindon, UK, 20 October 2011.
26. Zhang, L.; Xiang, T.; Gong, S. Learning a Deep Embedding Model for Zero-Shot Learning. arXiv 2016, arXiv:1611.05088.
27. Peng, Q.; Cao, B.; Xie, X.; Ye, H.; Liu, J.; Li, Z. LLMSRec: Large language model with service network augmentation for web service recommendation. Knowl.-Based Syst. 2025, 323, 113710.
28. Li, B.; Han, L. Distance Weighted Cosine Similarity Measure for Text Classification. In Proceedings of the Intelligent Data Engineering and Automated Learning—IDEAL 2013, Hefei, China, 20–23 October 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 611–618.
29. Xia, P.; Zhang, L.; Li, F. Learning similarity with cosine similarity ensemble. Inf. Sci. 2015, 307, 39–52.
30. Ma, L.; Zhou, Y.; Ma, Y.; Yu, G.; Li, Q.; He, Q.; Pei, Y. Defying Multi-model Forgetting in One-shot Neural Architecture Search Using Orthogonal Gradient Learning. IEEE Trans. Comput. 2025, 74, 1678–1689.
31. Nguyen, D.Q.; Sirts, K.; Qu, L.; Johnson, M. STransE: A novel embedding model of entities and relationships in knowledge bases. arXiv 2016, arXiv:1606.08140.
32. Kadhim, A.K.; Jiao, L.; Shafik, R.; Granmo, O.-C. Omni TM-AE: A Scalable and Interpretable Embedding Model Using the Full Tsetlin Machine State Space. arXiv 2025, arXiv:2505.16386.
33. Weller, O.; Ricci, K.; Yang, E.; Yates, A.; Lawrie, D.; Van Durme, B. Rank1: Test-Time Compute for Reranking in Information Retrieval. arXiv 2025, arXiv:2502.18418.
34. Liu, Q.; Wang, B.; Wang, N.; Mao, J. Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models. In Proceedings of the ACM on Web Conference 2025, Sydney, NSW, Australia, 28 April–2 May 2025; pp. 4274–4283.
35. Tymoshenko, K.; Moschitti, A. Shallow and Deep Syntactic/Semantic Structures for Passage Reranking in Question-Answering Systems. ACM Trans. Inf. Syst. 2018, 37, 8.
36. Aktolga, E.; Allan, J.; Smith, D.A. Passage Reranking for Question Answering Using Syntactic Structures and Answer Types. In Advances in Information Retrieval, ECIR 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 617–628.
37. Bai, L.; Xiao, Q.; Zhu, L. Multi-hop path reasoning of temporal knowledge graphs based on generative adversarial imitation learning. Knowl.-Based Syst. 2025, 316, 113421.
38. Feng, Y.; Li, C.; Ng, V. Legal Judgment Prediction via Event Extraction with Constraints. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; pp. 648–664.
39. Cao, X.; Liu, Y.; Sun, F. Predict, pretrained, select and answer: Interpretable and scalable complex question answering over knowledge bases. Knowl.-Based Syst. 2023, 278, 110820.
40. Li, B.; Fan, S.; Zhu, S.; Wen, L. CoLE: A collaborative legal expert prompting framework for large language models in law. Knowl.-Based Syst. 2025, 311, 113052.
41. Li, Y.; Zhang, S.; Sun, J.; Du, Y.; Wen, Y.; Wang, X.; Pan, W. Cooperative Open-ended Learning Framework for Zero-Shot Coordination. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Proceedings of Machine Learning Research. pp. 20470–20484.
42. Guan, X.; Liu, Y.; Lin, H.; Lu, Y.; He, B.; Han, X.; Sun, L. Mitigating large language model hallucinations via autonomous knowledge graph-based retrofitting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; pp. 18126–18134.
43. Li, Z.; Shareghi, E.; Collier, N. ReasonGraph: Visualisation of Reasoning Paths. arXiv 2025, arXiv:2503.03979.
44. Desnos, A. Android: Static analysis using similarity distance. In Proceedings of the 2012 45th Hawaii International Conference on System Sciences, Maui, HI, USA, 4–7 January 2012; pp. 5394–5403.
45. Laiho, M.; Poikonen, J.H.; Kanerva, P.; Lehtonen, E. High-dimensional computing with sparse vectors. In Proceedings of the 2015 IEEE Biomedical Circuits and Systems Conference (BioCAS), Atlanta, GA, USA, 22–24 October 2015; pp. 1–4.
46. Chen, J.; Yang, S.; Wang, Z.; Mao, H. Efficient Sparse Representation for Learning With High-Dimensional Data. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 4208–4222.
47. mxbai-embed-large-v1. Available online: https://gitcode.com/hf_mirrors/ai-gitcode/mxbai-embed-large-v1 (accessed on 28 May 2025).
48. Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. In Proceedings of the 62nd Annual Meeting of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 2318–2335.
49. Yin, C.; Zhang, Z. A Study of Sentence Similarity Based on the All-minilm-l6-v2 Model with “Same Semantics, Different Structure” After Fine Tuning. In Proceedings of the 2024 2nd International Conference on Image, Algorithms and Artificial Intelligence (ICIAAI 2024), Singapore, 9–11 August 2024; pp. 677–684.
50. Nussbaum, Z.; Morris, J.X.; Duderstadt, B.; Mulyar, A. Nomic Embed: Training a Reproducible Long Context Text Embedder. arXiv 2024, arXiv:2402.01613.
51. McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2018, arXiv:1802.03426.
52. Sainburg, T.; McInnes, L.; Gentner, T.Q. Parametric UMAP Embeddings for Representation and Semisupervised Learning. Neural Comput. 2021, 33, 2881–2907.
53. Wang, Y.; Huang, H.; Rudin, C.; Shaposhnik, Y. Understanding how dimension reduction tools work: An empirical approach to deciphering t-SNE, UMAP, TriMAP, and PaCMAP for data visualization. J. Mach. Learn. Res. 2021, 22, 1–73.
54. Allaoui, M.; Kherfi, M.L.; Cheriet, A. Considerably Improving Clustering Algorithms Using UMAP Dimensionality Reduction Technique: A Comparative Study. In Image and Signal Processing; Springer: Cham, Switzerland, 2020; pp. 317–325.
55. Arora, S.; Hu, W.; Kothari, P.K. An Analysis of the t-SNE Algorithm for Data Visualization. In Proceedings of the 31st Conference On Learning Theory, Stockholm, Sweden, 6–9 July 2018; Proceedings of Machine Learning Research. pp. 1455–1462.

Figure 1. RAG testing illustration.

Figure 2. RAG testing interface.

Figure 3. Knowledge base processing and RAG workflow.

Figure 4. The variation in answer accuracy rates across different retrieval strategies under varying threshold conditions.

Figure 5. MDS-based similarity computation process.

Figure 6. The MDS-based similarity computation results.

Figure 7. Matching accuracy of different semantic similarity computation methods.

Figure 8. Ranking distribution of similarity scores computed by different approaches.

Figure 9. Rank distribution of computed similarity metrics (large-scale dataset).

Table 1. Characteristics of different embedding models.

Model	Parameters	Default Dimensions	Context Length	Features
mxbai-embed-large	335M	1024	512 tokens	The model supports multilingual retrieval and is well-suited for short-text tasks, though its capability for processing long texts is relatively limited [47].
bge-m3	420M	1024	8192 tokens	The model demonstrates strong capabilities in long-text processing and cross-lingual tasks, making it particularly suitable for knowledge base question-answering, though its larger model size results in higher resource consumption [48].
all-minilm	33M	384	256 tokens	The model supports deployment on lightweight devices with primary English-language support, though it exhibits performance degradation on long-text processing [49].
nomic-embed-text	137M	768	8192 tokens	The model features transparent training data and processes, while demonstrating average performance on complex tasks [50].

Table 2. Accuracy rates under different recall strategies and threshold conditions (cosine similarity method).

Recall Strategy	Embedding Model	Threshold
		0.58	0.55	0.52	0.49	0.46
Similarity-based recall	mxbai	100.00%	100.00%	100.00%	100.00%	100.00%
	bge-m3	18.00%	20.00%	24.00%	30.00%	46.00%
	all-minilm	14.00%	24.00%	36.00%	54.00%	70.00%
	nomic	58.00%	70.00%	84.00%	98.00%	100.00%
		$F = 24.41, p < 0.001$
		2	3	4	5	6
Top-k recall	mxbai	48.00%	54.00%	62.00%	68.00%	78.00%
	bge-m3	24.00%	40.00%	54.00%	62.00%	72.00%
	all-minilm	40.00%	50.00%	58.00%	66.00%	74.00%
	nomic	36.00%	44.00%	52.00%	62.00%	74.00%
		$F = 0.57, p = 0.65$
		2500	3000	3500	4000	4500
Length-constrained recall	mxbai	38.00%	42.00%	48.00%	50.00%	54.00%
	bge-m3	30.00%	38.00%	40.00%	40.00%	44.00%
	all-minilm	32.00%	34.00%	38.00%	46.00%	50.00%
	nomic	28.00%	38.00%	42.00%	42.00%	46.00%
		$F = 1.53, p = 0.245$
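
The F values reported in Table 2 can be checked directly from the tabulated accuracies. The sketch below assumes that each F value comes from a one-way analysis of variance in which the four embedding models are the groups and the five threshold settings are the observations within each group. That reading is inferred from the numbers and is not stated in the table itself. The function name `one_way_anova_f` and the `TABLE2` dictionary are illustrative only.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: spread of observations around their group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Accuracy values (%) copied from Table 2, one list per embedding model.
TABLE2 = {
    "Similarity-based recall": {
        "mxbai": [100, 100, 100, 100, 100],
        "bge-m3": [18, 20, 24, 30, 46],
        "all-minilm": [14, 24, 36, 54, 70],
        "nomic": [58, 70, 84, 98, 100],
    },
    "Top-k recall": {
        "mxbai": [48, 54, 62, 68, 78],
        "bge-m3": [24, 40, 54, 62, 72],
        "all-minilm": [40, 50, 58, 66, 74],
        "nomic": [36, 44, 52, 62, 74],
    },
    "Length-constrained recall": {
        "mxbai": [38, 42, 48, 50, 54],
        "bge-m3": [30, 38, 40, 40, 44],
        "all-minilm": [32, 34, 38, 46, 50],
        "nomic": [28, 38, 42, 42, 46],
    },
}

for strategy, rows in TABLE2.items():
    f_value = one_way_anova_f(list(rows.values()))
    print(f"{strategy}: F = {f_value:.2f}")
# Similarity-based recall: F = 24.41
# Top-k recall: F = 0.57
# Length-constrained recall: F = 1.53
```

Under this assumption the recomputed values agree with the three F statistics in the table, with 3 and 16 degrees of freedom in each case. Only similarity-based recall shows a large between-model effect relative to the variation across thresholds.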

Table 3. Configuration of the regulatory knowledge base and query dataset.

Regulation	Name of Regulation	Number of Fragments	Number of Questions
Regulation 1	Regulations on the Bidding and Tendering Process for Government Procurement of Goods and Services	12	50
Regulation 2	Financial Department Supervision Measures	6	36
Regulation 3	Fiscal Bill Management Measures	7	35
Regulation 4	Management Measures for Loans and Grants from International Financial Institutions and Foreign Governments	7	39
Regulation 5	Administrative Measures for the Transfer of State-owned Assets in Financial Enterprises	8	64
Regulation 6	Approval and Supervision Measures for Asset Appraisal Institutions	6	48

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

