Enhanced Semantic Retrieval with Structured Prompt and Dimensionality Reduction for Big Data

Kim, Donghyeon; Park, Minki; Lee, Jungsun; Lee, Inho; Jin, Jeonghyeon; Sung, Yunsick

doi:10.3390/math13152469

Open AccessArticle

Enhanced Semantic Retrieval with Structured Prompt and Dimensionality Reduction for Big Data

by

Donghyeon Kim

^1,†,

Minki Park

^1,†,

Jungsun Lee

^1,†,

Inho Lee

²

,

Jeonghyeon Jin

²

and

Yunsick Sung

^2,*

¹

Department of Computer Information and Communication Engineering, Dongguk University-Seoul, Seoul 04620, Republic of Korea

²

Department of Computer Science and Artificial Intelligence, Dongguk University-Seoul, Seoul 04620, Republic of Korea

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Mathematics 2025, 13(15), 2469; https://doi.org/10.3390/math13152469

Submission received: 4 July 2025 / Revised: 25 July 2025 / Accepted: 29 July 2025 / Published: 31 July 2025

(This article belongs to the Special Issue Big Data Analysis, Computing and Applications)

Download

Browse Figure

Versions Notes

Abstract

The exponential increase in textual data generated across sectors such as healthcare, finance, and smart manufacturing has intensified the need for effective Big Data analytics. Large language models (LLMs) have become critical tools because of their advanced language processing capabilities. However, their static nature limits their ability to incorporate real-time and domain-specific knowledge. Retrieval-augmented generation (RAG) addresses these limitations by enriching LLM outputs through external content retrieval. Nevertheless, traditional RAG systems remain inefficient, often exhibiting high retrieval latency, redundancy, and diminished response quality when scaled to large datasets. This paper proposes an innovative structured RAG framework specifically designed for large-scale Big Data analytics. The framework transforms unstructured partial prompts into structured semantically coherent partial prompts, leveraging element-specific embedding models and dimensionality reduction techniques, such as principal component analysis. To further improve the retrieval accuracy and computational efficiency, we introduce a multi-level filtering approach integrating semantic constraints and redundancy elimination. In the experiments, the proposed method was compared with structured-format RAG. After generating prompts utilizing two methods, silhouette scores were computed to assess the quality of embedding clusters. The proposed method outperformed the baseline by improving the clustering quality by 32.3%. These results demonstrate the effectiveness of the framework in enhancing LLMs for accurate, diverse, and efficient decision-making in complex Big Data environments.

Keywords:

big data; large language models (LLMs); retrieval-augmented generation (RAG); structured prompt; semantic embedding; dimensionality reduction; principal component analysis (PCA); conditional filtering

MSC:

68T07; 68T30; 68T05; 68T50

1. Introduction

The exponential growth of textual data across sectors, such as healthcare, finance, and smart manufacturing, fueled by the widespread adoption of IoT devices, digital services, and user-generated content, has made the effective management and insightful utilization of Big Data critical [1,2]. As datasets expand in scale and complexity, the need to employ advanced analytical methods, particularly deep learning and large language models (LLMs), to derive actionable insights has intensified.

LLMs, renowned for their advanced language understanding and generation capabilities, have become indispensable tools for analyzing large-scale textual datasets. They support data-driven decision-making by transforming complex unstructured information into coherent contextually appropriate outputs. Nevertheless, despite their strengths, LLMs encounter critical limitations. Their static nature and lack of integration with real-time, dynamic, and domain-specific knowledge hinders their applicability in practical large-scale Big Data environments [3,4,5,6]. Domain-specific models such as BloombergGPT [7] and BioBERT [8] exemplify attempts to mitigate this issue through specialization.

Retrieval-augmented generation (RAG) architectures have emerged as a promising solution to address these limitations. By retrieving relevant external content from vast embedded textual datasets stored in specialized vector databases (vector stores), RAG enhances the factual accuracy and contextual relevance of LLM-generated outputs [9,10]. Nonetheless, traditional RAG systems face notable inefficiencies and a decline in accuracy when scaled to handle substantial volumes of Big Data [11,12]. In particular, the retrieval latency increases significantly with the dataset size, while redundant or overlapping data retrieval often degrade the response quality and lead to inefficient use of computational resources [7,13]. Furthermore, traditional RAG systems often rely on flat and unstructured data representations, which fail to capture the nuanced semantic relationships among prompt elements. This structural limitation hinders the system’s ability to preserve user intent and contextual coherence during retrieval. As a result, these systems struggle to maintain both efficiency and relevance under the demands of high-volume real-time applications.

This paper introduces a novel structured RAG framework specifically designed for large-scale Big Data analytics. The framework transforms unstructured partial prompts into semantically coherent and clearly structured formats, thereby enhancing the accuracy, context-preservation, and scalability of RAG processes. Our approach integrates element-specific embedding models combined with dimensionality reduction techniques, such as principal component analysis (PCA) [14], to improve retrieval precision and computational efficiency. In addition, we introduce a multi-level filtering mechanism that integrates element-based semantic constraints and redundancy elimination, ensuring higher precision, reduced latency, and improved response diversity. The final set of partial prompts by the proposed method is then merged to form complete prompts for generation.

2. Related Work

RAG has recently attracted significant attention as an effective method for improving the capabilities of LLMs. The seminal work by Lewis et al. [9] introduces RAG, which combines dense retrieval methods and transformer-based generative models, significantly improving both factual accuracy and contextual relevance. Subsequent studies, such as REALM [10], further refine dense passage retrieval techniques, enhancing retrieval precision.

To address scalability and efficiency, Izacard and Grave introduced Fusion-in-Decoder [11], a method that integrates multiple retrieved passages at the decoder stage. Atlas [12] introduces few-shot fine-tuning techniques for knowledge-intensive tasks, while adaptive RAG and apeculative RAG frameworks [7,13] utilize retrieval step modulation and draft generation to reduce latency. A comprehensive survey [15] outlines key architectural variations and enhancement strategies. Despite these advancements, challenges such as factual inconsistency and hallucinations persist in open-domain systems.

More recently, RAG has been enhanced via LoRA-based techniques, which improve parameter efficiency and adaptability to new tasks. For example, Choi et al. [16] proposed a LoRA-enhanced RAG framework that achieved improved accuracy and latency balance. While LoRA has primarily been applied in vision-language models [17,18], its integration into retrieval-based natural language processing systems suggests promising directions for future scalability.

Prompt engineering techniques have further enhanced LLM interpretability and control [19]. Surveys on structured prompts [20], PromptBench [21], and reasoning-specific methods such as chain-of-thought prompting [22] and zero-shot CoT [20] underscore the effectiveness of well-written prompts.

In terms of vector retrieval, the quality of the embedding remains crucial. Massive benchmarks like MTEB [23] and domain-specific embedding evaluations [24] have demonstrated significant variance across models. Dense passage retrieval [25] remains a strong baseline, and PCA-based reduction [14] has proven effective in improving computational efficiency without compromising accuracy.

Multi-stage filtering approaches such as IMRRF [26] and ChunkRAG [27] have been proposed to further refine output quality, enabling redundancy-aware retrieval pipelines that are better suited for large-scale applications.

Finally, the versatility of RAG and LLM applications is evident across multiple domains. Smart manufacturing workflows leverage LLMs for diagnostics and maintenance optimization [21,28]. The finance sector utilizes models like FinGPT for real-time document analysis [29,30], while healthcare research increasingly adopts LLMs and RAG for clinical question answering and medical decision support [31,32].

3. Structured Prompting and Semantic Filtering Framework

3.1. Overview

When users create prompts, they often face difficulties in writing those prompts. Our system suggests keywords for users by utilizing RAG to deduce recommendable keywords. RAG enhances the factual consistency of LLM outputs by incorporating relevant external information. However, traditional RAG systems frequently encounter challenges such as contextual ambiguity, redundancy, and inefficient processing, especially when handling unstructured text in large-scale applications. To address these limitations, we propose a structured and semantic filtering framework to enhance the retrieval precision, semantic coherence, and output diversity. As illustrated in Figure 1, the proposed framework is designed to systematically address these challenges through a multi-step process.

The proposed framework comprises four steps, as follows:

Step 1. Structured Prompt
User inputs from the interface are transformed into a structured partial prompt by applying a predefined semantic schema, which preserves contextual relationships and ensures semantic consistency. The structured partial prompt consists of categories and sections, where each section contains keywords selected via the user interface. For instance, in healthcare Big Data, structuring with ‘Speaker: Doctor’ and ‘Required: Symptom Analysis’ semantically reconstructs unstructured data, reducing the retrieval ambiguity by 20–30%. This is evidenced by improved silhouette scores in experiments, enhancing prompt consistency and ensuring retrieval quality in large-scale datasets.
Step 2. Element-specific Embedding and PCA
The elements of the structured partial prompt are embedded individually, after which PCA is applied to reduce the dimensionality of the embeddings while preserving their semantic features. Reducing high-dimensional embeddings (e.g., 1024 dimensions) to three dimensions increases computational efficiency while maintaining semantic features, removing noise in Big Data clustering to enhance retrieval accuracy.
Step 3. Embedding Filtering
A multi-level filtering mechanism is applied to the reduced embeddings considering the semantic constraints of the elements. Category filtering followed by Top-k similarity filtering removes irrelevant candidates from large-scale datasets, thereby strengthening the semantic consistency of the partial prompt.
Step 4. Value Filtering
Redundant embeddings that are semantically similar or identical to those of the elements of the structured prompt are excluded to enhance the response diversity and reduce the computational overhead. Excluding semantic duplicates (e.g., repeated keywords) from Top-k candidates increases the response diversity, promoting the generation of novel insights in Big Data retrieval.

The recommended keywords are obtained through value-based filtering. Users can modify section values via the interface, considering the recommended keywords. When the structured partial prompt is finalized, it is combined into complete prompts.

3.2. Structured Prompt Construction

Traditional RAG systems frequently rely on unstructured text input, which hinders contextual preservation and systematic filtering. To address this, we introduce a structured prompt format comprising a category, sub-category, and six predefined semantic sections. The category and sub-category specify the domain of the structured prompt. The semantic sections are Speaker, Listener, Instruction, Format, Required, and Excluded. This structure ensures consistent interpretation and facilitates more precise embedding and retrievals. Each section captures a distinct aspect of a prompt. For instance, Speaker defines an information source, while the Listener specifies an intended recipient. The Instruction describes a task, and the Format sets the expected response format. The Required and Excluded sections represent key content constraints for inclusion or exclusion, respectively. As shown in Algorithm 1, the construction of a structured prompt follows a three-step algorithm:

Algorithm 1 Structured Prompt Construction Algorithm

Require:: Unstructured user input
Ensure:: Structured prompt in JSON-like format
1:: Step 1: Category Identification
2:: Category ← classifyDomain(input)
3:: SubCategory ← refineSubdomain(input)
4:: Step 2: Section Mapping
5:: for each relevant component c in input do
6:: Identify the section type for c (e.g., Speaker, Listener, Instruction, etc.)
7:: Map c to the appropriate section using rules or models
8:: end for
9:: Step 3: JSON Structuring
10:: Construct a structured prompt with identified sections:
11:: StructuredPrompt←{“Category”: Category, “Sub-category”: SubCategory, “Speaker”: Speaker, …}
12:: return StructuredPrompt

3.3. Element-Specific Embedding and PCA

A structured prompt is embedded through an element-specific embedding pipeline. Each element in the structured prompt is embedded utilizing one of the semantic similarity models, such as e5-large-v2, to produce high-dimensional vectors (e.g., 1024 dimensions) that encapsulate the semantic content of each component.

This pipeline, which embeds each element of the structured prompt individually, independently captures the semantic meanings of sections such as Speaker and Listener using element-specific embedding models (e.g., e5-large-v2 ), thereby enabling fine-grained semantic representation of the entire prompt. Unlike the flat embedding approach of traditional RAG systems, this significantly enhances the retrieval accuracy and response diversity, while contributing to reduced redundancy and improved computational efficiency in Big Data environments. For example, embedding ‘Speaker: Teacher’ and ‘Required: Addition’ separately enables more precise searches in educational domain Big Data, with experimental results demonstrating a 32.3% improvement in the silhouette score.

To optimize the computational efficiency, we apply PCA to project high-dimensional embeddings into a lower-dimensional space with three dimensions. PCA is a linear and deterministic method that preserves the global semantic structure with minimal distortion, enabling fast and stable similarity computations. The results are consistent across runs, making PCA well-suited for semantic filtering and retrieval. As summarized in Table 1, PCA also compares favorably to other dimensionality reduction techniques such as t-SNE and UMAP in terms of computational efficiency, reproducibility, and interpretability.

We selected PCA for dimensionality reduction because it efficiently preserves the global variance structure and provides interpretable linear embeddings, which are suitable for downstream semantic filtering. In contrast, nonlinear methods such as t-SNE and UMAP are primarily optimized for visualization, focusing on local neighborhood preservation at the expense of global data structure, and they are less interpretable and less stable for subsequent analyses.

Classify each element in a structured prompt by its semantic meaning.
Generate high-dimensional vectors utilizing element-specific embedding models.
Apply PCA to reduce the dimensionality of the vectors while maintaining semantic meaning.

3.4. Embedding Filtering

To narrow the search space and enhance the semantic relevance, we employ a two-step filtering process on the embedding space prior to prompt retrieval. This approach ensures efficient selection of candidates in PCA vector embedding, which aligns with the user’s intent.

Category Filtering: Candidates in PCA vector embedding that do not match the category and the sub-category specified in a structured prompt are first removed from the embedding pool. This coarse filter ensures that only contextually relevant prompts remain, refining the initial search pool to conserve computational resources and exclude context-irrelevant data.
Top-k Based Filtering: From the category-filtered candidates, the top-k candidates are selected based on Euclidean similarity within the PCA-reduced vector space of the structured prompt. This step ensures that the most semantically relevant candidates are retained.

This two-stage embedding filtering pipeline effectively balances contextual alignment and computational efficiency, outperforming conventional RAG systems that rely solely on flat similarity rankings.

3.5. Value Filtering

Although the top-k embedding filtering narrows down semantically relevant candidates, redundant or near-duplicate entries may remain. To address this, we introduce value filtering.

Value Filtering: Each Top-k candidate is compared with a structured prompt. Candidates that are semantically or numerically equivalent to the input are excluded from the final output.

This final filtering ensures that the recommendations are relevant and diverse, thereby enhancing practical utility and novelty of the output. As a result, it contributes to reducing the retrieval latency of the overall framework and improving the quality of the clustering. This filtering mechanism, unlike the flat retrieval approach of traditional RAG systems, applies fine-grained semantic constraints, functioning as a powerful tool to support domain-specific decision-making.

3.6. Domain-Specific Decision Support

The proposed framework integrates structured prompting, element-specific embedding, and multi-level filtering to effectively support domain-specific decision-making in large-scale Big Data environments. In diverse domains such as healthcare, finance, and smart manufacturing, prompts are systematically classified into “Category” and “Sub-category,” enabling tailored retrieval of domain-relevant knowledge aligned with user intent and contextual constraints.

For instance, in healthcare applications, sections such as “Speaker: Doctor” and “Required: Symptom Analysis” prioritize domain-specific data like medical records and BioBERT-enhanced embeddings, thereby reducing retrieval ambiguity and facilitating real-time clinical decision support. In finance, the framework leverages specialized models such as BloombergGPT to filter and retrieve market-specific insights, minimizing redundancy and supporting data-driven decision-making under time-sensitive conditions.

Furthermore, the combination of PCA-based dimensionality reduction and semantic constraints in multi-stage filtering effectively excludes irrelevant or excluded elements, producing diverse and high-quality output. The experimental results demonstrate a 32.3% improvement in the silhouette score over traditional RAG, indicating superior clustering quality and semantic cohesion. This structured approach thus enables more precise and adaptive decision-making in complex domain-specific Big Data scenarios.

3.7. Generalizability to Summarization, QA, and Multilingual Tasks

Beyond domain-specific decision support, the proposed structured prompt-based RAG framework exhibits strong potential for generalization across diverse natural language processing tasks, including summarization, question answering (QA), and multilingual scenarios.

Summarization: By systematically structuring input prompts and applying element-specific embedding with multi-level semantic filtering, the framework can extract and compress information into concise contextually coherent summaries. Dimensionality reduction techniques, such as PCA, enhance the efficiency and relevance of retrieved content, supporting robust performance in automatic summarization tasks.
Question Answering (QA): The modular design, which features explicit semantic sections like “Speaker,” “Listener,” and “Instruction,” improves the retrieval and synthesis of relevant responses, even in complex QA scenarios. Category and subcategory classification enables fine-grained control over query scope, boosting factual accuracy, and reducing ambiguity.
Multilingual Extension: With support for domain-specific embedding models and a structured categorization process, the framework is readily extendable to multilingual settings. Integration with multilingual embedding models and conditional filtering protocols ensures that knowledge retrieval and generation remain accurate and context-appropriate across languages, facilitating global application in Big Data environments.

The experimental evidence suggests that the combination of these architectural components enables the framework to be flexibly adapted for various natural language processing tasks beyond decision support—while consistently maintaining precision, diversity, and scalability.

4. Experiments

In this work, we conducted a comprehensive evaluation of our proposed structured prompting and semantic filtering RAG system against a traditional RAG system, focusing on Top-k retrieval diversity. The experimental procedure comprised two phases: (1) Top-k candidate quality evaluation and (2) final output quality assessment by LLM responses. The purpose of the experiments is to demonstrate the limitations of traditional RAG compared to the proposed method, including poor contextual coherence and inadequate adaptation to diverse user prompts with varying structural complexity.

4.1. Experimental Environment and Dataset

The evaluation used lightweight transformer models to generate semantic embeddings, following our structured prompt process. The dataset construction followed our structured-based processing approach, systematically parsing and organizing the prompts into six predefined semantic sections. This structured formatting ensured consistent data representation and improved the reliability of embedding generation, unlike traditional RAG systems that process raw text without semantic segmentation.

4.2. Evaluation Metric

We used two key metrics to evaluate our framework: diversity and clustering quality.

Diversity was assessed by measuring the proportion of unique values within each section. For section i, the diversity score

d_{i}

is defined as the number of unique values divided by the total number of values in that section. The overall diversity d is then calculated as the average in all sections:

d = \frac{1}{| S |} \sum_{i \in S} d_{i},

(1)

where S is the set of sections. This metric addresses redundancy by quantifying how our deduplication mechanism at the embedding level increases the diversity and originality of recommendations compared to traditional RAG systems.

Clustering quality was evaluated using the silhouette score, which quantifies both the cohesion within clusters and the separation between clusters. The silhouette score ranges from

- 1

to

+ 1

, with higher values indicating more distinct and well-defined clusters. Using this metric, we demonstrated that the structured prompt and semantic filtering RAG framework produces retrieval clusters that are more consistent and clearly separated than those generated by conventional RAG systems.

4.3. Results

Table 2 presents the structured prompt samples utilized for experiments. Three inputs were selected to ensure sufficient semantic diversity in the evaluation.

Table 3 presents the top three retrieved outputs utilizing the proposed method. Table 4 presents retrieval results by the traditional structured-format RAG system for comparison.

Table 5 presents the LLM-generated responses by the outputs in Table 3. Table 6 presents the LLM results based on the traditional structured-format RAG outputs in Table 4.

Both results achieved a perfect diversity index score of 1.0, indicating non-redundant and varied retrieval. In addition, a silhouette score analysis was conducted to measure the clustering quality of the retrieved documents, utilizing Euclidean distance measurements between embedding vectors. As summarized in Table 7, the silhouette scores of the two experiments differed, reflecting differences in semantic cohesion. The proposed method yielded higher silhouette scores, suggesting stronger intra-cluster similarity and inter-cluster separation due to holistic semantic representation via normalized vector similarity.

In contrast, the traditional RAG system based on independent retrieval steps often prioritized term frequency or local keyword overlap, thus reducing semantic uniformity within clusters. Furthermore, the final LLM response clusters also exhibited clear differences in quality, as shown in Table 8.

5. Conclusions

This paper proposed a structured RAG framework, a structured and efficient alternative to traditional RAG systems, that addresses key challenges in prompt diversity, contextual coherence, and response redundancy. The proposed method systematically decomposed each partial prompt into six explicit semantic sections: speaker, listener, instruction, format, required, and excluded, thus preserving user intent and constraints with high fidelity. We independently embedded each component and integrated the resulting embedding vectors using dimensionality reduction techniques, such as PCA, to ensure both semantic distinctiveness and computational efficiency.

Our method introduced a multi-level condition-aware filtering mechanism that initially aligns candidates at the high-level category and subsequently verifies fine-grained constraints. This two-level process enables and accurate retrieval, as well as robust adaptation to diverse and instruction-heavy prompts. Furthermore, we implemented embedding-level deduplication to minimize redundancy and enhance the novelty of generated responses.

The experimental results demonstrated that our structured RAG framework outperforms conventional RAG systems in both clustering quality, as measured by silhouette scores, and information, as indicated by section-wise diversity indices. Our structured RAG framework demonstrated a 32.3% improvement in silhouette scores. Although both systems recorded perfect diversity scores, our structured RAG framework consistently yielded more homogeneous and semantically cohesive clusters, attributable to its structured representation and integrated similarity filtering.

In summary, the proposed RAG framework offers a principled and practical solution to the limitations of traditional RAG pipelines by leveraging structured prompts, multi-level semantic embedding, and conditional retrieval. This approach significantly enhances semantic precision, contextual alignment, and computational efficiency.

Limitations and Future Work

In this paper, we have proposed a structured prompt–based RAG framework for large-scale Big Data analytics and demonstrated its superiority over conventional RAG systems in terms of accuracy, efficiency, and diversity. By decomposing each user query into six distinct semantic elements—Speaker, Listener, Instruction, Format, Required, and Excluded—and applying PCA for dimensionality reduction, our method achieved a 32.3% improvement in the silhouette score while preserving the computational scalability.

However, our study encompasses three principal limitations:

Insufficient Nonlinear Semantic Preservation. PCA, as a linear projection technique, retains only dominant variance components of high-dimensional embeddings, potentially discarding nuanced nonlinear relationships that capture domain-specific semantics.
Absence of Dynamic Self-Improvement. Retrieval ranking and prompt structures are fixed (top-k), precluding reinforcement learning–based fine-tuning (e.g., RLHF) that could adapt the system to evolving user feedback and quality metrics.

Furthermore, beyond the directions previously outlined, we propose the following advances to further enhance the structured prompt–based RAG framework:

Development of Fine-grained Semantic Segmentation and Composition Algorithms: We introduce an adaptive schema that automatically extracts and classifies variable granular semantic elements beyond the six core categories (Speaker, Listener, Instruction, Format, Required, and Excluded) based on domain or task context. The dynamic design of composition rules, either rule-based or learning-based, governs the interaction among the elements of the prompt, thus improving the precision and contextual relevance in prompt interpretation.
Enhancement of Multi-conditional and Hierarchical Filtering: Building upon the existing multi-level filtering pipeline—which sequentially applies category filtering, Top-k selection, and redundancy elimination—we propose integrating additional evaluation criteria such as conditional value filtering and goal-oriented filtering (e.g., reliability, source diversity). We further suggest implementing an adaptive parameter tuning module that dynamically adjusts thresholds and weights at each filtering stage, ensuring robust consistency and accuracy even for complex queries.

By pursuing these avenues, the structured RAG framework can achieve enhanced precision, robustness, and scalability, thereby significantly improving its applicability in real-world industrial environments.

Author Contributions

Conceptualization, D.K.; Methodology, D.K.; Software, M.P.; Validation, M.P.; Investigation, J.L. and J.J.; Visualization, J.L.; Writing—review and editing, I.L.; Project administration, I.L.; Supervision, Y.S.; Writing—review and editing, Y.S.; Conceptualization, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the Artificial Intelligence Convergence Innovation Human Resources Development (IITP-2025-RS-2023-00254592) grant funded by the Korea government (MSIT).

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Clissa, L.; Lassnig, M.; Rinaldi, L. How big is Big Data? A comprehensive survey of data production, storage, and streaming in science and industry. Front. Big Data 2023, 6, 1271639. [Google Scholar] [CrossRef] [PubMed]
Batko, K.; Ślęzak, A. The use of Big Data Analytics in healthcare. J. Big Data 2022, 9, 3. [Google Scholar] [CrossRef] [PubMed]
Schneider, J.; Meske, C.; Kuss, T. Foundation Models. Bus. Inf. Syst. Eng. 2024, 66, 3–11. [Google Scholar] [CrossRef]
Suresh, H.; Tseng, E.; Young, M.; Gray, M.; Pierson, E.; Levy, K. Participation in the Age of Foundation Models. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’24), Rio de Janeiro, Brazil, 3–6 June 2024; pp. 1–12. [Google Scholar]
Bhatia, G.; Nagoudi, E.M.B.; Cavusoglu, H.; Abdul-Mageed, M. FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 13064–13087. [Google Scholar]
Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef] [PubMed]
Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
Guu, K.; Lee, K.; Tung, Z.; Pasupat, P.; Chang, M. Retrieval augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), PMLR, Virtual Event, 13–18 July 2020; pp. 3929–3938. [Google Scholar]
Knollmeyer, S.; Caymazer, O.; Koval, L.; Akmal, M.U.; Asif, S.; Mathias, S.G.; Großmann, D. Benchmarking of Retrieval Augmented Generation: A Comprehensive Systematic Literature Review on Evaluation Dimensions, Evaluation Metrics and Datasets. In Proceedings of the 16th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2024), Volume 3: KMIS, Porto, Portugal, 17–19 November 2024; pp. 137–148. [Google Scholar]
Ye, Q.; Beltagy, I.; Peters, M.E.; Ren, X.; Hajishirzi, H. FiD-ICL: A fusion-in-decoder approach for efficient in-context learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 8158–8185. [Google Scholar]
Izacard, G.; Lewis, P.; Lomeli, M.; Hosseini, L.; Petroni, F.; Schick, T.; Dwivedi-Yu, J.; Joulin, A.; Riedel, S.; Grave, E. Atlas: Few-shot learning with retrieval augmented language models. J. Mach. Learn. Res. 2023, 24, 1–43. [Google Scholar]
Jiang, W.; Subramanian, S.; Graves, C.; Alonso, G.; Yazdanbakhsh, A.; Dadu, V. RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving. In Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA ’25), Tokyo, Japan, 21–25 June 2025; ACM: New York, NY, USA; pp. 1–16. [Google Scholar] [CrossRef]
Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 1–38. [Google Scholar] [CrossRef]
Peng, B.; Chersoni, E.; Hsu, Y.-Y.; Huang, C.-R. Is Domain Adaptation Worth Your Investment? Comparing BERT and FinBERT on Financial Tasks. In Proceedings of the Third Workshop on Economics and Natural Language Processing (ECONLP 2021), Punta Cana, Dominican Republic (and Online), 11 November 2021; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA; pp. 37–44. [Google Scholar]
Wang, J.; Li, X.; Shi, M.; Hu, R.; Liu, J.; Yang, X.; Liu, L. Review of large vision models and visual prompt engineering. Comput. Electr. Eng. 2023, 112, 108257. [Google Scholar] [CrossRef]
Choi, Y.; Kim, S.; Bassole, Y.C.F.; Sung, Y. Enhanced Retrieval-Augmented Generation Using Low-Rank Adaptation. Appl. Sci. 2025, 15, 4425. [Google Scholar] [CrossRef]
Choi, D.; Im, J.; Sung, Y. LoRA Fusion: Enhancing Image Generation. Mathematics 2024, 12, 3474. [Google Scholar] [CrossRef]
Cho, M.; Kim, S.; Choi, D.; Sung, Y. Enhanced BLIP-2 Optimization Using LoRA for Generating Dashcam Captions. Appl. Sci. 2025, 15, 3712. [Google Scholar] [CrossRef]
Zhu, K.; Zhao, Q.; Chen, H.; Wang, J.; Xie, X. Promptbench: A unified library for evaluation of large language models. J. Mach. Learn. Res. 2024, 25, 1–22. [Google Scholar]
Singh, I.S.; Aggarwal, R.; Allahverdiyev, I.; Taha, M.; Akalin, A.; Zhu, K.; O’Brien, S. ChunkRAG: A Novel LLM-Chunk Filtering Method for RAG Systems. In Proceedings of the ICLR 2025 Workshop on Building Trust in Machine Learning, Singapore, 27–28 April 2025. [Google Scholar]
Lim, J.; Vogel-Heuser, B.; Kovalenko, I. Large Language Models (LLMs) for Smart Manufacturing and Industry X.0. In Artificial Intelligence for Smart Manufacturing and Industry X.0; Springer Series in Advanced Manufacturing; Springer Nature: Cham, Switzerland, 2025; Volume Part F138, pp. 97–119. [Google Scholar]
Li, D.; Li, F.; Song, B.; Tang, L.; Zhou, W. IMRRF: Integrating Multi-Source Retrieval and Redundancy Filtering for LLM-based Fake News Detection. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2025) (Volume 1: Long Papers), Albuquerque, NM, USA, 29 April–4 May 2025; pp. 9127–9142. [Google Scholar]
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large language models are zero-shot reasoners. Adv. Neural Inf. Process. Syst. 2022, 35, 22199–22213. [Google Scholar]
Muennighoff, N.; Tazi, N.; Magne, L.; Reimers, N. MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, 2–6 May 2023; pp. 2014–2037. [Google Scholar]
Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.S.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.-T. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6769–6781. [Google Scholar]
Zhang, G.; Zhou, Y.; Bollegala, D. Evaluating Unsupervised Dimensionality Reduction Methods for Pretrained Sentence Embeddings. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Turin, Italy, 20–25 May 2024; pp. 6530–6543. [Google Scholar]
Bei, Y.; Fang, Z.; Mao, S.; Yu, S.; Jiang, Y.; Tong, Y.; Cai, W. Manufacturing domain QA with integrated term enhanced RAG. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–8. [Google Scholar]
Li, Y.; Wang, S.; Ding, H.; Chen, H. Large language models in finance: A survey. In Proceedings of the Fourth ACM International Conference on AI in Finance (ICAIF ’23), Brooklyn, NY, USA, 27–29 November 2023; pp. 374–382. [Google Scholar]
Huang, A.H.; Wang, H.; Yang, Y. FinBERT: A large language model for extracting information from financial text. Contemp. Account. Res. 2023, 40, 806–841. [Google Scholar] [CrossRef]
Nazi, Z.A.; Peng, W. Large language models in healthcare and medical domain: A review. Informatics 2024, 11, 57. [Google Scholar] [CrossRef]
Yang, Q.; Zuo, H.; Su, R.; Su, H.; Zeng, T.; Zhou, H.; Wang, R.; Chen, J.; Lin, Y.; Chen, Z.; et al. Dual retrieving and ranking medical large language model with retrieval augmented generation. Sci. Rep. 2025, 15, 18062. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Overview of proposed structured RAG framework. User inputs are transformed into a structured prompt utilizing predefined semantic categories. This is followed by category-specific embedding and dimensionality reduction via PCA. Semantic filtering is then applied to select contextually relevant candidates, while redundancy removal at the embedding level ensures output diversity.

Table 1. Comparison of dimensionality reduction methods: PCA, t-SNE, and UMAP.

Criteria	PCA	t-SNE	UMAP
Approach	Linear	Nonlinear	Nonlinear
Global structure preservation	Good	Poor	Moderate
Local structure preservation	Limited	Excellent	Excellent
Interpretability	High	Low	Low
Computational efficiency	High	Low	Moderate
Reproducibility	High	Low	Moderate
Embedding suitability for downstream tasks	Good	Limited	Limited
Inverse transformation possible	Yes	No	No

Table 2. 3 Structured prompts.

Category	Input #1	Input #2	Input #3
Speaker	Professor	Curriculum Developer	Educational Psychologist
Listener	Graduate Student	Teachers	School Counselor
Instruction	Thesis Supervision	New Course Implementation	Student Assessment Interpretation
Form	Regular Meeting	Workshop	Consultation Report
Required	Originality	Learning Objectives	Actionable Insights
Excluded	Plagiarism	Irrelevant Content	Misinterpretation

Table 3. Embedding filtering by the proposed method (top three values).

Category	Value #1	Value #2	Value #3
Speaker	Education Expert	Education Researcher	Curriculum Developer
Listener	Student	Parents	Academia
Instruction	Lecture Preparation	Education Program	Education Policy Analysis
Form	Lecture Notes	Proposal	Research Paper
Required	Comprehension	Participation	Objectivity
Excluded	Complex Terminology	Unnecessary Content	Bias

Table 4. Embedding filtering by the structured-format traditional RAG system (top three values).

Category	Value #1	Value #2	Value #3
Speaker	Professor	Education Expert	Researcher
Listener	Graduate Student	Prospective Students	Student
Instruction	Learning Guidance	Proofread Manuscript	Provide Training
Form	Seminar	One-on-one Session	Panel Discussion
Excluded	Grammar Errors	Profanity	Over-editing
Required	Originality	Creativity	Literary Quality

Table 5. LLM output of input #1 by the proposed method (top three recommendations).

Category	Recommend #1	Recommend #2	Recommend #3
Speaker	Educator	Researcher	Developer
Listener	Learner	Guardians	Academics
Instruction	Preparation	Program	Analysis
Form	Notes	Proposal	Paper
Excluded	Jargon	Irrelevance	Bias
Required	Understanding	Engagement	Neutrality

Table 6. LLM output of input #1 by the structured-format traditional RAG system (top three recommendations).

Category	Recommend #1	Recommend #2	Recommend #3
Speaker	Mentor	Facilitator	Advisor
Listener	Learner	Participant	Attendee
Instruction	Teach	Guide	Instruct
Form	Workshop	Lecture	Tutorial
Excluded	Bias	Plagiarism	Jargon
Required	Clarity	Engagement	Accuracy

Table 7. Silhouette scores for embedding clusters of structured prompts.

	Our Method		Traditional RAG
Seq	Diversity	Silhouette Score	Diversity	Silhouette Score
1	1	0.394	1	0.261
2	1	0.332	1	0.308
3	1	0.348	1	0.219
4	1	0.347	1	0.258
5	1	0.348	1	0.291

Table 8. Silhouette scores for final LLM response clusters.

	Our Method		Traditional RAG
Seq	Diversity	Silhouette Score	Diversity	Silhouette Score
1	1	0.416	1	0.292
2	1	0.442	1	0.322
3	1	0.498	1	0.272
4	1	0.480	1	0.288
5	1	0.431	1	0.293

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Kim, D.; Park, M.; Lee, J.; Lee, I.; Jin, J.; Sung, Y. Enhanced Semantic Retrieval with Structured Prompt and Dimensionality Reduction for Big Data. Mathematics 2025, 13, 2469. https://doi.org/10.3390/math13152469

AMA Style

Kim D, Park M, Lee J, Lee I, Jin J, Sung Y. Enhanced Semantic Retrieval with Structured Prompt and Dimensionality Reduction for Big Data. Mathematics. 2025; 13(15):2469. https://doi.org/10.3390/math13152469

Chicago/Turabian Style

Kim, Donghyeon, Minki Park, Jungsun Lee, Inho Lee, Jeonghyeon Jin, and Yunsick Sung. 2025. "Enhanced Semantic Retrieval with Structured Prompt and Dimensionality Reduction for Big Data" Mathematics 13, no. 15: 2469. https://doi.org/10.3390/math13152469

APA Style

Kim, D., Park, M., Lee, J., Lee, I., Jin, J., & Sung, Y. (2025). Enhanced Semantic Retrieval with Structured Prompt and Dimensionality Reduction for Big Data. Mathematics, 13(15), 2469. https://doi.org/10.3390/math13152469

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Enhanced Semantic Retrieval with Structured Prompt and Dimensionality Reduction for Big Data

Abstract

1. Introduction

2. Related Work

3. Structured Prompting and Semantic Filtering Framework

3.1. Overview

3.2. Structured Prompt Construction

3.3. Element-Specific Embedding and PCA

3.4. Embedding Filtering

3.5. Value Filtering

3.6. Domain-Specific Decision Support

3.7. Generalizability to Summarization, QA, and Multilingual Tasks

4. Experiments

4.1. Experimental Environment and Dataset

4.2. Evaluation Metric

4.3. Results

5. Conclusions

Limitations and Future Work

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI