Article

Secure Multifaceted-RAG: Hybrid Knowledge Retrieval with Security Filtering

1 Computer Science and Informatics, Atlanta Campus, Emory University, Atlanta, GA 30322, USA
2 Hyundai Motor Company, Seoul 06797, Republic of Korea
* Authors to whom correspondence should be addressed.
Information 2025, 16(9), 804; https://doi.org/10.3390/info16090804
Submission received: 22 August 2025 / Revised: 9 September 2025 / Accepted: 12 September 2025 / Published: 16 September 2025

Abstract

Existing Retrieval-Augmented Generation (RAG) systems face challenges in enterprise settings due to limited retrieval scope and data security risks. When relevant internal documents are unavailable, the system struggles to generate accurate and complete responses. Additionally, using closed-source Large Language Models (LLMs) raises concerns about exposing proprietary information. To address these issues, we propose the Secure Multifaceted-RAG (SecMulti-RAG) framework, which retrieves not only from internal documents but also from two supplementary sources: pre-generated expert knowledge for anticipated queries and on-demand external LLM-generated knowledge. To mitigate security risks, we adopt a local open-source generator and selectively utilize external LLMs only when prompts are deemed safe by a filtering mechanism. This approach enhances completeness, prevents data leakage, and reduces costs. In our evaluation on a report generation task in the automotive industry, SecMulti-RAG significantly outperforms traditional RAG—achieving 79.3–91.9% win rates across correctness, richness, and helpfulness in LLM-based evaluation and 56.3–70.4% in human evaluation. This highlights SecMulti-RAG as a practical and secure solution for enterprise RAG.

1. Introduction

Retrieval-Augmented Generation (RAG) [1] has become a powerful tool for AI-driven content generation. However, existing RAG frameworks face significant limitations in enterprise applications. Traditional RAG systems rely heavily on internal document retrieval, which can lead to incomplete or inaccurate responses when relevant information is missing. Moreover, leveraging external Large Language Models (LLMs) like GPT [2], Claude [3], or DeepSeek [4] introduces security risks and high operational costs, making them less viable for enterprise deployment.
To address these challenges, we introduce the Secure Multifaceted-RAG (SecMulti-RAG) framework, which optimizes information retrieval, security, and cost efficiency. Our approach integrates three distinct sources: (1) a dynamically updated Enterprise Knowledge Base, (2) pre-written expert knowledge for anticipated queries, and (3) on-demand external knowledge, selectively retrieved when the user prompt is safe. For security, we introduce a filtering mechanism that ensures proprietary corporate data is not sent to external models. Furthermore, instead of relying on powerful closed-source LLMs, we use a local open-source model as the primary generator, selectively invoking external models only when user prompts are non-sensitive.
In this paper, we apply SecMulti-RAG to the Korean automotive industry, particularly for a report generation task. Our fine-tuned filter, retriever, and generation models show strong performance, ensuring the reliability of our approach. On the report generation task, our method outperforms the traditional RAG approach in correctness, richness, and helpfulness, as evaluated by both humans and LLMs. We also present adaptable strategies to meet specific deployment needs, emphasizing the framework's flexibility, and we aim to deploy the system in real-world enterprise environments. Our key contributions are as follows:
  • A multi-source RAG framework combining internal knowledge, pre-written expert knowledge, and external LLMs to enhance response completeness.
  • A confidentiality-preserving filter to prevent exposure of sensitive corporate data to external LLMs.
  • A cost-efficient approach that leverages high-quality retrieval to compensate for smaller local generation models.

2. Related Work

2.1. Enhancing Retrieval-Augmented Generation

Many efforts have been made to enhance RAG systems [5,6]. Jeong et al. [7] classify user prompts based on complexity to determine the optimal retrieval strategy, making their approach relevant to our filtering mechanism for selecting retrieval sources. Meanwhile, Yu et al. [8] and Wu et al. [9] replace traditional document retrieval with generative models. In particular, Wu et al. [9] propose a multi-source RAG (MSRAG) framework that integrates GPT-3.5 with web-based search, making it closely related to our work. In contrast, our approach retains internal document retrieval while integrating pre-generated expert knowledge and external LLMs.

2.2. Security Risks in LLM

As generative models are widely used, concerns about security and privacy risks continue to grow. Many studies have explored methods for detecting and mitigating the leakage of sensitive information [10,11,12,13]. For example, Chong et al. [14] present a prompt sanitization technique that enhances user privacy by identifying and removing sensitive information from user inputs before they are processed by LLM services. Zhou et al. [15] present MixPi, a privacy-preserving framework that obfuscates queries before they are sent to external LLMs, ensuring that no sensitive corporate data is exposed. Our study incorporates a user-prompt-filtering mechanism, ensuring a more secure retrieval process.
Compared to prior works that primarily combine internal retrieval with web search and closed-source LLM augmentation, our design explicitly triangulates three sources with distinct roles: (i) enterprise-internal documents for provenance-grounded facts, (ii) pre-written expert knowledge to mitigate missing-document gaps in enterprise settings, and (iii) on-demand external knowledge gated by a confidentiality filter. This separation allows us to decouple completeness (via expert and external knowledge) from confidentiality (via filtering and a local generator), while retaining conventional internal retrieval.

3. Method

As shown in Figure 1, our framework consists of three core components: multi-source retrieval, a confidentiality-preserving filtering mechanism, and local model adaptation.
(1) Multi-Source Retrieval
Unlike conventional RAG frameworks that rely solely on internal structured document chunks, our system retrieves information from three distinct sources: (1) internal corporate documents, (2) pre-generated high-quality answers to anticipated queries, and (3) real-time external knowledge generated by closed-source LLMs. This multi-source retrieval strategy improves response completeness and accuracy, especially when internal documents are insufficient.
(2) Confidentiality-Preserving Filtering Mechanism
To mitigate the risk of unintended data leakage when interacting with external closed-source LLMs, we introduce a query filtering mechanism that detects security-sensitive content. If a user query is classified as containing confidential information, external retrieval is skipped, and the system generates responses solely based on internal documents and pre-curated expert knowledge. This mechanism ensures data confidentiality while maintaining retrieval quality (Section 4).
(3) Local Model Adaptation
Since enterprises often encounter security and cost limitations when using powerful closed-source LLMs, we use open-source Qwen-2.5-14B-Instruct [16] as our primary generation model. We fine-tune this model using domain-specific data to better reflect the language and knowledge of the Korean automotive domain (Section 6).
(4) End-to-End RAG Pipeline
The final system consists of a multi-stage pipeline where user queries are processed through filtering, retrieval, and generation. By integrating high-quality retrieval with local model adaptation, our framework shows that a well-optimized retrieval system compensates for the limitations of smaller, locally deployed LLMs, making enterprise RAG both scalable and secure.
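The end-to-end flow can be summarized as a short routing sketch. The helper functions below (classify_sensitivity, generate_external_background, add_to_retrieval_pool, retrieve_top_k, generate_report) are hypothetical placeholders for the components described in Sections 4, 5, and 6, not the authors' actual implementation.

# Minimal sketch of the SecMulti-RAG routing logic (hypothetical helpers).
def answer_query(query: str, k: int = 5) -> str:
    # 1. Confidentiality filter: 0 = security-sensitive, 1 = safe.
    is_safe = classify_sensitivity(query) == 1

    # 2. On-demand external knowledge is produced only for safe queries,
    #    then added to the shared retrieval pool for future reuse.
    if is_safe:
        external_doc = generate_external_background(query)   # e.g., GPT-4o
        add_to_retrieval_pool(external_doc)

    # 3. Multi-source retrieval over internal documents, pre-written
    #    expert knowledge, and any previously indexed external documents.
    context = retrieve_top_k(query, k=k)

    # 4. Local open-source generator (e.g., fine-tuned Qwen-2.5-14B-Instruct).
    return generate_report(query, context)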

4. Confidentiality Filter

4.1. Dataset

Security-sensitive and general (non-sensitive) queries are created by Korean automotive engineers with the assistance of Claude 3.7 Sonnet [3], accessed through a university-internal service built on AWS Bedrock (https://aws.amazon.com/ko/bedrock/, accessed on 5 January 2025). This setup provides secure access to Claude without exposing data to external LLM providers (all uses of Claude 3.7 Sonnet in this study, for data generation and LLM-based evaluation, were accessed exclusively through this secure university-internal service).
The prompts are designed to elicit three types of queries: (1) general queries that do not pose confidentiality risks, (2) security-sensitive queries (easy) containing explicit project names, and (3) security-sensitive queries (hard) that omit project names. Type (3) queries are challenging to classify even for expert engineers due to the absence of clear identifiers. In addition to the queries and the binary labels (sensitive or non-sensitive), brief rationales are generated to explain the reasoning behind each label. Table 1 summarizes data statistics. Figure A1 illustrates the prompt template used to construct the query dataset, and Figure A3 provides the prompt that we use to filter out user queries that should not be exposed to closed-source LLMs.

4.2. Model

For the filter, we fine-tune a lightweight model, Qwen2.5-3B-Instruct [16], to classify safe and unsafe queries. It is fine-tuned for three epochs with a learning rate of 8 × 10⁻⁶, using a batch size of 2 and gradient accumulation of 256 to simulate a large batch. Training is conducted on 3 × RTX A6000 (48 GB).
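A minimal sketch of such a fine-tuning setup with Hugging Face Transformers is shown below; the dataset file, field names, and text formatting are illustrative assumptions, while the hyperparameters follow the values reported above.

# Hedged sketch of the filter fine-tuning setup (not the authors' exact code).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments,
                          DataCollatorForLanguageModeling)

model_name = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical JSONL with fields "query", "label" (0 = sensitive, 1 = safe),
# and "rationale", flattened into a single supervised text sequence.
raw = load_dataset("json", data_files="filter_train.jsonl")["train"]

def format_example(ex):
    text = (f"Query: {ex['query']}\n"
            f"Label: {ex['label']}\nRationale: {ex['rationale']}")
    return tokenizer(text, truncation=True, max_length=1024)

train_ds = raw.map(format_example, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="filter-qwen2.5-3b",
    num_train_epochs=3,
    learning_rate=8e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=256,   # simulates a large effective batch
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()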

4.3. Evaluation

As shown in Table 2, we evaluate the filter model on two subsets: Easy-only (queries with explicit project names) and Easy&Hard (both with and without project names). Security-sensitive queries (class 0) are treated as the positive class, making recall critical in preventing information leakage to external LLMs. For easy test cases, the filter achieves 99.01% recall, indicating strong performance with minimal false negatives. When ambiguous queries are included, accuracy drops to 82.31% and recall to 74.35%, while precision remains high (97.93%), meaning that when it does flag a query as sensitive, it is highly likely to be correct. To estimate the upper bound, we conduct human evaluation on the Easy&Hard set. The expert annotator achieves 80.95% accuracy, 92.99% precision, and 76.44% recall, highlighting the intrinsic ambiguity and difficulty of the task.
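For clarity, the sketch below shows how these metrics are computed with the security-sensitive class (label 0) treated as the positive class, using scikit-learn; the label values are toy examples, not the actual test data.

# Accuracy/precision/recall with the sensitive class (0) as positive.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0, 0, 1, 0, 1, 1, 0, 1]   # toy gold labels: 0 = sensitive, 1 = safe
y_pred = [0, 0, 1, 1, 1, 1, 0, 0]   # toy filter predictions

acc  = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, pos_label=0)  # flagged-as-sensitive correctness
rec  = recall_score(y_true, y_pred, pos_label=0)     # fraction of sensitive queries caught

print(f"Acc {acc:.2%}  Prec {prec:.2%}  Rec {rec:.2%}")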

4.4. Application

In practical deployment, constructing a labeled dataset of safe and unsafe queries for filter training can be labor-intensive. To address this, we propose a progressive deployment strategy for the confidentiality filter. Initially, the system operates without filtering or external retrieval, relying only on internal documents and pre-generated expert knowledge. During this phase, real user queries are collected and later labeled to train a filter model, enabling cost-effective integration over time. Alternatively, the filter can perform query rewriting—flagged queries are transformed into safer versions, allowing secure forwarding to external LLMs. This flexible design supports scalable adaptation to organizational privacy and deployment needs.

5. Retrieval

This paper focuses on generating reports for enterprise-level engineering problems. Our retrieval system is built upon 6165 chunked documents, consisting of 5625 chunks from the Enterprise Knowledge Base and 540 from Pre-written Expert Knowledge. To prevent data leakage, only the training subset of the 675 Pre-written Expert Knowledge documents is included in the retrieval pool; the remaining 135 keyword–report pairs are reserved for the final evaluation of SecMulti-RAG (Section 7). All documents listed in Table 3 are indexed using FAISS [17], and our trained retriever retrieves the top five most relevant documents for each query.
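A hedged sketch of this indexing and top-5 dense retrieval step is given below; the encoder checkpoint, normalization choice, and example chunks are assumptions rather than the exact production configuration.

# Embed chunks with a BGE-M3-style encoder, index with FAISS, retrieve top 5.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-m3")

chunks = ["<internal document chunk>", "<pre-written expert report chunk>",
          "<external LLM-generated background>"]            # 6165 chunks in the paper
emb = encoder.encode(chunks, normalize_embeddings=True)      # cosine via inner product

index = faiss.IndexFlatIP(emb.shape[1])
index.add(np.asarray(emb, dtype="float32"))

query_vec = encoder.encode(["A필러 힌지 크랙 발생"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vec, dtype="float32"), k=5)
top_chunks = [chunks[i] for i in ids[0] if i != -1]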

5.1. Dataset

5.1.1. Enterprise Knowledge Base

For the Enterprise Knowledge Base dataset, we use the dataset introduced by Choi et al. [18]. As shown in Table 4, it consists of test reports, meeting reports, and the textbook Crash Safety of Passenger Vehicles [19], with each document segmented into meaningful units such as slides, chapters, or other relevant sections. QA pairs from the reports and textbook are used to train both the retriever and the generation model.

5.1.2. Pre-Written Expert Knowledge

To construct a high-quality knowledge source, a domain expert in automotive engineering first curates a list of domain-specific keywords, representing expertise-level problems. Using these keywords, the expert then generates pre-written expert knowledge. A total of 675 keyword–report pairs are partitioned into training, validation, and test splits in an 8:1:1 ratio. Below are some examples of the keywords we use to generate pre-written expert knowledge. Reports related to these keywords (main problems) are pre-generated by automotive engineers (see Figure A2 in Appendix A for details):
  • A필러 상부 강성 부족 및 변형 (Insufficient upper stiffness and deformation in A-pillar);
  • A필러 힌지 크랙 발생 (Hinge crack formation in A-pillar);
  • A필러와 대시 연결부 찢어짐 (Tearing at the connection between the A-pillar and the dashboard);
  • A필러 변형으로 인한 윈드실드 파손 (Windshield damage due to A-pillar deformation).

5.1.3. External Knowledge from LLM

Once a user query passes the safety filter, we use GPT-4o (https://platform.openai.com/docs/models/gpt-4o, accessed on 5 January 2025) to provide on-demand external knowledge. Specifically, in this paper, it generates a general-purpose technical background document to assist engineers in drafting formal safety reports. The generated document is then indexed into the document pool for future retrieval. The prompt is illustrated in Figure 2.
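The sketch below illustrates this on-demand generation step with the OpenAI Python client; the system prompt is a paraphrase of Figure 2, and the function is a hypothetical wrapper rather than the production code.

# On-demand external knowledge for queries judged safe (illustrative only).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_external_background(query: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Write a general-purpose technical background document "
                        "to help an automotive engineer draft a formal safety report."},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content

# The returned document would then be embedded and indexed (e.g., via the
# FAISS pipeline above) so that future queries can retrieve it.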

5.2. Retriever

We fine-tune BGE-M3 [20], a multilingual encoder supporting Korean, using QA and keyword–report pairs (Section 5.1.1 and Section 5.1.2). Training is conducted for 10 epochs on 4 × RTX A6000 (48 GB) with publicly available code [20,21,22,23]. We use dense retrieval only and do not apply a reranker; top-k candidates are taken directly from the vector index. For evaluation, we use all splits (training, validation, and test) as the chunk pool to ensure sufficient data coverage and mitigate potential biases due to the small size of the test set.

5.3. Retriever Evaluation

The performance of the retriever is evaluated using Mean Average Precision (MAP@k), which calculates the average precision of relevant results up to rank k. As shown in Table 5, fine-tuning BGE-M3 leads to significant improvements, underscoring the importance of task-specific adaptation.
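For reference, the sketch below shows one common way to compute MAP@k, namely average precision over the top-k ranks averaged across queries; the exact normalization convention used in the paper may differ slightly.

# Minimal MAP@k implementation (written for clarity rather than speed).
def average_precision_at_k(ranked_ids, relevant_ids, k):
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            hits += 1
            score += hits / rank          # precision at this rank
    return score / min(len(relevant_ids), k) if relevant_ids else 0.0

def map_at_k(all_rankings, all_relevant, k):
    aps = [average_precision_at_k(r, rel, k)
           for r, rel in zip(all_rankings, all_relevant)]
    return sum(aps) / len(aps)

# Example: one query whose single gold chunk appears at rank 2 of the top 5.
print(map_at_k([["d3", "d7", "d1"]], [{"d7"}], k=5))  # -> 0.5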

5.4. Document Selection Strategy

In this study, we rank candidate documents by semantic similarity and apply a selection constraint: at most one external knowledge source is included per query. GPT-generated documents are limited to one per query as they provide only general technical background, and excessive reliance on such external content may reduce the factual grounding of responses. Limiting external LLM-generated background to one document per query reduces (i) context dilution and (ii) hallucination risks, while keeping auditability of non-proprietary content. Although this constraint is currently implemented via heuristic rules, we aim to develop a learning-based document selection strategy that jointly considers query characteristics and document provenance.
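A minimal sketch of such a heuristic selection rule is given below; the candidate fields ("score", "provenance") are illustrative assumptions.

# Rank candidates by similarity; keep at most one external LLM-generated doc.
def select_documents(candidates, k=5, max_external=1):
    selected, external_used = [], 0
    for doc in sorted(candidates, key=lambda d: d["score"], reverse=True):
        if doc["provenance"] == "external_llm":
            if external_used >= max_external:
                continue                 # skip further GPT-generated background
            external_used += 1
        selected.append(doc)
        if len(selected) == k:
            break
    return selected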

6. Generation

6.1. Generator

We use Qwen-2.5-14B-Instruct [16] as our base language model, as it is one of the few multilingual models that officially support Korean while offering a sufficient context length. Full fine-tuning of the 14B model was performed for three epochs using 3 × H100 (80 GB), with a batch size of 2, gradient accumulation of 64, and a learning rate of 2 × 10⁻⁵. We conduct LoRA fine-tuning using QA pairs introduced in Section 5.1.1.

6.2. Result

Table 6 summarizes the retrieved document sources and filtering results for the 135 test queries. All queries retrieved pre-generated expert knowledge, while 28.1% and 18.5% also retrieved GPT-generated and internal documents, respectively. Among the test queries, 34 are classified as safe, enabling on-demand GPT generation. Interestingly, the number of queries retrieving GPT-generated documents slightly exceeds the number of safe queries. This is because previously generated external documents remain in the retrieval pool and can still be retrieved for relevant future queries, even if those queries are classified as sensitive. This illustrates a key benefit of our system: as more external knowledge is accumulated over time, the retrieval pool becomes richer, allowing even sensitive queries to benefit from external knowledge without compromising security.

7. Evaluation

7.1. Method

To evaluate the effectiveness of our approach, we conduct a qualitative assessment based on three metrics: correctness, richness, and helpfulness. We perform pairwise comparisons using both LLM-as-a-judge [24] and human evaluation, comparing responses generated by Traditional RAG, which retrieves only from the internal knowledge base, with our SecMulti-RAG, which retrieves from the internal knowledge base, pre-generated expert knowledge, and on-demand external knowledge. The expert-generated test set consists of report generation requests in the expert-level automotive domain. For each metric, Claude and a human annotator assess which response (A or B) is better and record the outcome as a win, loss, or tie. To mitigate position bias from the judge LLM, we anonymize the response order by randomly assigning either SecMulti-RAG or Traditional RAG as response A or B in half of the cases (the prompt that we use for LLM-based evaluation is in Figure A4; a sketch of this pairwise protocol follows the metric definitions below):
  • Correctness assesses the factual consistency with the given gold answer. The pre-written reports are provided as gold answers.
  • Richness evaluates the level of detail and completeness in the response.
  • Helpfulness measures how clear, informative, and useful the response is.
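The following sketch illustrates the pairwise protocol with position anonymization; ask_judge is a hypothetical wrapper around the Claude evaluation prompt (Figure A4), not the authors' actual code.

# Pairwise comparison with randomized A/B order to reduce position bias.
import random

def judge_pair(query, gold, resp_secmulti, resp_traditional, metric):
    # Randomly decide which system appears as response A.
    if random.random() < 0.5:
        a, b, a_is_secmulti = resp_secmulti, resp_traditional, True
    else:
        a, b, a_is_secmulti = resp_traditional, resp_secmulti, False

    verdict = ask_judge(query, gold, a, b, metric)   # returns "A", "B", or "tie"

    if verdict == "tie":
        return "tie"
    secmulti_wins = (verdict == "A") == a_is_secmulti
    return "win" if secmulti_wins else "loss"        # outcome from SecMulti-RAG's view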

7.2. Result

Figure 3 presents a comparison of win rates between the two systems, evaluated by human annotators and an LLM-based evaluation. Both evaluation sources consistently prefer the outputs of SecMulti-RAG, particularly in the richness metric.

7.3. Analysis

While human evaluators tend to assign ‘tie’ labels more frequently than the judge LLM, both clearly favor SecMulti-RAG across all metrics. Figure 4 illustrates an example of reports generated by SecMulti-RAG and Traditional RAG.
As expected, richness is the most notably improved aspect of our framework, reflecting the benefit of incorporating diverse documents from multiple sources. SecMulti-RAG outputs contain significantly more detailed information, such as more diverse test cases and technical findings. In fact, the average length of reports generated by SecMulti-RAG is 2660.21 tokens, compared to 1631.84 tokens from Traditional RAG. In terms of correctness, both systems generally produce factually accurate content but occasionally fail to cite the correct source document. For helpfulness, while our approach delivers more comprehensive reports, Traditional RAG may be more favorable when the engineer’s intent is to focus on a specific test case, as its responses tend to be more narrowly scoped. This could potentially be mitigated through prompt tuning. One notable issue is the occasional generation of Chinese characters, due to the Qwen model’s Chinese-centric pretraining. This issue is likely caused by longer retrieved documents increasing the context length, which in turn makes the model more prone to such errors.
Table 7 reports the agreement rates and Gwet’s AC1 scores. Gwet’s AC1 is used instead of Cohen’s Kappa due to class imbalance in the evaluation, where SecMulti-RAG is consistently favored over Traditional RAG. The overall agreement between Claude and human judgments is substantial, particularly in richness and helpfulness. The correctness dimension shows relatively lower agreement, largely because the LLM tends to avoid assigning “tie” labels, which human annotators use more frequently. See Figure 5 for confusion matrices.
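For reference, Gwet's AC1 for two raters over Q categories is commonly computed as AC1 = (p_a − p_e) / (1 − p_e), with chance agreement p_e = (1 / (Q − 1)) · Σ_q π_q (1 − π_q), where p_a is the observed agreement, π_q is the average of the two raters' marginal proportions for category q, and Q = 3 here (win, loss, tie). Because p_e stays small when one category dominates, AC1 remains informative under the class imbalance noted above, whereas Cohen's Kappa can collapse toward zero.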

7.4. Application

In this study, we primarily evaluate our RAG framework on the report generation task. However, in practice, the framework is scalable to various tasks and domains. We conduct a preliminary test with a few engineering questions in the automotive domain, as shown in Figure A5 in Appendix B. The SecMulti-RAG responses include specific injury types, structural causes, and implications for official safety assessments, demonstrating greater richness and helpfulness compared to the Traditional RAG responses. This shows that our framework is scalable beyond report generation.
Furthermore, our framework is also cost-effective. For instance, generating lengthy reports with an average of 2660 tokens per query costs only about USD 2–3 in total for 100 queries when using GPT-4o selectively, highlighting the practicality of SecMulti-RAG for enterprise deployment.

8. Limitations

Pre-written expert knowledge can introduce topical bias: issues not anticipated by experts may be underrepresented. Also, there may be increased response latency due to the additional filtering stage and multi-source retrieval process. The filter trades recall for safety; false negatives pose a leakage risk, while false positives reduce recall of helpful external knowledge. Lastly, due to the lack of publicly available Korean-language datasets in the automotive domain, our evaluation is limited to the report generation task based on a relatively small amount of data that we have constructed ourselves. While this work is intended for an industry track and demonstrates practical significance, future research could enhance the academic impact by showing the scalability of the SecMulti-RAG framework across a broader range of tasks and domains. In fact, we have conducted preliminary experiments on engineering question answering tasks beyond report generation and observed that SecMulti-RAG also performs well in those scenarios, indicating its potential.

9. Conclusions

In this paper, we present SecMulti-RAG, an enterprise framework that integrates internal knowledge bases, pre-generated expert knowledge, and on-demand external knowledge. Our framework introduces a confidentiality-aware filtering mechanism that protects security-sensitive user prompts by bypassing external augmentation when necessary, mitigating the risk of information leakage to closed-source LLMs. In our experiments on automotive engineering report generation, SecMulti-RAG showed clear improvements over Traditional RAG in terms of correctness, richness, and helpfulness. It achieved win rates ranging from 56.3% to 70.4%, as evaluated by human evaluators, outperforming traditional RAG across all metrics. Beyond performance gains, our approach is a cost-efficient, privacy-preserving, and scalable solution, leveraging high-quality retrieval with locally hosted LLMs.

Author Contributions

Conceptualization, G.B. and J.D.C.; Data curation, S.L.; Formal analysis, G.B.; Funding acquisition, J.D.C.; Investigation, G.B.; Methodology, G.B. and J.D.C.; Project administration, S.L. and J.D.C.; Resources, S.L.; Software, G.B. and N.C.; Supervision, J.D.C.; Validation, S.L. and N.C.; Visualization, G.B.; Writing—original draft, G.B.; Writing—review and editing, G.B., S.L., N.C. and J.D.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are not publicly available due to confidentiality agreements with Hyundai Motor Company.

Acknowledgments

We gratefully acknowledge the support of the Hyundai Motor Company. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of Hyundai Motor Company.

Conflicts of Interest

Shinsun Lee was employed by “Hyundai Motor Company”. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

    The following abbreviations are used in this manuscript:
LLM: Large Language Model
RAG: Retrieval-Augmented Generation
BGE: BAAI (Beijing Academy of Artificial Intelligence) General Embedding
SecMulti-RAG: Secure Multifaceted-RAG

Appendix A. Prompts

Figure A1 shows the prompt for query data generation.
To generate pre-written expert knowledge, Korean automotive engineers employ the prompt shown in Figure A2, refined through extensive prompt tuning. The generated reports include a structured composition of problem definition, technical analysis, case analysis, improvement suggestions, and relevant references, and they are subsequently reviewed by domain experts. The keywords provided in Section 5.1.2 serve as the core problem topics addressed in each report. The university-internal Claude service on AWS Bedrock is used for security.
Figure A3 presents the prompt for classifying user queries by the presence of sensitive information, and Figure A4 depicts the prompt for LLM-based evaluation.
Figure A1. Prompt used for query data generation (translated to English).
Figure A2. Prompt used for generating gold reports. The prompt is originally in Korean.
Figure A3. Prompt used for the filtering process (translated to English).
Figure A4. Prompt used for LLM-based evaluation.

Appendix B. Framework Application

In this study, we primarily evaluate our RAG framework on the report generation task. However, in practice, the framework is scalable to various tasks and domains. We conducted a preliminary test with a few engineering questions in the automotive domain, as shown in Figure A5. The SecMulti-RAG responses include specific injury types, structural causes, and implications for official safety assessments, demonstrating greater richness and helpfulness compared to the Traditional RAG responses.
Figure A5. Comparison between Traditional RAG and SecMulti-RAG in QA task (translated from Korean).

References

  1. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS ’20), Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020. [Google Scholar]
  2. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774. [Google Scholar]
  3. Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku. 2024. Available online: https://api.semanticscholar.org/CorpusID:268232499 (accessed on 5 January 2025).
  4. DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar]
  5. Zhou, Y.; Liu, Z.; Dou, Z. AssistRAG: Boosting the Potential of Large Language Models with an Intelligent Information Assistant. arXiv 2024, arXiv:2411.06805. [Google Scholar] [CrossRef]
  6. Gutiérrez, B.J.; Shu, Y.; Gu, Y.; Yasunaga, M.; Su, Y. HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. arXiv 2025, arXiv:2405.14831. [Google Scholar]
  7. Jeong, S.; Baek, J.; Cho, S.; Hwang, S.J.; Park, J.C. Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. arXiv 2024, arXiv:2403.14403. [Google Scholar]
  8. Yu, W.; Iter, D.; Wang, S.; Xu, Y.; Ju, M.; Sanyal, S.; Zhu, C.; Zeng, M.; Jiang, M. Generate rather than Retrieve: Large Language Models are Strong Context Generators. arXiv 2023, arXiv:2209.10063. [Google Scholar] [CrossRef]
  9. Wu, R.; Chen, S.; Su, X.; Zhu, Y.; Liao, Y.; Wu, J. A Multi-Source Retrieval Question Answering Framework Based on RAG. arXiv 2024, arXiv:2405.19207. [Google Scholar] [CrossRef]
  10. Zhang, S.; Ye, L.; Yi, X.; Tang, J.; Shui, B.; Xing, H.; Liu, P.; Li, H. “Ghost of the past”: Identifying and resolving privacy leakage from LLM’s memory through proactive user interaction. arXiv 2024, arXiv:2410.14931. [Google Scholar]
  11. Kim, S.; Yun, S.; Lee, H.; Gubri, M.; Yoon, S.; Oh, S.J. ProPILE: Probing Privacy Leakage in Large Language Models. arXiv 2023, arXiv:2307.01881. [Google Scholar] [CrossRef]
  12. Hayes, J.; Melis, L.; Danezis, G.; Cristofaro, E.D. LOGAN: Evaluating Privacy Leakage of Generative Models Using Generative Adversarial Networks. arXiv 2017, arXiv:1705.07663. [Google Scholar]
  13. Lukas, N.; Salem, A.; Sim, R.; Tople, S.; Wutschitz, L.; Zanella-Béguelin, S. Analyzing Leakage of Personally Identifiable Information in Language Models. arXiv 2023, arXiv:2302.00539. [Google Scholar] [CrossRef]
  14. Chong, C.J.; Hou, C.; Yao, Z.; Talebi, S.M.S. Casper: Prompt Sanitization for Protecting User Privacy in Web-Based Large Language Models. arXiv 2024, arXiv:2408.07004. [Google Scholar] [CrossRef]
  15. Zhou, X.; Lu, Y.; Ma, R.; Gui, T.; Zhang, Q.; Huang, X. TextMixer: Mixing Multiple Inputs for Privacy-Preserving Inference. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 3749–3762. [Google Scholar] [CrossRef]
  16. Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. Qwen2.5 Technical Report. arXiv 2024, arXiv:2412.15115. [Google Scholar]
  17. Douze, M.; Guzhva, A.; Deng, C.; Johnson, J.; Szilvasy, G.; Mazaré, P.E.; Lomeli, M.; Hosseini, L.; Jégou, H. The Faiss library. arXiv 2025, arXiv:2401.08281. [Google Scholar]
  18. Choi, N.; Byun, G.; Chung, A.; Paek, E.S.; Lee, S.; Choi, J.D. Reference-Aligned Retrieval-Augmented Question Answering over Heterogeneous Proprietary Documents. arXiv 2025, arXiv:2502.19596. [Google Scholar]
  19. Mizuno, K. Crash Safety of Passenger Vehicles; Translated from Japanese by Kyungwon Song; Reviewed by Inhwan Han, Jongjin Park, Sungjin Kim, Jongchan Park, and Namgyu Park; Bomyung Books: Seoul, Republic of Korea, 2016. [Google Scholar]
  20. Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. In Proceedings of the Findings of the Association for Computational Linguistics ACL 2024, Bangkok, Thailand and Virtual Meeting, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 2318–2335. [Google Scholar] [CrossRef]
  21. Xiao, S.; Liu, Z.; Zhang, P.; Xing, X. LM-Cocktail: Resilient Tuning of Language Models via Model Merging. arXiv 2023, arXiv:2311.13534. [Google Scholar] [CrossRef]
  22. Zhang, P.; Xiao, S.; Liu, Z.; Dou, Z.; Nie, J.Y. Retrieve Anything To Augment Large Language Models. arXiv 2023, arXiv:2310.07554. [Google Scholar] [CrossRef]
  23. Xiao, S.; Liu, Z.; Zhang, P.; Muennighoff, N. C-Pack: Packaged Resources To Advance General Chinese Embedding. arXiv 2023, arXiv:2309.07597. [Google Scholar]
  24. Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.P.; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv 2023, arXiv:2306.05685. [Google Scholar]
Figure 1. SecMulti-RAG framework. After the filtering prompt is invoked, the system routes the query either to the external-LLM branch (if classified as safe) or to the internal+expert-only branch (if classified as sensitive).
Figure 2. Prompt used for generating on-demand external knowledge using GPT-4o.
Figure 3. Win rate comparison between SecMulti-RAG and Traditional RAG across evaluation metrics.
Figure 4. Comparison between Traditional RAG and SecMulti-RAG (translated from Korean; sensitive information anonymized).
Figure 5. Confusion matrices showing the agreement between LLMs and human evaluations. Most counts lie in the diagonal cells, indicating consistent agreement between both evaluators.
Table 1. Data split for training and evaluation of the filter model.
Set        | General | Unsafe (Easy) | Unsafe (Hard) | Total
Train      | 820     | 800           | 720           | 2340
Validation | 102     | 100           | 90            | 292
Test       | 103     | 101           | 90            | 294
Table 2. Evaluation results of the filter model and human annotators on the test set. Easy-only includes security-sensitive queries with explicit project names. Easy&Hard includes both easy (with project names) and hard (without project names) queries. Human shows expert-labeled upper bound performance.
Method | Test Data | Acc % | Prec  | Rec
Filter | Easy-only | 98.04 | 97.09 | 99.01
Filter | Easy&Hard | 82.31 | 97.93 | 74.35
Human  | Easy&Hard | 80.95 | 92.99 | 76.44
Table 3. Overview of chunked documents used in SecMulti-RAG retrieval. Traditional RAG retrieves only from the Enterprise Knowledge Base. The column “File/Page” indicates the number of files for test reports and meeting reports and the number of pages for the textbook.
Type                         | Source         | File/Page | Chunks
Enterprise Knowledge Base    | Test Report    | 1463      | 4662
Enterprise Knowledge Base    | Meeting Report | 249       | 882
Enterprise Knowledge Base    | Textbook       | 404       | 81
Pre-written Expert Knowledge | Gold Report    | -         | 540
Total                        |                |           | 6165
Table 4. Enterprise Knowledge Base dataset: statistics by source, detailing the distribution of chunks and QA pairs.
Source         | Data    | Train  | Val  | Test | Total
Test Report    | Chunk   | 3729   | 466  | 467  | 4662
Test Report    | QA Pair | 47,660 | 5823 | 5919 | 59,402
Meeting Report | Chunk   | 705    | 88   | 89   | 882
Meeting Report | QA Pair | 6144   | 752  | 800  | 7696
Textbook       | Chunk   | 64     | 8    | 9    | 81
Textbook       | QA Pair | 1182   | 162  | 161  | 1505
Table 5. Comparison of retrieval performance between the vanilla and fine-tuned models on our test dataset.
Model            | MAP@1  | MAP@5  | MAP@10
BGE (Vanilla)    | 0.2855 | 0.3793 | 0.3925
BGE (Fine-tuned) | 0.5965 | 0.7027 | 0.7099
Note. Bold values indicate the best performance in each column.
Table 6. Distribution of retrieved document types (appearing at least once among the top 5 documents) and filtering outcomes in SecMulti-RAG.
Category                        | Count (%)
Pre-written knowledge retrieved | 135 (100%)
External knowledge retrieved    | 38 (28.1%)
Internal document retrieved     | 25 (18.5%)
Filter = 1 (safe)               | 34 (25.2%)
Filter = 0 (security-sensitive) | 101 (74.8%)
Table 7. Agreement between Claude and human evaluation results.
Metric      | Agreement (%) | Gwet's AC1
Correctness | 56.74%        | 0.4295
Richness    | 73.05%        | 0.6812
Helpfulness | 69.50%        | 0.6264
Overall     | 72.34%        | 0.6647
