BioChat: A Domain-Specific Biodiversity Question-Answering System to Support Sustainable Conservation Decision-Making
Abstract
1. Introduction
1.1. Research Questions
1.2. Main Contributions
- Quantitative Component Analysis: We systematically evaluate how domain-specific embedding models, routing strategies, and generator LLMs influence hallucination rates and factual accuracy in biodiversity-oriented question answering.
- Empirical Routing Threshold Optimization: We propose a reproducible methodology for determining an optimal routing threshold (τ = 0.02) through statistical analysis of re-ranker score distributions, enabling precision-oriented routing that effectively filters irrelevant or noisy queries.
- LLM Benchmark for Biodiversity: We establish a novel benchmark comparing six open-source LLMs (including Qwen2, Gemma-2, and Llama-3.1), demonstrating that instruction-following capability is more critical than model size for scientific reliability.
- Sustainability-Oriented System Blueprint: We present a high-fidelity, reproducible RAG framework designed to support environmental education and evidence-based conservation decision-making by reducing barriers to accessing expert-verified biodiversity data.
2. Related Work
2.1. Limitations of Large Language Models in Scientific Domains
2.2. Evolution of Retrieval-Augmented Generation (RAG)
2.3. Challenges in Biodiversity Informatics
3. Materials and Methods
3.1. Proposed High-Fidelity RAG Pipeline: BioChat
3.1.1. Indexing Pipeline
3.1.2. Query Processing Pipeline
3.1.3. Human-in-the-Loop Feedback Mechanism
3.1.4. System Prototype
3.2. Evaluation Setup
3.2.1. Evaluation Dataset Construction
3.2.2. Baseline Systems
- Baseline 1: Naive RAG (no routing) represents a simple pipeline that always retrieves the top three documents based solely on bi-encoder similarity, forcing the RAG path regardless of query relevance.
- Baseline 2: RAG + L2 Distance Threshold Routing introduces a routing mechanism that decides between the RAG and fallback paths using the L2 squared distance obtained from ChromaDB retrieval results. The threshold value (L2_THRESHOLD = 110.0) was determined empirically from preliminary distance distribution analysis.
- Proposed: RAG + Reranker Threshold Routing represents the full pipeline described in Section 3.1.2, where the routing decision is based on the reranker score threshold (RERANK_THRESHOLD = 0.02) derived from score distribution analysis (Section 3.2.1). This configuration leverages cross-encoder-based semantic scoring to more precisely distinguish contextually relevant queries from irrelevant ones.
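Under the stated thresholds, the routing decision in each configuration reduces to a one-line comparison. The sketch below is illustrative only: the scores passed in are hypothetical stand-ins for real ChromaDB distances and cross-encoder outputs, and the function names are ours, not the system's.

```python
# Sketch of the two threshold-routing strategies compared in Section 3.2.2.
# Input scores are hypothetical; in the real pipeline they come from
# ChromaDB retrieval (L2 squared distance) and a cross-encoder reranker.

L2_THRESHOLD = 110.0      # squared L2 distance cutoff (Baseline 2)
RERANK_THRESHOLD = 0.02   # cross-encoder score cutoff tau (proposed)

def route_by_l2(min_l2_sq_distance: float) -> str:
    """Baseline 2: route to RAG when the nearest chunk is close enough."""
    return "rag" if min_l2_sq_distance <= L2_THRESHOLD else "fallback"

def route_by_reranker(top_rerank_score: float) -> str:
    """Proposed: route to RAG when the best reranker score clears tau."""
    return "rag" if top_rerank_score >= RERANK_THRESHOLD else "fallback"

print(route_by_l2(87.5))        # close retrieval hit -> "rag"
print(route_by_reranker(0.004)) # weak semantic match -> "fallback"
```

Note that the naive baseline corresponds to always returning `"rag"`, regardless of score.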
3.2.3. Selection of Generative Language Models
3.2.4. Evaluation Metrics
- 1. Retrieval Performance
- 2. Routing Performance
- 3. Final Answer Performance
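The retrieval metrics used here (MRR and Recall@K; see the Abbreviations table) can both be computed from the rank at which the first relevant document appears for each query. A minimal sketch with toy ranks rather than the paper's data:

```python
def mrr(first_hit_ranks: list[int]) -> float:
    """Mean reciprocal rank: first_hit_ranks[i] is the 1-based rank of the
    first relevant document for query i, or 0 if it was never retrieved."""
    return sum(1.0 / r for r in first_hit_ranks if r > 0) / len(first_hit_ranks)

def recall_at_k(first_hit_ranks: list[int], k: int) -> float:
    """Fraction of queries whose relevant document appears within the top k."""
    return sum(1 for r in first_hit_ranks if 0 < r <= k) / len(first_hit_ranks)

hits = [1, 3, 0, 2]           # ranks of the gold chunk for four toy queries
print(mrr(hits))              # (1 + 1/3 + 0 + 1/2) / 4
print(recall_at_k(hits, 3))   # 3 of 4 queries hit within top 3 -> 0.75
```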
3.3. Implementation Details
4. Results
4.1. Experiment 1: Embedding Model Performance
4.2. Experiment 2: Routing Strategy Performance
Final Answer Quality
4.3. Experiment 3: LLM Model Comparison
5. Discussion
5.1. Summary of Findings
5.2. Implications and Contributions
5.3. Limitations and Future Work
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| AI | Artificial intelligence |
| AUC | Area under the curve |
| Bi-Encoder | Dual-encoder (two-tower) sentence encoder |
| BnB | bitsandbytes quantization |
| CE | Cross-Encoder |
| ChromaDB | Chroma Database |
| GBIF | Global Biodiversity Information Facility |
| GPU | Graphics Processing Unit |
| k-NN | k-Nearest Neighbors |
| LLM | Large language model |
| MRR | Mean reciprocal rank |
| NIBR | National Institute of Biological Resources |
| NLP | Natural Language Processing |
| QA | Question answering |
| Q&A | Question–Answer |
| RAG | Retrieval-Augmented Generation |
| Recall@K | Recall at K |
| ROC | Receiver Operating Characteristic |
| RQ | Research question |
| SBERT | Sentence-BERT |
| UI | User interface |
Appendix A
- Query Templates and Example Questions
- Type A template (RAG):
- System: “You are a biodiversity expert AI assistant using verified NIBR data.”
- User: “{{question}}”
- Type B template (Fallback):
- System: “You are a general-purpose assistant.”
- User: “{{question}}”
- Example Type A queries (RAG):
- “What is the scientific name of the Suwon tree frog?”
- “Where is Drosera rotundifolia distributed in Korea?”
- “Which species of genus Quercus are endemic to Jeju Island?”
- Example Type B queries (Fallback):
- “Who proposed the theory of evolution?”
- “What is the average temperature of the Amazon rainforest?”
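Once a query has been routed, filling the Appendix A templates is mechanical. A small sketch assuming a chat-style message format; the helper name is ours:

```python
# System prompts taken verbatim from the Appendix A templates.
TEMPLATES = {
    "A": "You are a biodiversity expert AI assistant using verified NIBR data.",
    "B": "You are a general-purpose assistant.",
}

def build_messages(query_type: str, question: str) -> list[dict]:
    """Fill the template for the routed path ('A' = RAG, 'B' = fallback),
    substituting the user question for the {{question}} placeholder."""
    return [
        {"role": "system", "content": TEMPLATES[query_type]},
        {"role": "user", "content": question},
    ]

msgs = build_messages("A", "What is the scientific name of the Suwon tree frog?")
```

In the RAG path, the retrieved context would additionally be injected before generation; that step is omitted here.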
Appendix B
- Annotation Protocol and Evaluation Criteria
| Metric | Definition | Criteria for “1” (Positive) | Criteria for “0” (Negative) |
|---|---|---|---|
| Factual Accuracy | Whether the answer is factually correct and grounded in context | Consistent with ground truth or NIBR data | Incorrect or unverifiable |
| Hallucination | Whether the answer contains fabricated or unsupported content | Contains any non-grounded statement | Fully grounded or empty |
References
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
- Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Liu, Y.; Smith, G.S.; Liu, Z.; Fung, P. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2023, 55, 428. [Google Scholar] [CrossRef]
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-T.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
- Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. IEEE Consum. Electron. Mag. 2024, 13, 62–75. [Google Scholar]
- Guu, K.; Lee, K.; Tung, Z.; Pasupat, P.; Chang, M. REALM: Retrieval-Augmented Language Model Pre-training. In Proceedings of the 37th International Conference on Machine Learning (ICML’20), Virtual, 13–18 July 2020; pp. 3887–3898. [Google Scholar]
- Zhang, W.; Zhang, J. Hallucination Mitigation for Retrieval-Augmented Large Language Models: A Review. Mathematics 2025, 13, 856. [Google Scholar] [CrossRef]
- Hofstätter, S.; Lin, S.; Yang, J.-H.; Khattab, O.; Mitra, B.; Hanbury, A. Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval SIGIR ‘21, Virtual Event, 11–15 July 2021; pp. 113–122. [Google Scholar]
- Thessen, A.E.; Cui, H.; Mozzherin, D. Applications of Natural Language Processing in Biodiversity Science. Adv. Bioinform. 2012, 2012, 391574. [Google Scholar] [CrossRef]
- Patterson, D.J.; Egloff, W.; Agosti, D.; Eades, D.; Haskins, N.; Hyam, R.; Mozzherin, D.; Shorthouse, D.P.; Thessen, A. Challenges with Using Names to Link Digital Biodiversity Information. BMC Bioinform. 2016, 17, 122–128. [Google Scholar] [CrossRef]
- Zizka, A.; Antunes Carvalho, F.; Vergara, D.; Calvente, A.; Baez-Lizarazo, M.R.; Cabral, A.; Coelho, J.F.R.; Colli-Silva, M.; Fantinati, M.R.; Fernandes, M.F.; et al. No One-Size-Fits-All Solution to Clean GBIF. Methods Ecol. Evol. 2020, 11, 1117–1122. [Google Scholar] [CrossRef]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models Are Unsupervised Multitask Learners. OpenAI Technical Report. 2019. Available online: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (accessed on 25 December 2025).
- Nogueira, R.; Cho, K. Passage Re-Ranking with BERT. In Proceedings of the Workshop on Intelligent Information Processing and Natural Language Generation, Santiago de Compostela, Spain, 7 September 2020. [Google Scholar]
- Yu, Y.; Ping, W.; Liu, Z.; Wang, B.; You, J.; Zhang, C.; Shoeybi, M.; Catanzaro, B. RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs. Adv. Neural Inf. Process. Syst. 2024, 37, 121156–121184. [Google Scholar]
- Johnson, J.; Douze, M.; Jégou, H. Billion-Scale Similarity Search with FAISS. IEEE Trans. Big Data 2021, 7, 535–547. [Google Scholar] [CrossRef]
- Wang, J.; Wang, W.; Ma, R.; Wang, Z.; Song, J. A Comprehensive Survey on Vector Databases. Proc. VLDB Endow. 2024, 17, 4116–4129. [Google Scholar]
- Wang, Z.; Liang, Z.; Shao, Z.; Ma, Y.; Dai, H.; Chen, B.; Mao, L.; Lei, C.; Ding, Y.; Li, H. InfoGain-RAG: Boosting Retrieval-Augmented Generation through Document Information Gain-based Reranking and Filtering. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, 4–9 November 2025. [Google Scholar]
- Liu, N.F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; Liang, P. Lost in the Middle: How Language Models Use Long Contexts. Trans. Assoc. Comput. Linguist. 2024, 12, 157–173. [Google Scholar] [CrossRef]
- Dettmers, T.; Lewis, M.; Shleifer, S.; Zettlemoyer, L. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. Adv. Neural Inf. Process. Syst. 2022, 35, 30318–30332. [Google Scholar]
- Gemma Team; Riviere, M.; Pathak, S.; Sessa, P.G.; Hardin, C.; Bhupatiraju, S.; Hussenot, L.; Mesnard, T.; Shahriari, B.; Ramé, A.; et al. Gemma 2: Improving Open Language Models at a Practical Size. arXiv 2024, arXiv:2408.00118. [Google Scholar] [CrossRef]
- Dubey, A.; Jauhri, A.; Pandey, A.; Keshwam, A.; Faulkner, A.; Holt, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
- Yang, A.; Yang, B.; Hui, B.; Zheng, B.; Yu, B.; Zhou, C.; Li, C.; Li, C.; Liu, D.; Huang, F.; et al. Qwen2 Technical Report. arXiv 2024, arXiv:2407.10671. [Google Scholar]
- Saad-Falcon, J.; Finlayson, M.G.; Saggere, A.; Khattab, O.; Potts, C.; Zaharia, M. ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. Adv. Neural Inf. Process. Syst. 2023, 36, 33965–33979. [Google Scholar]
- Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 1960, 20, 37–46. [Google Scholar] [CrossRef]
- Topsakal, O.; Akinci, T.Ç. Creating Large Language Model Applications Utilizing LangChain: A Primer on Developing LLM Apps Fast. In Proceedings of the International Conference on Applied Engineering and Natural Sciences, Konya, Turkey, 10–12 July 2023; pp. 1050–1056. [Google Scholar]
- Park, Y.; Shin, Y. Using Multiple Monolingual Models for Efficiently Embedding Korean and English Conversational Sentences. Appl. Sci. 2023, 13, 5771. [Google Scholar] [CrossRef]
- Fureby, L. Domain Adaptation of Retrieval Systems from Unlabeled Corpora. Master’s Thesis, Lund University, Lund, Sweden, 2024. [Google Scholar]
- Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 3–7 November 2019; pp. 3982–3992. [Google Scholar]
- Thakur, N.; Reimers, N.; Rücklé, A.; Srivastava, A.; Gurevych, I. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval. Adv. Neural Inf. Process. Syst. 2021, 34, 25206–25222. [Google Scholar]
- Thakur, N.; Reimers, N.; Daxenberger, J.; Gurevych, I. Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 296–310. [Google Scholar]
- Karpukhin, V.; Oğuz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.-T. Dense Passage Retrieval for Open-Domain Question Answering. Adv. Neural Inf. Process. Syst. 2020, 33, 2895–2907. [Google Scholar]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Zhao, V.Y.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; et al. Finetuned Language Models Are Zero-Shot Learners. In Proceedings of the Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, 25–29 April 2022. [Google Scholar]
- Zhou, C.; Liu, P.; Xu, P.; Iyer, S.; Sun, J.; Mao, Y.; Ma, X.; Efrat, A.; Yu, P.; Yu, L.; et al. LIMA: Less Is More for Alignment. Adv. Neural Inf. Process. Syst. 2023, 36, 22165–22185. [Google Scholar]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training Language Models to Follow Instructions with Human Feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744. [Google Scholar]
- Ziegler, D.M.; Stiennon, N.; Wu, J.; Brown, T.; Radford, A.; Amodei, D.; Christiano, P.; Irving, G. Fine-Tuning Language Models from Human Preferences. arXiv 2019, arXiv:1909.08593. [Google Scholar]
- Rafailov, R.; Sharma, A.; Mitchell, E.; Manning, C.D.; Ermon, S.; Finn, C. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Adv. Neural Inf. Process. Syst. 2023, 36, 53728–53741. [Google Scholar]
- Long, Z.; Chen, J.; Zhou, Y.; Zhang, R. Retrieval-Augmented Domain Adaptation via In-Context Learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 7250–7264. [Google Scholar]
- Gao, T.; Yao, X.; Chen, D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. Adv. Neural Inf. Process. Syst. 2021, 34, 16973–16985. [Google Scholar]
- Hu, E.J.; Shen, J.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, L.; Wang, X. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, 25–29 April 2022. [Google Scholar]
- Ram, O.; Levine, Y.; Dalmedigos, I.; Muhlgay, D.; Shashua, A.; Leyton-Brown, K.; Shoham, Y. In-Context Retrieval-Augmented Language Models. Trans. Assoc. Comput. Linguist. 2024, 12, 19–37. [Google Scholar] [CrossRef]
- Yasunaga, M.; Aghajanyan, A.; Shi, W.; James, R.; Leskovec, J.; Liang, P.; Lewis, M.; Zettlemoyer, L.; Yih, W. Retrieval-Augmented Multimodal Language Modeling. Proc. Int. Conf. Mach. Learn. (ICML) 2023, 202, 39755–39769. [Google Scholar]

| Type | Definition | Routing Scenario | Consequence |
|---|---|---|---|
| False Negative | The query should follow the RAG path but was routed to the fallback path. | Queries that should be handled by RAG are misclassified as out-of-scope (Type B). | Leads to “I don’t know” responses or hallucinations caused by missing contextual grounding. |
| False Positive | The query should follow the fallback path but was routed to the RAG path. | Queries that should rely on the LLM’s general knowledge are sent to RAG. | Irrelevant context is injected into the prompt, producing confused or fabricated answers. |
| Question | Focus Area | Description |
|---|---|---|
| (RQ1) | Embedding model | What is the impact of domain-specific embedding models on RAG retrieval performance? |
| (RQ2) | Routing strategy | How effective is the reranker-based routing strategy in mitigating hallucinations? |
| (RQ3) | Generative LLM comparison | Within the optimized RAG pipeline, what differences in performance are observed among various open-source LLM architectures: Llama 3.1, Qwen2, Gemma-2, KoAlpaca, Mistral, and EXAONE? |
| Model Name | Organization/Developer | Parameters | Language/Tuning | Quantization |
|---|---|---|---|---|
| EXAONE-3.5-7.8B-Instruct | LG AI Research | 7.8B | Korean + English | 4-bit BnB |
| Gemma-2-9B-Instruct [21] | Google DeepMind | 9B | Multilingual | 4-bit BnB |
| KoAlpaca-Polyglot-12.8B | Eleuther | 12.8B | Korean | 4-bit BnB |
| Meta-Llama-3.1-8B-Instruct [22] | Meta Platforms | 8B | Multilingual | 4-bit BnB |
| Mistral-7B-Instruct-v0.3 | Mistral AI | 7B | Multilingual | 4-bit BnB |
| Qwen-2.5-7B-Instruct [23] | Alibaba Cloud | 7B | Multilingual | 4-bit BnB |
| Metric | jhgan/ko-sroberta-multitask | sentence-transformers/all-MiniLM-L6-v2 |
|---|---|---|
| MRR | 0.2675 | 0.0587 |
| Recall@1 | 0.2267 | 0.0500 |
| Recall@3 | 0.2867 | 0.0700 |
| Recall@5 | 0.3233 | 0.0700 |
| Recall@10 | 0.3700 | 0.0733 |
| Strategy | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| L2 Threshold | 0.8775 | 0.9197 | 0.9167 | 0.9182 |
| Reranker (proposed) | 0.7625 | 0.9556 | 0.7167 | 0.8190 |
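As a consistency check, the reported F1 scores follow from the precision and recall columns via the harmonic mean; for the proposed reranker routing, 0.9556 precision and 0.7167 recall yield approximately 0.819, matching the table to within rounding:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision/recall values reported for the two routing strategies:
print(round(f1(0.9197, 0.9167), 4))  # L2 threshold, ≈ 0.9182
print(round(f1(0.9556, 0.7167), 4))  # reranker (proposed), ≈ 0.8191
```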
| Strategy | Factual Accuracy | Hallucination Rate |
|---|---|---|
| Naive RAG | 0.4785 | 0.3399 |
| L2 Threshold | 0.5446 | 0.3135 |
| Reranker (proposed) | 0.7129 | 0.2442 |
| LLM Model | Quantization | Factual Accuracy | Hallucination Rate |
|---|---|---|---|
| Qwen2-7B-Instruct | 4-bit BnB | 0.6600 | 0.2500 |
| Gemma-2-9B-Instruct | 4-bit BnB | 0.6075 | 0.2250 |
| EXAONE-3.5-7.8B-Instruct | 4-bit BnB | 0.5525 | 0.2475 |
| Meta-Llama-3.1-8B-Instruct | 4-bit BnB | 0.5900 | 0.4175 |
| Mistral-7B-Instruct-v0.3 | 4-bit BnB | 0.5550 | 0.4475 |
| KoAlpaca-Polyglot-12.8B | 4-bit BnB | 0.3775 | 0.6275 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Jang, D.-S.; Yi, J.-S.; Jeon, H.-B.; Hong, Y.-S. BioChat: A Domain-Specific Biodiversity Question-Answering System to Support Sustainable Conservation Decision-Making. Sustainability 2026, 18, 396. https://doi.org/10.3390/su18010396

