Article

A RAG-Augmented LLM for Yunnan Arabica Coffee Cultivation

College of Big Data, Yunnan Agricultural University, Kunming 650201, China
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(22), 2381; https://doi.org/10.3390/agriculture15222381
Submission received: 20 October 2025 / Revised: 7 November 2025 / Accepted: 12 November 2025 / Published: 18 November 2025
(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Abstract

Foundation models for agriculture often suffer from fragmented and stale knowledge, making it difficult to deliver stable, traceable answers. We present an evidence-grounded retrieval-augmented generation (RAG) system for Yunnan Arabica coffee cultivation. First, we curate a lightweight knowledge base (approximately 250k Chinese characters) from cultivation textbooks, technical guidelines, and reports. Second, we adopt a retrieve–rerank–generate workflow: semantic-aware chunking with stable identifiers [docid#cid]; hybrid retrieval fused by reciprocal rank fusion (RRF); cross-encoder reranking of the fused candidates; and final answer generation by DeepSeek v3.1 with mandatory inline evidence tags. In addition, we use GPT-5 Thinking to synthesize 346 gold QA items on the corpus with document-/chunk-level citations, and we evaluate with citation-level per-sample macro precision/recall/F1. On this gold set, our optimized system attains a citation-level per-sample macro F1 of 0.813 (81.3%), significantly outperforming a Simple RAG baseline that reads only a vector store (0.573; 57.3%). Error analysis shows that residual errors are dominated by fragment mismatch and missing evidence; latency analysis indicates that end-to-end delay is primarily driven by generation, whereas retrieval, fusion, and reranking incur sub-0.1 s overhead. The workflow preserves traceability and verifiability and supports hot updates via index rebuilding rather than model fine-tuning; we release scripts for corpus construction, ablation, and citation-based evaluation to facilitate reproducibility.

1. Introduction

Yunnan Province lies in Southwest China, bordering Myanmar, Laos, and Vietnam. Positioned on a low-latitude plateau with marked vertical zonation, the region offers abundant sunshine and large diurnal temperature ranges, establishing a favorable ecological basis for tropical–subtropical crop cultivation [1]. Small-bean Arabica coffee is sensitive to temperature and precipitation; optimal growth generally requires a mean annual temperature of 15–24 °C and annual rainfall or equivalent irrigation of no less than 1200–1500 mm, and it is vulnerable to frost. Higher elevations help accumulate flavor compounds but demand attention to cold protection [2,3,4]. Owing to these climatic and topographic advantages, Yunnan has become China’s primary Arabica production area. Studies indicate that in 2020 the harvested area in Yunnan reached ∼100,000 ha, accounting for over 99% of the national planting area; multiple academic and industry sources suggest this share has long remained between 95% and 98% [5,6,7]. However, local production has traditionally relied on experiential practices and offline extension, with fragmented knowledge access and low standardization, which constrains the effective adoption of digital tools at the farm level. In recent years, retrieval-augmented generation (RAG) has offered a new pathway for agronomic knowledge services and decision support by coupling large language models with external knowledge as traceable, non-parametric memory [8].
For knowledge-intensive tasks, RAG follows a retrieve–rerank–conditional-generation pipeline, integrating traceable evidence into responses so that new knowledge can be absorbed rapidly without adjusting model parameters [8]. For Chinese agricultural knowledge services in practice, key bottlenecks include reliance on Chinese word segmentation, whose errors can degrade the relevance estimation of term-based sparse methods such as BM25 [9], and the scarcity of localized data tailored to specific regions and operational norms. Moreover, technical requirements evolve with policy changes, pest and disease dynamics, and shifts in varietal structure, making one-off model fine-tuning difficult to sustain [10,11]. We therefore adopt RAG rather than pure fine-tuning. By externalizing domain knowledge as non-parametric memory, RAG supports hot updates and provides traceable evidence for answers; prior studies also show that retrieval augmentation effectively reduces hallucinations and improves factual consistency [12]. To support the Yunnan coffee scenario, we curated and cleaned approximately 250k Chinese characters of agronomic materials covering key stages of ecological suitability, varieties and propagation, site establishment and pruning, fertilization and irrigation, shading and intercropping, pest/disease/disaster control, and harvesting and primary processing, together with topics such as high-altitude/cold-injury risk management and water–fertilizer scheduling for rainy seasons. Sources include academic papers, industry reports, and two textbooks/monographs related to Yunnan coffee cultivation management, providing the data and evidence basis for subsequent system design and experimental evaluation [9,10,11,12].
Beyond agriculture, production deployments typically choose among several LLM integration strategies: closed-book (prompt-only) LLMs, domain fine-tuned LLMs, long-context (context stuffing), classic retrieve–rerank–generate (RAG), tool/function-calling pipelines, knowledge-graph (KG) augmented LLMs, and freshness-aware online-search RAG. Closed-book models are simple to operate but offer weak traceability and slow knowledge refresh. Domain fine-tuning can internalize knowledge and stabilize task behavior, at the cost of high-quality data requirements and model/version governance. Long-context approaches inject source passages directly but face position sensitivity (“lost in the middle”) and rising computation [13]. Classic RAG externalizes knowledge for hot updates and traceable evidence, yet depends on chunking quality and retriever coverage, with end-to-end latency often dominated by generation [14,15]. On-demand/self-reflective RAG retrieves only when needed and critiques outputs, improving factuality and retrieval economy while increasing control complexity [16]. Tool/function-calling pipelines provide deterministic access to databases/APIs for structured tasks but require schema integration and guardrails. KG-augmented LLMs improve structural consistency and interpretability but incur curation cost and potential freshness gaps. Freshness-aware/online-search RAG improves recency while introducing source volatility, caching, and compliance considerations [17]. In this project, we adopt the classic retrieve–rerank–generate RAG because its strengths of hot updates, evidence traceability, and mature tooling align with our need for verifiable answers and rapid knowledge refresh in the Yunnan coffee domain. As a compact summary, Table 1 contrasts architectures by primary strengths and limitations.
Building on this foundation, our contributions are as follows. For the Yunnan coffee cultivation scenario, we construct a localized Chinese knowledge vector base (250k characters) by performing chapter- and topic-level segmentation and cleaning over papers, industry materials, and textbooks/monographs; we establish a stable identifier system [docid#cid] to form a retrievable, traceable, and updatable domain-specific knowledge base [5,6,7]. Using GPT-5 Thinking on the normalized corpus and semantic slices, we automatically construct and filter a small companion gold Q&A set, followed by human re-verification on a random sample of 100 items (28.9%) to remove obvious miscitation and noise. This gold set is used for system evaluation and ablation in our experiments and can serve as a reproducible verification test for Chinese agronomy, partially filling the gap of public fragment-level gold labels. We propose an evidence-first processing pipeline “semantic-aware chunking + sparse/dense hybrid retrieval + RRF fusion + cross-encoder reranking” and introduce History-Aware Query Rewriting (HAR) and In-Prompt History Injection (IHI) to align retrieval intent with generation context and improve multi-turn consistency [8]. From an engineering perspective, we adopt DeepSeek v3.1 as the generation engine, decouple and interface it with Chinese retrieval and reranking components, and design prompts plus domain-specific hyperparameters around local Yunnan terminology and common query forms, ultimately yielding an evidence-anchored, sustainably evolving expert Q&A workflow.

2. Materials and Methods

2.1. RAG Pipeline

Our system follows a two-stage framework of “offline knowledge-base construction + online retrieve–rerank–generate” (see Figure 1). In the offline stage, we normalize Chinese texts on Yunnan coffee cultivation, perform semantic-aware chunking, assign stable identifiers [docid#cid] to each slice, and write them into a vector store. In the online stage, we first apply History-Aware Query Rewriting (HAR): an enhanced query is formed from the most recent t dialog turns (default t = 2) and is used both for retrieval and prompt construction. We then conduct hybrid retrieval (BM25-based sparse and dense vector search), fuse results via RRF, rerank them with a cross-encoder, and feed the Top-K evidence to the generator. During generation, we use In-Prompt History Injection (IHI): after the system message, the most recent N turns are injected with fixed-length clipping (default N = 8; each turn truncated to L = 800 characters), followed by the current task and evidence. Answers must include inline citations [docid#cid] after key factual statements; if no citations appear, the system appends “References […]” at the end to ensure traceability.
We adopt DeepSeek v3.1 as the base generator for three reasons. First, according to the DeepSeek-V3 technical report, models in this family achieve performance comparable to or better than leading open-source peers across multiple Chinese–English benchmarks (e.g., C-Eval, CMMLU, and CCPM; see Table 2 from [18]) and excel at high-difficulty tasks such as mathematics and coding, satisfying our need for “Chinese agronomy Q&A with robust reasoning” [18]. Second, LMSYS Chatbot Arena’s public preference evaluations show that DeepSeek v3.1 remains competitive under hard-prompt settings, indicating strength in open-ended QA and instruction following, which suits its role as an answer engine [19]. Third, official releases/model cards note that v3.1 introduces a “mixed thinking/non-thinking reasoning” mode, enhances tool-use/agent capabilities, and improves inference efficiency relative to V3, facilitating decoupled integration with our retrieval–reranking–evidence-citation pipeline and enabling observable alignment and controllable generation [20,21]. Overall, DeepSeek v3.1 offers advantageous Chinese capability, robust reasoning, and engineering usability well aligned with our study goals; we therefore select it as the base model for the coffee-cultivation RAG expert.

2.2. Knowledge Base Construction and Stable Citation Identifiers

Given the scarcity of Chinese literature specifically targeting Arabica coffee cultivation in Yunnan, we curated approximately 250k Chinese characters of agronomic materials covering key stages of ecological suitability, varieties and propagation, site establishment and pruning, fertilization and irrigation, shading and intercropping, pest/disease/disaster control, and harvesting and primary processing.
Data Cleaning Rules. Before ingestion and indexing, we applied lightweight normalization: UTF-8 unification for all files; removal of non-content artifacts, including headers and footers, page numbers, watermarks, and obvious layout noise; merging of improper line breaks introduced by PDF-to-text conversion; whitespace normalization, collapsing repeated spaces, tabs, and newlines and trimming leading and trailing whitespace; and unit standardization, mapping common agronomic units to a canonical symbol set and preserving the original strings in metadata for traceability.
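To make these rules concrete, the following minimal sketch illustrates the kind of normalization described above; the specific regular expressions and the unit table are illustrative placeholders rather than the exact rules in the released scripts, and the real pipeline additionally preserves the original unit strings in metadata.

```python
import re
import unicodedata

# Illustrative unit table (placeholder entries, not the full canonical set).
UNIT_MAP = {"摄氏度": "°C", "毫米": "mm", "千克": "kg"}

def normalize_text(raw: str) -> str:
    """Lightweight normalization: Unicode unification, artifact removal,
    line-break merging, whitespace collapsing, and unit standardization."""
    text = unicodedata.normalize("NFKC", raw)              # unify full-/half-width forms
    text = re.sub(r"第\s*\d+\s*页|Page\s*\d+", "", text)     # drop page-number artifacts
    # Merge improper line breaks introduced by PDF-to-text conversion
    # (a break not preceded by sentence-final punctuation).
    text = re.sub(r"(?<![。！？；])\n(?!\n)", "", text)
    text = re.sub(r"[ \t\u3000]+", " ", text)               # collapse spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)                  # collapse runs of blank lines
    for src, dst in UNIT_MAP.items():                       # canonicalize common unit strings
        text = text.replace(src, dst)
    return text.strip()
```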
For semantic-aware splitting, we adopt a sentence-level strategy for long Chinese paragraphs. The texts are first segmented into sentences and clauses by Chinese punctuation and line breaks; cosine similarity between adjacent sentence embeddings is then computed using paraphrase-multilingual-MiniLM-L12-v2, and sentences are merged or cut by combining similarity with a preset length threshold. To visualize the segmentation results and the assignment of stable identifiers, we used this semantic segmentation model to produce several real slices from Chinese agronomy texts; see Table 3.
Hyperparameters for semantic-aware splitting. We use fixed thresholds in all experiments: minimum slice length $L_{\min} = 200$ characters; maximum slice length $L_{\max} = 600$ characters; a hard slice size $L_{\mathrm{hard}} = 350$ characters when a single sentence is excessively long; and a cosine-similarity cut-off $\tau = 0.36$ computed with paraphrase-multilingual-MiniLM-L12-v2 (384-d) embeddings. We do not apply explicit overlap by default (overlap = 0) to simplify evidence accounting and evaluation; the thresholds balance context budget against local semantic continuity (and account for long-context position-sensitivity risks; see [13]). For reproducibility, the retrieval-side defaults are RERANK_CANDIDATES = 20 and FINAL_TOPK = 5, with the RRF smoothing constant RRF_K = 60.
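A minimal sketch of the sentence-level semantic splitting is given below, assuming the sentence-transformers library and the thresholds above; the greedy merge loop and the omission of the hard-slice rule are simplifications, not the exact released implementation.

```python
import re
import numpy as np
from sentence_transformers import SentenceTransformer

L_MIN, L_MAX, TAU = 200, 600, 0.36   # thresholds from this section
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def split_sentences(text: str) -> list[str]:
    # Split on Chinese sentence-final punctuation and line breaks.
    parts = re.split(r"(?<=[。！？；])|\n+", text)
    return [p.strip() for p in parts if p and p.strip()]

def semantic_chunks(text: str) -> list[str]:
    """Greedy merge of adjacent sentences into chunks, cutting when the
    embedding similarity drops below TAU or the chunk would exceed L_MAX."""
    sents = split_sentences(text)
    if not sents:
        return []
    emb = model.encode(sents, normalize_embeddings=True)
    chunks, current = [], sents[0]
    for i in range(1, len(sents)):
        sim = float(np.dot(emb[i - 1], emb[i]))   # cosine (embeddings are L2-normalized)
        if (sim < TAU and len(current) >= L_MIN) or len(current) + len(sents[i]) > L_MAX:
            chunks.append(current)
            current = sents[i]
        else:
            current += sents[i]
    chunks.append(current)   # the hard-slice rule for overly long single sentences is omitted here
    return chunks
```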
For stable citation identifiers, the system uses doc_id as the document key; the identifier of each text slice is composed of position and chunk_id. Online services render citations in the human-readable form [docid#cid], where docid denotes doc_id and cid denotes position. When offline rebuilding changes identifiers, an id_map.json alignment is applied so that online references remain unaffected. This design prevents citation drift after incremental index rebuilding and supports reproducibility and auditability.
For the vector store and metadata, embeddings are written together with doc_id, position, chunk_id, source, and text. At render time, [docid#cid] is produced according to the aforementioned alias mapping, while chunk_id, a globally increasing index for runtime diagnostics, is excluded from the final evidence citations presented to end users.
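The identifier bookkeeping can be illustrated as follows; the field names ([docid#cid], doc_id, position, chunk_id, id_map.json) come from this section, whereas the function bodies, the example record, and the assumption that id_map.json maps old tags directly to new tags are illustrative.

```python
import json
from pathlib import Path

def render_citation(doc_id: str, position: int) -> str:
    """Render the human-readable inline citation [docid#cid] (cid = position)."""
    return f"[{doc_id}#{position}]"

def remap_citation(tag: str, id_map_path: str = "id_map.json") -> str:
    """After an offline index rebuild, map an old [docid#cid] tag to its new
    identifier so online references stay valid (illustrative lookup; the real
    id_map.json schema may differ)."""
    mapping = json.loads(Path(id_map_path).read_text(encoding="utf-8"))
    return mapping.get(tag, tag)       # unchanged tags pass through untouched

# Example metadata written to the vector store for each slice (field names per Section 2.2).
record = {
    "doc_id": "coffee_manual_01",      # document key
    "position": 12,                    # cid used in inline citations
    "chunk_id": 3127,                  # globally increasing runtime index (not shown to users)
    "source": "coffee_manual_01.pdf",
    "text": "…normalized slice text…",
}
print(render_citation(record["doc_id"], record["position"]))  # -> [coffee_manual_01#12]
```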

2.3. Hybrid Retrieval and Fusion (RRF)

For a given query, we perform dual-channel retrieval in parallel. The first channel is BM25-based sparse retrieval with Chinese word segmentation, yielding a candidate set $C_{\mathrm{bm25}}$; the second encodes the query and document slices with a multilingual sentence-embedding model and performs similarity search to obtain $C_{\mathrm{dense}}$. The results are merged by set union, $C = C_{\mathrm{bm25}} \cup C_{\mathrm{dense}}$. To mitigate tokenization noise caused by domain terminology and unit expressions in Chinese, a domain lexicon (e.g., cultivar names, pests/diseases or nutrients, toponyms, and common units) is incorporated at the segmentation stage. We then fuse multi-channel candidates using reciprocal rank fusion (RRF), which has demonstrated strong cross-channel robustness and is widely adopted in both academic and production search systems [22,23,24,25,26].
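As an illustration of the sparse channel, the sketch below combines jieba with rank-bm25 (both listed in Appendix A.5); the lexicon file name and the candidate count are placeholders, not the project's actual configuration.

```python
import jieba
from rank_bm25 import BM25Okapi

# Domain lexicon (cultivars, pests/diseases, toponyms, units); the file name is illustrative.
jieba.load_userdict("coffee_domain_lexicon.txt")

def tokenize(text: str) -> list[str]:
    return [t for t in jieba.lcut(text) if t.strip()]

def build_bm25(chunks: list[str]) -> BM25Okapi:
    return BM25Okapi([tokenize(c) for c in chunks])

def bm25_topn(bm25: BM25Okapi, query: str, n: int = 20) -> list[tuple[int, float]]:
    """Return (chunk index, BM25 score) pairs for later RRF fusion."""
    scores = bm25.get_scores(tokenize(query))
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    return ranked[:n]
```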
To integrate heterogeneous channels in a stable manner, we aggregate ranks with RRF:
$$\mathrm{RRF}(d) = \sum_{m \in M} \frac{1}{k + \mathrm{rank}_m(d)}, \qquad k = 60. \qquad (1)$$
The constant $k = 60$, a community-standard value shown to be near-optimal yet not critical across TREC-style metaranking, acts as a smoothing term that dampens outlier high ranks so that no single channel dominates, and it is widely adopted in hybrid retrieval pipelines [26,27,28]. Here, $M$ denotes the set of participating channels (BM25 and dense retrieval in our case), and $\mathrm{rank}_m(d)$ is the rank of candidate $d$ in channel $m$. Candidates are globally sorted by the fused score in Equation (1); the top-$K$ items are then passed to the cross-encoder reranking stage.
To prevent the same slice from being double-counted across channels, we deduplicate by the key (source, chunk_id). During generation and evaluation, evidence is presented with the stable identifier [docid#cid], consistent with the mapping defined in Section 2.2.
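A compact implementation of Equation (1) with cross-channel deduplication might look as follows; the candidate dictionaries and the example at the end are illustrative, while the key (source, chunk_id) and k = 60 follow the text above.

```python
from collections import defaultdict

RRF_K = 60  # smoothing constant k in Equation (1)

def rrf_fuse(channels: list[list[dict]], k: int = RRF_K, top_k: int = 20) -> list[dict]:
    """Fuse ranked candidate lists from multiple channels (e.g., BM25 and dense)
    with reciprocal rank fusion, deduplicating by (source, chunk_id)."""
    scores = defaultdict(float)
    best = {}                                       # keep one representative per key
    for ranked in channels:
        for rank, cand in enumerate(ranked, start=1):
            key = (cand["source"], cand["chunk_id"])
            scores[key] += 1.0 / (k + rank)
            best.setdefault(key, cand)
    fused = sorted(scores, key=scores.get, reverse=True)
    return [best[key] for key in fused[:top_k]]

# Example: two channels returning overlapping candidates.
bm25 = [{"source": "a.pdf", "chunk_id": 1}, {"source": "a.pdf", "chunk_id": 7}]
dense = [{"source": "a.pdf", "chunk_id": 7}, {"source": "b.pdf", "chunk_id": 2}]
print([c["chunk_id"] for c in rrf_fuse([bm25, dense])])  # -> [7, 1, 2]; chunk 7 is boosted by both channels
```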

2.4. Cross-Encoder Reranking

Sparse/dense retrieval provides broad coverage but limited granularity for relevance estimation, which makes it difficult to robustly capture domain-specific details. To improve the usability and faithfulness of candidate evidence, we rerank the fused candidate set with a cross-encoder aimed at fine-grained semantic alignment. We employ bge-reranker-base as the cross-encoder and compute a relevance score $s(q, d)$ for each pair $(q, d)$; unlike bi-encoders, the cross-encoder jointly encodes the query and the text slice and is more sensitive to details such as numeric thresholds, unit conversions, conditional constraints, and entity references [29,30,31,32].
Let the fused candidate set be $C = \{d_i\}_{i=1}^{N}$ (default $N = 20$, RERANK_CANDIDATES). The reranking objective is:
$$\mathrm{TopK}(C) = \operatorname*{arg\,max}_{d \in C}^{K}\, s(q, d). \qquad (2)$$
Here, $\operatorname{arg\,max}^{K}$ returns the set of the top-$K$ items with the highest score $s(q, \cdot)$; $K$ denotes the number of evidence slices retained after reranking (default $K = 5$), and $N$ is the size of the fused candidate set (default $N = 20$). We pass $\mathrm{TopK}(C)$ as the evidence to the generation stage. In implementation, we use batched inference and GPU acceleration to control the forward cost; reranking only changes the order of candidates and does not modify the text or metadata, ensuring the traceability of inline citations [docid#cid] (cf. Section 2.2).
Let $N$ denote the number of fused candidates; the forward complexity of the cross-encoder is $O(N)$. We cap $N \le 20$ to balance quality and efficiency and rely on batching plus GPU acceleration to keep end-to-end latency acceptable. During reranking, only the candidate order is changed; content and metadata remain intact, so the evidence citations [docid#cid] remain stable (see Section 2.2). In our agronomic scenario, reranking markedly reduces errors such as unit/threshold/entity mismatches; prior BERT-based reranking studies consistently report improvements in nDCG/MRR on MS MARCO/BEIR, with the primary cost being inference latency [29,30,31,32]. Under our setting ($N \le 20$, $K = 5$), the additional latency is acceptable.
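The reranking step can be sketched with the sentence-transformers CrossEncoder wrapper as below; the batch size and maximum sequence length are illustrative choices, not reported settings.

```python
from sentence_transformers import CrossEncoder

# bge-reranker-base scores (query, passage) pairs jointly; higher = more relevant.
reranker = CrossEncoder("BAAI/bge-reranker-base", max_length=512)

def rerank(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    """Score fused candidates with the cross-encoder and keep the Top-K.
    Only the order changes; text and metadata (and thus [docid#cid]) are untouched."""
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs, batch_size=16)      # batched forward passes
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_k]]
```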

2.5. History-Aware Query Rewriting and In-Prompt History Injection (HAR and IHI)

History-Aware Query Rewriting (HAR). To mitigate under-retrieval and semantic drift caused by short queries or omitted context, HAR constructs an augmented query from the most recent t turns of the session:
$$q' = f\!\left(q, H_t\right), \qquad \text{default } t = 2, \qquad (3)$$
where $H_t$ denotes the last $t$ conversational turns. In practice, we read the recent messages from the session store to generate $q'$, which is used in two places: once for retrieval (executing recall with $q'$) and once for generation (assembling prompts with $q'$ and the Top-K evidence). This aligns the retrieval intent with the generation context and reduces mismatches between cited evidence and surface wording.
In-Prompt History Injection (IHI). To maintain multi-turn consistency, IHI injects the most recent N turns as a message sequence into the generation prompt (default N = 8). To control the context length, we apply fixed-length clipping: retain the first L characters of the assistant replies and the last L characters of the user inputs (default L = 800). This preserves key conclusions and the latest constraints while limiting latency and off-topic risks.
HAR affects both the retrieval and generation prompts, whereas IHI only affects the construction of the generation prompt. Neither method rewrites evidence text or metadata; hence, the stability of the inline citation [docid#cid] and the evaluation protocol remain unaffected. By default, we use t = 2, N = 8, and L = 800; these settings are empirical engineering choices (combining common practice with small preliminary sensitivity checks), motivated by prior evidence on long-context position sensitivity [13]. The exact values are not specified by [13] and are selected to balance effectiveness and latency under our task scale.
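Both mechanisms reduce to a small amount of prompt-side bookkeeping, sketched below with the defaults t = 2, N = 8, and L = 800; the simple concatenation standing in for the rewriting function f and the role/content message schema are assumptions, not the actual session-store format.

```python
def har_rewrite(query: str, history: list[dict], t: int = 2) -> str:
    """History-Aware Query Rewriting: fold the most recent t turns into an
    augmented retrieval query q' (concatenation stands in for f here)."""
    recent = history[-2 * t:]                               # t turns ~ 2*t messages
    context = " ".join(m["content"] for m in recent).strip()
    return f"{context} {query}".strip() if context else query

def ihi_messages(history: list[dict], n: int = 8, clip: int = 800) -> list[dict]:
    """In-Prompt History Injection with fixed-length clipping: keep the first
    L characters of assistant replies and the last L characters of user inputs."""
    injected = []
    for m in history[-n:]:
        text = m["content"][:clip] if m["role"] == "assistant" else m["content"][-clip:]
        injected.append({"role": m["role"], "content": text})
    return injected
```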

2.6. Evaluation Data and Gold Construction

Given the absence of public Chinese agronomy QA benchmarks and no existing benchmark specifically targeting Yunnan Arabica coffee cultivation, we automatically constructed a companion gold set on top of our in-house corpus (see Section 2.2). The set contains 346 question–answer–citation triplets covering ecological suitability, cultivation management, and pest/disease topics. To rapidly establish a reproducible evaluation baseline in this niche domain where fragment-level gold labels are scarce, we adopt an LLM-based synthesis approach and draw on prior work that uses LLMs as annotators or synthetic data sources [33,34].
The gold samples are generated by GPT-5 Thinking over normalized corpus slices. The model is instructed to produce practice-oriented questions and reference answers for agronomists and to include mandatory inline evidence tags in the format [docid#cid]. We retain only those cases in which all inline citations are parsable and can be resolved to specific chunks; citations are normalized to the paper-wide identifier space (docid ≡ doc_id, cid ≡ position; cf. Section 2.2), and historical IDs are aligned offline via id_map.json to ensure online consistency. We employ the model in a reasoning mode to improve inference quality, but forbid any chain-of-thought text in outputs; the pipeline preserves only structured fields (query, short answer, [docid#cid] citations, metadata), and we do not collect, store, or release any CoT content (see Gold-QA synthesis prompts in Appendix A—Implementation Details for Reproducibility).
To reduce noise and bias in synthesized gold, we enforce automatic checks: citation parse rate is a hard admission criterion; each question is restricted to 1–3 citations; duplicate items and overly short or overly generic questions are removed; and topic coverage across the seven cultivation stages is balanced. We conduct a full manual audit of all 346 items using simple, reproducible criteria: the inline citation must resolve to the cited chunk and substantively support the answer; answers with factual inconsistencies or unsupported claims are revised or discarded; ambiguous prompts and near-duplicates are removed; the 1–3 citation constraint is enforced; and minor wording is harmonized without altering the evidence. This workflow follows emerging practice of using LLMs as annotators with human verification and quality control [33,35].
For model comparison, we adopt a citation-set consistency metric: for each question, we compare the predicted citation set with the gold citation set and compute precision, recall, and F1, followed by macro-averaging across samples. This metric depends only on the evidence sets rather than natural-language wording, thereby reducing the impact of stylistic differences. Considering that the gold is LLM-generated and may carry stylistic or preference biases, we mitigate risks via strict parsing and manual checks, and by using citation-set–based objective scoring [33,34,36].
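The metric itself is straightforward to implement: parse the inline [docid#cid] tags from each prediction, compare the resulting set with the gold set, and macro-average the per-sample scores, as in the sketch below (the tag-parsing regular expression is an assumption about the exact tag format).

```python
import re

TAG = re.compile(r"\[([^\[\]#]+)#(\d+)\]")   # matches inline tags such as [some_doc#12]

def citation_set(text: str) -> set[tuple[str, str]]:
    """Extract the set of (docid, cid) pairs cited inline in an answer."""
    return set(TAG.findall(text))

def prf1(pred: set, gold: set) -> tuple[float, float, float]:
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def macro_f1(predictions: list[str], gold_sets: list[set]) -> float:
    """Per-sample macro F1 over citation sets (chunk level by default)."""
    f1s = [prf1(citation_set(pred), gold)[2] for pred, gold in zip(predictions, gold_sets)]
    return sum(f1s) / len(f1s)
```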

3. Results

3.1. Overall Performance

We use citation-based F1 as the primary metric, defined as per-sample macro F1 over inline evidence tags [docid#cid]: for each question, we compute precision/recall/F1 by matching the predicted citation set to the gold citation set and then macro-average across samples (default at the chunk level). The gold set is automatically constructed by GPT-5 Thinking on the same corpus with automatic checks and a full manual audit of all 346 items (Section 2.6); hence, the reported F1 reflects consistency with the evidence-bearing corpus and absolute values should be further verified on independent human-authored sets.
Our Model (Full RAG) combines hybrid retrieval (BM25 + dense with RRF), cross-encoder reranking (bge-reranker-base), and generation with inline evidence tagging by DeepSeek v3.1 (with a minimal backfill mechanism when needed). The Simple RAG baseline performs dense-only retrieval from Chroma and generates directly (without BM25, RRF, reranking, or HAR/IHI); prompts and generation settings are otherwise aligned.
Over three independent runs, Our Model achieves 0.813 ± 0.001 citation-F1, significantly higher than the Simple RAG baseline at 0.573 ± 0.003. Precision and recall improve in tandem, and Our Model shows the smallest cross-run variation (SD of F1 = 0.001 ). See Figure 2 for the bar chart comparison and Section 3.3 for ablation studies.
We further quantify uncertainty and statistical significance. We estimate 95% confidence intervals with item-level non-parametric bootstrap using 5000 replicates. Each replicate resamples the 346 questions with replacement; for each question we first average F1 across the three runs, then average across questions; the interval is given by the 2.5th and 97.5th percentiles of the bootstrap means. For significance, we apply an item-level paired permutation test by randomly flipping the signs of the per-item F1 differences between Our Model and the Simple RAG baseline over 5000 permutations and computing a two-sided p value. At the document level, Our Model attains a macro-F1 of 0.813 with 95% CI [0.781, 0.843], while the Simple RAG baseline attains 0.573 with 95% CI [0.529, 0.617]. The paired mean difference is +0.240 with 95% CI [0.196, 0.284], and the permutation test yields p = 0.0002 , indicating a statistically significant improvement. The bootstrap histograms are shown in Figure 3 and Figure 4.
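Both procedures can be reproduced with standard NumPy resampling, as sketched below; the per-item inputs are assumed to be run-averaged F1 values and run-averaged paired differences, matching the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(per_item_f1: np.ndarray, reps: int = 5000, alpha: float = 0.05):
    """Item-level non-parametric bootstrap CI for the macro-averaged F1."""
    n = len(per_item_f1)
    means = np.array([per_item_f1[rng.integers(0, n, n)].mean() for _ in range(reps)])
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

def paired_permutation_p(per_item_diff: np.ndarray, reps: int = 5000) -> float:
    """Two-sided paired permutation test via random sign flips of per-item F1
    differences (a smoothed variant adds 1 to numerator and denominator)."""
    observed = abs(per_item_diff.mean())
    signs = rng.choice([-1.0, 1.0], size=(reps, len(per_item_diff)))
    null_means = np.abs((signs * per_item_diff).mean(axis=1))
    return float((null_means >= observed).mean())
```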

3.2. Human Evaluation

We establish a human evaluation protocol for agronomic QA with four labels: (a) conforms to agronomic consensus (correct and sufficiently supported); (b) contains incorrect content; (c) content omission (missing key facts/conditions); (d) potential bias in wording or framing. This setup follows recent human-evaluation practices that assess factuality, completeness, and readability beyond citation matching, providing a direct view of end-answer quality [37].
An agronomy professional evaluates outputs from two systems: the Simple RAG baseline and Our Model. Each system is assessed on 1038 records (346 questions × 3 runs), totaling 2076 judgments. A single record may receive multiple labels (e.g., correct yet slightly incomplete), so the label proportions for a system can sum to more than 100%. Unified guidelines and examples are provided prior to labeling to reduce subjective variance and improve reproducibility.
Figure 5 visualizes the results. Relative to the Simple RAG baseline, Our Model shows a higher share of label (a) “conforms to consensus” (42.9% → 82.9%), lower shares of (b) “incorrect content” (4.6% → 2.3%) and (c) “content omission” (56.9% → 21.2%), and a small change in (d) “potential bias” (6.5% → 5.6%). These outcomes are consistent with our citation-level gains: hybrid retrieval plus cross-encoder reranking improve fragment-level alignment and evidence usability, which appears in human judgments as higher correctness and fewer omissions.

3.3. Ablation Studies

We ablate Our Model by disabling or replacing exactly one component at a time while keeping the rest identical: B1 removes the cross-encoder reranker; B2a uses dense-only retrieval; B2b uses BM25 (sparse-only); B3 downgrades the embedding for semantic chunking/retrieval from paraphrase-multilingual-MiniLM-L12-v2 to all-MiniLM-L6-v2; B4 replaces the cross-encoder bge-reranker-base with the lighter ms-marco-MiniLM-L-6-v2; B5 disables HAR & IHI. All results are reported as citation-based F1 (per-sample macro). Numerical results are summarized in Table 4.
Relative to Our Model, disabling reranking (B1) reduces F1, and replacing it with a lighter reranker (B4) yields an even larger drop, indicating that fine-grained matching directly affects evidence usability and the accuracy of fragment localization; thus, reranking is key. BM25-only (B2b) comes close to Our Model, whereas dense-only (B2a) drops substantially; within hybrid retrieval, the sparse (BM25) channel outperforms the dense channel and plays the primary role. Chinese domain terms (cultivars, pest/disease names, units/thresholds) contribute more to recall in our task than semantic similarity alone. Downgrading the embedding (B3) causes a moderate decline, suggesting that the consistency and quality of chunking/retrieval embeddings matter, but their marginal effect is smaller than that of reranking and BM25. Query rewriting and in-prompt history injection provide limited gains in single-turn settings: on our data (mostly single-turn, well-posed queries), disabling them (B5) changes results only slightly, implying that their main value lies in multi-turn clarification scenarios.
To validate the effectiveness of our semantic chunking, we compare a fixed-length sliding-window chunking scheme against Our Model (Full RAG), which performs sentence merging/splitting via MiniLM-based cosine similarity. As shown in Table 5, on the same evaluation set, the MiniLM-based semantic merge/split achieves a citation-based F1 (per-sample macro) improvement of +0.037 over the sliding-window scheme, confirming that semantic merge/split is more effective than a naive fixed window.

3.4. Error Analysis

The error distribution is dominated by fragment-level mismatch and missing evidence: WrongChunk denotes retrieving the correct document but selecting the wrong chunk or drifting at boundaries (46.1%); MissingEvidence denotes that the gold fragment is not covered or is not converted into an inline citation (e.g., insufficient Top-K coverage or failure to cite during generation; 34.4%); WrongDoc denotes that the retrieved or top-ranked document is incorrect (19.5%); and NoCitation denotes that no inline citation is produced (0.04%, almost negligible). These categories correspond, respectively, to chunk granularity and boundary/overlap settings, candidate-pool coverage and citation conversion at generation time, retrieval-channel bias and tokenization noise, and prompt/threshold choices that suppress citation emission.
These results indicate that, in most cases, the system does retrieve the correct document(s) but either selects the wrong fragment at the chunk level or fails to cover the gold fragment. A secondary pattern is that the candidate pool contains the correct information but the generator does not convert it into inline citations (or the pool still lacks a critical piece of evidence). Given our pipeline, these errors are closely tied to (i) chunking granularity (length and semantic boundaries), (ii) candidate-pool coverage (Top-K after RRF fusion), and (iii) the cross-encoder’s ability to capture fine-grained matches involving conditions, thresholds, units, and entity references. The detailed distribution of these error types is presented in Figure 6.
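Because citations use stable identifiers, the four categories can be approximated mechanically from the predicted and gold citation sets, as in the sketch below; the precedence rules are an illustrative simplification and may not match the exact assignment used for Figure 6.

```python
def classify_error(pred: set[tuple[str, str]], gold: set[tuple[str, str]]) -> str:
    """Assign one label per question from predicted/gold [docid#cid] sets.
    The precedence here is an illustrative choice, not the paper's exact rule."""
    if not pred:
        return "NoCitation"                      # no inline citation produced at all
    if gold <= pred:
        return "Correct"                         # all gold chunks are cited
    pred_docs = {doc for doc, _ in pred}
    gold_docs = {doc for doc, _ in gold}
    if pred_docs.isdisjoint(gold_docs):
        return "WrongDoc"                        # no gold document was cited
    if pred & gold:
        return "MissingEvidence"                 # partial coverage of the gold chunks
    return "WrongChunk"                          # right document(s), wrong chunk(s)
```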

3.5. Latency and Cost Breakdown

The end-to-end latency distribution exhibits a moderate median and a generation-dominated long tail. The percentile-level latency distribution is visualized in Figure 7. In the component-wise average cost, the LLM accounts for a very large share, whereas the cross-encoder reranker, BM25, and dense retrieval contribute only small shares, and RRF fusion is negligible; the remaining overhead comes from prompt construction, I/O, and framework-level operations. Overall, the bottleneck lies almost entirely on the generation side, while retrieval and reranking add only marginal engineering overhead, yielding a favorable quality–latency trade-off (see Section 3.3). The cost distribution across all components is presented in Figure 8. For a detailed view of the smaller-cost components that are difficult to discern in the main figure, Figure 9 provides a zoomed-in display of BM25, dense retrieval, RRF, and the reranker.

4. Discussion

4.1. Restatement of Main Findings and Contributions

In the knowledge-intensive and terminology-heavy context of Chinese agronomy for Yunnan Arabica coffee, we systematically validate an evidence-grounded RAG pipeline. Compared with the Simple RAG baseline, our model achieves a significant improvement in citation-based F1 with very small cross-run variance, indicating stable performance. Ablations show that fine-grained reranking (cross-encoder) and Chinese sparse retrieval (BM25 plus a domain lexicon) are decisive components; dense retrieval complements coverage and semantic recall but cannot, on its own, guarantee chunk-level hits. From an engineering standpoint, we implement semantic-aware chunking, stable inline citations [docid#cid], and a lightweight backfill mechanism, yielding traceable and auditable answers while naturally supporting hot updates to the knowledge base. This maintains faithfulness and maintainability without fine-tuning the base model (see Section 2.3, Section 2.4, Section 2.5, Section 2.6, Section 3.1, Section 3.2, Section 3.3, Section 3.4 and Section 3.5).

4.2. Insights from the Error Structure

Error analysis shows that WrongChunk (correct document but wrong chunk) and MissingEvidence (omitted evidence) dominate at approximately 46.1% and 34.4%, respectively, whereas WrongDoc and NoCitation are much smaller (see Section 3.4). This distribution suggests two principal bottlenecks: (i) insufficient fragment-level recall due to chunking and coverage, and (ii) failure in the reranking/generation stages to reliably convert retrieved evidence into inline citations. Targeted improvements include relaxing or optimizing chunking strategies (e.g., sliding windows/overlap), enlarging the candidate pool and strengthening the cross-encoder’s fine-grained discrimination over conditions, thresholds, units, and entities, and introducing sentence-level tagging constraints and consistency checks at the prompt level (see Section 2.2 and Section 2.4).
This error pattern is consistent with conclusions in the retrieval/RAG literature: In evidence-grounded QA, the real bottleneck is not merely recalling the right document but aligning to the correct fragment within it. Large heterogeneous IR evaluations (e.g., BEIR) indicate that BM25 is a robust lower bound, whereas reranking or late-interaction models deliver the largest gains at passage/chunk granularity albeit at higher computational cost. This matches our ablations: Once the cross-encoder reranker is removed or weakened, citation-F1 drops markedly, implying that for fine-grained factors such as terminology, thresholds, and units, landing on the correct fragment is the deciding factor [29,38].
Long-context position sensitivity further aggravates WrongChunk: Prior work shows that models tend to under-utilize evidence placed in the middle of long contexts, which makes chunk boundaries/overlap and the ordering of evidence non-trivial design choices for RAG. In practice, modest overlap or sliding-window alternatives can help recover fragment-level recall, but they still need to be paired with reranking to avoid injecting loosely related fragments [13].
Taken together, the 46.1% share of WrongChunk is not incidental but a commonly reported RAG failure mode: documents are retrieved, yet fragment-level alignment remains limiting. In follow-up experiments, we will explore dynamic chunking to mitigate WrongChunk, adapting window size and overlap to punctuation/entity density and the presence of numbers/units, with a sliding-window fallback when multiple thresholds/units co-occur to reduce boundary-crossing errors [13,29,38].

4.3. Implications for Latency and Cost

The end-to-end median latency is on the order of 6–7 s, with the long tail mainly driven by generation. The LLM accounts for the overwhelming share of runtime, whereas BM25, dense retrieval, and reranking are at the millisecond level, and RRF is negligible. Consequently, end-to-end latency is almost entirely dominated by generation, while retrieval, fusion, and reranking contribute only a minor share; under our current configuration, modestly strengthening retrieval/reranking can yield substantial fragment-level alignment gains at very small additional latency (see Section 3.3). As a practical caveat, increasing Top-K evidence enlarges the prompt and can raise generation time; we therefore keep K = 3 –5 in practice to balance quality and latency. Looking ahead, latency optimization should primarily target the LLM side. Practical options include capping the maximum generation length, enabling streaming output and KV-cache reuse, and introducing lightweight model routing so that simpler queries are served by a smaller model, thereby reducing tail latency without sacrificing traceability.

4.4. Limitations of Corpus Scale and Gold Labels

Our knowledge base contains roughly 250k Chinese characters (see Section 2.2). While it covers core topics such as ecological suitability, cultivation management, and pest/disease control, it remains limited with respect to regional variation, operational thresholds, cultivar differences, and micro-scenarios (e.g., management under different elevations/microclimates). Scaling the corpus is therefore a priority for both engineering and methodology: to improve coverage and external validity, we plan to expand the knowledge base by at least one order of magnitude (from ∼0.25 M toward ≥1 M characters) while emphasizing diversity and temporal coverage.
The 346-item gold set was initially synthesized by a large language model and evaluated with a citation-set metric. We acknowledge a circularity risk: testing an LLM-based system against LLM-synthesized targets may partly measure alignment with another model’s retrieval and phrasing patterns rather than real-world usefulness [39]. To mitigate this, we have conducted a full manual audit of all 346 items by agricultural professionals under consistent acceptance criteria. Nevertheless, further mitigation is required in subsequent experiments: we will retain the cross-model evaluation setup while adding an independent, human-authored test set with dual annotation and adjudication, and we will include adversarial and long-context stress tests to better probe fragment-level alignment and practical utility [40,41].

4.5. On HAR/IHI Effectiveness and Improvements

HAR and IHI are intended for multi-turn consistency, addressing coreference, ellipsis, and semantic drift; therefore, under a largely single-turn, well-posed evaluation, their marginal gains are expected to be limited. As shown in Table 4, disabling HAR/IHI yields only a small change (F1: 0.813 → 0.808). This indicates that under the present task configuration HAR/IHI are not the primary source of gains; it does not, however, preclude their utility in multi-turn settings with coreference, ellipsis, or semantic drift. Prior conversational IR work consistently reports that rewriting conversational history into a standalone, retrievable query improves effectiveness, with a stable gap between human and automatic rewrites (e.g., reported improvements in TREC CAsT 2019) [42]; public CQR resources and methods (e.g., CANARD, explicit query rewriting) offer transferable techniques [43,44]. Accordingly, our future work is twofold: replacing heuristic HAR with supervised CQR and reporting rewrite quality and retrieval gains on CAsT-style multi-turn sessions; and upgrading IHI from a fixed-window injection to budgeted, salience-driven summaries/memory, coupled with generation-time self-critique (e.g., on-demand retrieval and reflective critique in Self-RAG) to mitigate history noise and off-topic injection [16]. Given long-context position sensitivity, HAR/IHI history lengths and injection policies should remain bounded and be co-tuned with reranking [13].

5. Conclusions

We developed and evaluated an evidence-grounded retrieval-augmented generation (RAG) system for Chinese agronomy in the Yunnan Arabica coffee setting. Leveraging hybrid retrieval (BM25 + dense with RRF fusion), a cross-encoder reranker, and inline evidence tags [docid#cid], the system improves citation-based F1 (per-sample macro) over a Simple RAG baseline without fine-tuning the base model. The outputs are traceable and auditable, and they immediately reflect new evidence as the knowledge base is updated. Empirically, Chinese sparse retrieval (with a domain lexicon) and fine-grained reranking are decisive for chunk-level grounding, whereas the generation stage dominates end-to-end latency and the time overheads of retrieval and reranking are comparatively negligible.
Built on a normalized corpus of roughly 250k Chinese characters specific to Yunnan coffee, we constructed a localized knowledge vector base. We further used GPT-5 Thinking to automatically create and filter a small companion gold set on top of semantic slices; this set supports both system evaluation and ablations and can serve as a reproducible verification probe for Chinese agronomy. By design, the gold items enforce parsable citations, cap the number of inline references (1–3), and maintain topical coverage. Our evaluation adopts citation-set precision/recall/F1 to mitigate the influence of stylistic variation in natural-language answers.
On the engineering side, we implemented lightweight normalization adapted to Chinese technical text, semantic-aware chunking, cross-encoder reranking, and stable citation identifiers, together with a fallback tagging mechanism to ensure minimal evidence coverage in answers. On the research side, systematic ablations and latency breakdowns clarify the gains and costs of each component, providing an operational baseline for larger-scale deployments and reproducible studies in agricultural knowledge services.

Author Contributions

Conceptualization, Z.C. and J.Y.; software, Z.C.; investigation, Z.C.; validation, Z.C. and Z.J.; formal analysis, Z.J.; writing—original draft preparation, Z.C.; writing—review and editing, Z.C.; funding acquisition, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Yunnan Provincial Science and Technology Talent and Platform Program (Academician Expert Workstation) under Grant No. 202405AF140013 and the Yunnan Provincial Major Science and Technology Special Project under Grant No. 202502AE090019.

Data Availability Statement

The RAG implementation code, evaluation scripts, and configuration files supporting this study are openly available on GitHub and archived under the “RAG” release tag: https://github.com/Luc-GPU/coffee-rag-yunnan-arabica/releases/tag/RAG (accessed on 31 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Implementation Details for Reproducibility

Appendix A.1. Note on Prompt Templates

All production prompt templates are authored in Chinese. For reproducibility, we provide faithful English translations below. Semantics, constraints, and formatting instructions are unchanged except for language.

Appendix A.2. LLM Decoding Hyperparameters

Unless otherwise noted, we use the following: max_tokens = 512, temperature = 0.3, top_p = 0.95, frequency_penalty = 0, presence_penalty = 0. These settings match the configuration used for the reported results.
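For reference, these settings map directly onto an OpenAI-compatible chat-completion request; the sketch below assumes the DeepSeek endpoint is reached through the openai Python client, and the base URL, model identifier, and environment variable are assumptions rather than values reported in this paper.

```python
import os
from openai import OpenAI

# Assumed OpenAI-compatible access to DeepSeek; base_url and model name are illustrative.
client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"],
                base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-chat",            # placeholder model identifier
    messages=[
        {"role": "system", "content": "You are an agronomy QA assistant..."},
        {"role": "user", "content": "Question plus Top-K evidence with [docid#cid] tags"},
    ],
    max_tokens=512,                   # decoding settings from Appendix A.2
    temperature=0.3,
    top_p=0.95,
    frequency_penalty=0,
    presence_penalty=0,
)
print(response.choices[0].message.content)
```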

Appendix A.3. Chunking Thresholds (Semantic-Aware Splitting)

We use fixed thresholds across all experiments: minimum slice length L min = 200 characters; maximum slice length L max = 600 characters; hard slice size L hard = 350 characters for overly long sentences; cosine-similarity cut-off τ = 0.36 with paraphrase-multilingual-MiniLM-L12-v2 (384-d) embeddings; default overlap = 0 . See Section 2.2 for details.

Appendix A.4. Retrieval and Reranking Defaults

RERANK_CANDIDATES = 20 , FINAL_TOPK = 5 , RRF smoothing constant RRF_K = 60 . The cross-encoder reranker is bge-reranker-base unless otherwise stated.

Appendix A.5. Hardware and Throughput Settings

All latency results were measured on a local workstation (Windows + Anaconda, Python 3.10.x). Reranking is batched and GPU-accelerated when available; BM25 and dense retrieval are CPU-bound. End-to-end latency is generation-dominated (see Section 3.5). Device details used for reporting: GPU: NVIDIA GeForce RTX 2070; GPU driver: GeForce Game Ready 577.00; CUDA: 12.1 (PyTorch build +cu121); CPU: Intel Core i7-9750H @ 2.60 GHz; RAM: 16 GB. Key libraries (project requirements): torch 2.5.1 + cu121; sentence-transformers 2.7.0; transformers 4.41.2; langchain 0.1.14; chromadb 0.4.24; huggingface-hub < 0.26.0; PyMuPDF 1.26.3; python-docx 1.1.0; rank-bm25 0.2.2; numpy 1.26.4; requests 2.31.0.

Appendix A.6. Gold-QA Synthesis Prompts (English Translations)

Note. These templates are used only for synthesizing the gold Q&A items from corpus slices. The model runs in a reasoning mode to improve inference quality, but chain-of-thought text is forbidden in outputs; only structured fields are preserved.
(G1) Gold-QA Generator (single-span, atomic QA)
  • You are a domain editor creating evidence-grounded gold QA in Chinese for Yunnan Arabica coffee.
  • INSTRUCTIONS:
    (1) Read the provided SOURCE SPAN (exactly one chunk). Compose ONE atomic question that can be answered solely from this span.
    (2) Produce a SHORT answer (<= 40 Chinese characters when possible), strictly supported by the span.
    (3) Add 1–3 inline citations in the form [<doc_id>#<cid>] that point to THIS span only.
    (4) Avoid background, hedging, policy advice, or multi-source synthesis. No chain-of-thought in output.
    (5) Enforce the JSON schema exactly and return ONE JSON object only.
  • INPUT: doc_id: {{DOC_ID}}; cid: {{CID}}; source_excerpt: {{TEXT}}; taxonomy_hint: {{OPTIONAL_TOPIC_HINT}}
  • OUTPUT JSON fields: id, query, answer, citations, refs, topic, difficulty, type, gold_citations, gold_answer
(G2) Gold-QA Validator (deterministic fact check)
  • You are a fact validator. Given a QA JSON and the cited SOURCE SPAN, decide:
    - SUPPORT: Does the span explicitly entail the answer?
    - NUMERIC: Are all numbers/units/ranges present in the span and copied faithfully?
    - SCOPE: Is the answer limited to span facts (no external claims)?
  • Return: { "pass": true|false, "reasons": "…" }
  • Rules: temperature = 0.0; top_p = 1.0; do not rewrite the QA. Fail if any criterion is not met.
(G3) Gold-QA Standardizer & Deduper
  • Normalize the QA JSON:
    - Canonicalize units (Celsius, mm) and remove redundant punctuation/markdown.
    - Ensure the "answer" stays concise; drop extra wording.
    - Compare against a provided list of existing queries; if semantic similarity >= 0.9, mark as duplicate.
  • Return the normalized JSON and a flag: { "duplicate": true|false }

Appendix A.7. Gold-QA Decoding Hyperparameters (Overrides)

Unless otherwise noted: temperature = 0.2 (G1), temperature = 0.0 (G2/G3), top_p = 0.95, max_tokens = 256. These override the general decoding defaults for the gold-QA pipeline only.

Appendix A.8. Prompt Templates (English Translations)

  • (A) System prompt
    You are an agronomy QA assistant for the "Yunnan Arabica coffee" scenario. Answers must be grounded in the provided evidence, and you must add inline evidence tags [docid#cid] immediately after key factual statements. Do not fabricate facts or cite unseen sources. If no suitable evidence is available, explain the gap and append "References: […]" at the end. Keep answers concise, professional, and traceable. When thresholds/units/conditions are involved, place [docid#cid] after the corresponding sentence.
  • (B) HAR (history-aware query rewriting) prompt
    Task: Using the most recent t = 2 turns, rewrite the user need into a standalone, retrievable query. Preserve entities, thresholds, units, locations, and time constraints; remove chit-chat and irrelevant content.
    Output: Only the rewritten query, with no explanations.
  • (C) Generation prompt skeleton
    Task: Answer the user’s question and include inline evidence tags [docid#cid] after key facts.
    Evidence (Top-K corpus slices, each with [docid#cid]):
    <evidence_1_text>  [docid#cid]
    <evidence_2_text>  [docid#cid]
    Output constraints:
    (1) Use only the provided evidence; if insufficient, explain what is missing and append "References: […]" at the end.
    (2) When numbers/thresholds/units/conditions are cited, tag the sentence with [docid#cid].
    (3) Language: Chinese. Style: practitioner-oriented, concise, accurate, actionable.

References

  1. Wikipedia Contributors. Yunnan. Wikipedia, the Free Encyclopedia. Available online: https://en.wikipedia.org/wiki/Yunnan (accessed on 16 October 2025).
  2. Food and Agriculture Organization of the United Nations (FAO). Coffee: Introduction. FAO Corporate Document Repository (X6939e01). Available online: https://www.fao.org/3/x6939e/x6939e01.htm (accessed on 16 October 2025).
  3. FAO. Arabica Coffee Manual for Lao PDR; FAO: Bangkok, Thailand, 2005; Available online: https://www.fao.org/3/ah833e/ah833e.pdf (accessed on 16 October 2025).
  4. FAO. Arabica Coffee Manual for Myanmar; FAO: Rome, Italy, 2015; Available online: https://openknowledge.fao.org/items/ba78b670-0947-4e73-ad10-091009c0dfc3 (accessed on 16 October 2025).
  5. Wang, X.; Ye, T.; Fan, L.; Liu, X.; Zhang, M.; Zhu, Y.; Gole, T.W. Extreme Cold Events Threaten Arabica Coffee in Yunnan, China. npj Nat. Hazards 2025, 2, 32. Available online: https://www.nature.com/articles/s44304-025-00092-5 (accessed on 16 October 2025). [CrossRef]
  6. Liu, X.; Tan, Y.; Dong, J.; Wu, J.; Wang, X.; Sun, Z. Assessing Habitat Selection Parameters of Coffea arabica Using BWM and BCM Methods Based on GIS. Sci. Rep. 2025, 15, 8. Available online: https://www.nature.com/articles/s41598-024-84073-0 (accessed on 16 October 2025).
  7. International Coffee Organization (ICO). Coffee in China; International Coffee Council, 115th Session, ICC-115-7; ICO: London, UK, 2015; Available online: https://www.ico.org/documents/cy2014-15/icc-115-7e-study-china.pdf (accessed on 16 October 2025).
  8. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-T.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP. Advances in Neural Information Processing Systems 33 (NeurIPS 2020). 2020. Available online: https://arxiv.org/abs/2005.11401 (accessed on 16 October 2025).
  9. Wintgens, J.N. (Ed.) Coffee: Growing, Processing, Sustainable Production, 2nd ed.; Wiley-VCH: Weinheim, Germany, 2009. [Google Scholar]
  10. Shuster, K.; Xu, J.; Komeili, M.; Smith, E.M.; Roller, S.; Boureau, Y.-L.; Weston, J. Retrieval Augmentation Reduces Hallucination in Conversation. Findings of the Association for Computational Linguistics: EMNLP 2021. pp. 3784–3803. 2021. Available online: https://aclanthology.org/2021.findings-emnlp.322 (accessed on 16 October 2025).
  11. FAO. Digital Technologies in Agriculture and Rural Areas: Status Report; FAO: Rome, Italy, 2019; Available online: https://www.fao.org/3/ca4985en/ca4985en.pdf (accessed on 16 October 2025).
  12. Clifford, M.N.; Willson, K.C. (Eds.) Coffee: Botany, Biochemistry and Production of Beans and Beverage; Croom Helm: London, UK, 1985. [Google Scholar]
  13. Liu, N.F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; Liang, P. Lost in the Middle: How Language Models Use Long Contexts. Trans. Assoc. Comput. Linguist. 2024, 12, 157–173. [Google Scholar] [CrossRef]
  14. Fan, W.; Ding, Y.; Ning, L.; Wang, S.; Li, H.; Yin, D.; Chua, T.S.; Li, Q. A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models. arXiv 2024, arXiv:2405.06211. Available online: https://arxiv.org/abs/2405.06211 (accessed on 3 November 2025). [CrossRef]
  15. Yu, H.; Gan, A.; Zhang, K.; Tong, S.; Liu, Q.; Liu, Z. Evaluation of Retrieval-Augmented Generation: A Survey. arXiv 2024, arXiv:2405.07437. Available online: https://arxiv.org/abs/2405.07437 (accessed on 3 November 2025).
  16. Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; Hajishirzi, H. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv 2023, arXiv:2310.11511. Available online: https://arxiv.org/abs/2310.11511 (accessed on 1 November 2025). [CrossRef]
  17. Vu, T.; Iyyer, M.; Wang, X.; Constant, N.; Wei, J.; Wei, J.; Tar, C.; Sung, Y.H.; Zhou, D.; Le, Q.; et al. FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation. Findings of the Association for Computational Linguistics: ACL 2024. pp. 13793–13817. 2024. Available online: https://aclanthology.org/2024.findings-acl.813.pdf (accessed on 3 November 2025).
  18. DeepSeek-AI. DeepSeek-V3 Technical Report. 2024. Available online: https://arxiv.org/abs/2412.19437 (accessed on 16 October 2025).
  19. LMSYS. Overview—Chatbot Arena Leaderboard. 2025. Available online: https://lmsys.org/blog/2023-05-03-arena/ (accessed on 16 October 2025).
  20. DeepSeek-AI. DeepSeek-V3.1 (Model Card). 2025. Available online: https://huggingface.co/deepseek-ai/DeepSeek-V3.1 (accessed on 16 October 2025).
  21. LMSYS. Introducing the Hard Prompts Category in Chatbot Arena. 2024. Available online: https://lmsys.org/blog/2024-05-17-category-hard/ (accessed on 16 October 2025).
  22. Microsoft. Relevance Scoring in Hybrid Search Using Reciprocal Rank Fusion (RRF). Microsoft Learn Documentation, Updated 28 September 2025. Available online: https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking (accessed on 16 October 2025).
  23. Elastic. Reciprocal Rank Fusion—Elasticsearch Reference. Available online: https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion (accessed on 16 October 2025).
  24. Microsoft. Hybrid Search (BM25+Vector) Overview. Microsoft Learn Documentation. Available online: https://learn.microsoft.com/en-us/azure/search/hybrid-search-overview (accessed on 16 October 2025).
  25. Sarthi, P.; Abdullah, S.; Tuli, A.; Khanna, S.; Goldie, A.; Manning, C.D. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. arXiv 2024, arXiv:2401.18059. Available online: https://arxiv.org/abs/2401.18059 (accessed on 3 November 2025). [CrossRef]
  26. Cormack, G.V.; Clarke, C.L.A.; Büttcher, S. Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’09), Boston, MA, USA, 19–23 July 2009; ACM: New York, NY, USA, 2009; pp. 758–759. [Google Scholar] [CrossRef]
  27. Bruch, S.; Gai, S.; Ingber, A. An Analysis of Fusion Functions for Hybrid Retrieval. ACM Trans. Inf. Syst. 2023, 42, 20. [Google Scholar] [CrossRef]
  28. Bendersky, M.; Zhuang, H.; Ma, J.; Han, S.; Hall, K.; McDonald, R. Meeting the TREC-COVID Challenge with a 100+ Runs: Precision Medicine, Search, and Beyond. arXiv 2020, arXiv:2010.00200. Available online: https://arxiv.org/pdf/2010.00200 (accessed on 2 November 2025).
  29. Thakur, N.; Reimers, N.; Rücklé, A.; Srivastava, A.; Gurevych, I. BEIR: A Heterogeneous Benchmark for Zero-Shot Evaluation of Information Retrieval Models. arXiv 2021, arXiv:2104.08663. Available online: https://arxiv.org/abs/2104.08663 (accessed on 16 October 2025). [CrossRef]
  30. Xiao, C.; Yang, D.; Gong, Y.; Li, H.; Zhang, M.; Wang, J.; Chen, L.; Zhao, K.; Liu, Y.; Hu, S. C-Pack: Packaged Reranking for Retrieval. arXiv 2023, arXiv:2312.07597. Available online: https://arxiv.org/abs/2312.07597 (accessed on 16 October 2025).
  31. Nogueira, R.; Cho, K. Passage Re-ranking with BERT. arXiv 2019, arXiv:1901.04085. Available online: https://arxiv.org/abs/1901.04085 (accessed on 16 October 2025).
  32. Nogueira, R.; Yang, W.; Lin, J.; Cho, K. Multi-Stage Document Ranking with BERT. arXiv 2019, arXiv:1910.14424. Available online: https://arxiv.org/abs/1910.14424 (accessed on 16 October 2025). [CrossRef]
  33. Tseng, Y.-M.; Chen, W.-L.; Chen, C.-C.; Chen, H.H. Evaluating Large Language Models as Expert Annotators. arXiv 2025, arXiv:2508.07827. Available online: https://arxiv.org/abs/2508.07827 (accessed on 16 October 2025). [CrossRef]
  34. Zhang, R.; Li, Y.; Ma, Y.; Zhou, M.; Zou, L. LLMaAA: Making Large Language Models as Active Annotators. In Findings of the Association for Computational Linguistics: EMNLP 2023. Available online: https://aclanthology.org/2023.findings-emnlp.872/ (accessed on 16 October 2025).
  35. Karim, M.M.; Khan, S.; Van, D.H.; Liu, X.; Wang, C.; Qu, Q. Transforming Data Annotation with AI Agents: A Review of Trends and Best Practices. Future Internet 2025, 17, 353. [Google Scholar] [CrossRef]
  36. Kazemi, A.; Natarajan Kalaivendan, S.B.; Wagner, J.; Qadeer, H.; Davis, B. Synthetic vs. Gold: The Role of LLM-Generated Labels and Data in Cyberbullying Detection. arXiv 2025, arXiv:2502.15860. Available online: https://arxiv.org/abs/2502.15860 (accessed on 16 October 2025).
  37. Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180. [Google Scholar] [CrossRef] [PubMed]
  38. Khattab, O.; Zaharia, M. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020), Xi’an, China, 25–30 July 2020; ACM: New York, NY, USA, 2020; pp. 39–48. [Google Scholar] [CrossRef]
  39. Shumailov, I.; Shumaylov, Z.; Zhao, Y.; Gal, Y.; Papernot, N.; Anderson, R. The Curse of Recursion: Training on Generated Data Makes Models Forget. arXiv 2023, arXiv:2305.17493. Available online: https://arxiv.org/abs/2305.17493 (accessed on 4 November 2025).
  40. Thakur, N.; Pradeep, R.; Upadhyay, S.; Campos, D.; Craswell, N.; Lin, J. Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges. arXiv 2025, arXiv:2504.15205. Available online: https://www.researchgate.net/publication/390991280 (accessed on 4 November 2025). [CrossRef]
  41. Krippendorff, K. Reliability in Content Analysis: Some Common Misconceptions and Recommendations. Hum. Commun. Res. 2004, 30, 411–433. Available online: https://academic.oup.com/hcr/article/30/3/411/4331534 (accessed on 4 November 2025). [CrossRef]
  42. Dalton, J.; Xiong, C.; Callan, J. TREC CAsT 2019: The Conversational Assistance Track Overview. arXiv 2020, arXiv:2003.13624. Available online: https://arxiv.org/abs/2003.13624 (accessed on 1 November 2025). [CrossRef]
  43. Qian, H.; Xie, Y.; He, C.; Lin, J.; Ma, J. Explicit Query Rewriting for Conversational Dense Retrieval. In Proceedings of EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; ACL: Abu Dhabi, United Arab Emirates, 2022; pp. 10124–10136. Available online: https://aclanthology.org/2022.emnlp-main.311.pdf (accessed on 1 November 2025).
  44. Elgohary, A.; Peskov, D.; Boyd-Graber, J. Can You Unpack That? Learning to Rewrite Questions in Context. In Proceedings of the EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019; ACL: Hong Kong, China, 2019; pp. 5717–5726. Available online: https://aclanthology.org/D19-1605/ (accessed on 1 November 2025).
Figure 1. System architecture of the evidence-centric RAG pipeline. The offline stage performs normalization and semantic-aware chunking with stable identifiers [docid#cid]; the online stage conducts hybrid retrieval, RRF fusion, cross-encoder reranking, and generation with inline citations.
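The fusion step shown in Figure 1 merges the sparse (BM25) and dense candidate lists by reciprocal rank fusion before cross-encoder reranking. The minimal sketch below illustrates RRF over ranked lists of chunk identifiers in the [docid#cid] format; the function name and the constant k = 60 (the conventional RRF default) are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of reciprocal rank fusion (RRF) over two ranked lists of
# chunk identifiers ("docid#cid"), applied between hybrid retrieval and
# cross-encoder reranking. The constant k = 60 is the conventional default.
from collections import defaultdict

def rrf_fuse(bm25_ranked: list[str], dense_ranked: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked lists of chunk IDs with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in (bm25_ranked, dense_ranked):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Higher fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

# Example: candidates from sparse and dense retrieval are merged before reranking.
fused = rrf_fuse(["e76faa05#2", "e76faa05#1"], ["e76faa05#3", "e76faa05#2"])
print(fused)  # "e76faa05#2" ranks first because it appears near the top of both lists
```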
Figure 2. Overall citation-F1 on the Yunnan agronomy QA set: comparison between the Simple RAG baseline and Our Model (mean ± SD over three runs).
Figure 3. Bootstrap distribution of document-level macro-F1 for Our Model. Point estimate 0.813; 95% CI [0.781, 0.843]; 346 items; 3 runs; 5000 replicates.
Figure 4. Bootstrap distribution of document-level macro-F1 for the Simple RAG baseline. Point estimate 0.573; 95% CI [0.529, 0.617]; 346 items; 3 runs; 5000 replicates.
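The confidence intervals in Figures 3 and 4 come from a percentile bootstrap over per-item scores. A minimal sketch, assuming the per-item F1 values are already computed and using 5000 resamples as stated in the captions:

```python
# Percentile bootstrap for the macro-F1 point estimate and its 95% CI, as in
# Figures 3 and 4: resample the per-item F1 scores with replacement 5000 times
# and take the 2.5th/97.5th percentiles of the resampled means.
import numpy as np

def bootstrap_macro_f1(per_item_f1, n_boot: int = 5000, seed: int = 0):
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_item_f1, dtype=float)
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    point = scores.mean()
    lo, hi = np.percentile(means, [2.5, 97.5])
    return point, (lo, hi)

# Usage (with the 346-item score vector): point, ci = bootstrap_macro_f1(f1_per_item)
```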
Figure 5. Human evaluation label proportions comparing the Simple RAG baseline and Our Model. Bars show the proportions of labels (a) consensus, (b) incorrect content, (c) omission, and (d) potential bias. Proportions are computed over 1038 records per system (346 questions × 3 runs), and a record may receive multiple labels.
Figure 6. Error composition on the Yunnan agronomy QA set (averaged over three runs). WrongChunk: 46.1%; MissingEvidence: 34.4%; WrongDoc: 19.5%; NoCitation: 0.04%.
Figure 7. End-to-end total latency by percentiles over three independent runs. p50 ≈ 6.27 s (median), p90 ≈ 12.55 s, and p95 ≈ 16.28 s indicate a moderate median with a generation-driven tail.
Figure 8. Component-wise latency share (average over three runs). Modules: BM25 (sparse retrieval), Dense (vector retrieval), RRF (rank fusion), Reranker (cross-encoder), LLM (generation), and Other (prompt construction, I/O, framework overhead). The LLM dominates (∼91.5%), while retrieval and reranking contribute small, stable shares.
Figure 9. Component-wise latency share (zoomed) for small contributors. Shares of the reranker (0.73%), BM25 (0.48%), dense retrieval (0.42%), and RRF (near zero) are magnified for readability.
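The latency summaries behind Figures 7–9 reduce per-query stage timings to end-to-end percentiles and average component shares. A minimal sketch, under the assumption that each query record stores per-stage durations in seconds (the field names are illustrative):

```python
# Sketch of the latency summaries in Figures 7-9: per-query stage timings are
# aggregated into p50/p90/p95 of the end-to-end total and into average
# per-component shares of that total.
import numpy as np

STAGES = ["bm25", "dense", "rrf", "reranker", "llm", "other"]

def latency_summary(records: list[dict]):
    totals = np.array([sum(r[s] for s in STAGES) for r in records])
    p50, p90, p95 = np.percentile(totals, [50, 90, 95])
    shares = {
        s: float(np.mean([r[s] / sum(r[t] for t in STAGES) for r in records]))
        for s in STAGES
    }
    return {"p50": p50, "p90": p90, "p95": p95}, shares

# Example record (values illustrative):
# {"bm25": 0.03, "dense": 0.03, "rrf": 0.001, "reranker": 0.05, "llm": 6.0, "other": 0.4}
```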
Table 1. Concise comparison of LLM integration choices and trade-offs (selected literature: [13,14,15,16,17]).
| Architecture | Primary Strengths | Limitations/Risks |
|---|---|---|
| Closed-book (prompt-only) LLM | Minimal integration; fast iteration | Weak traceability; slow knowledge refresh; higher hallucination risk |
| Domain fine-tuned LLM | Encodes domain priors; stable task behavior | Data/compute cost; version drift; re-training required to update knowledge |
| Long-context (context stuffing) | Direct conditioning on source text; simple pipeline | Position sensitivity; latency/cost grow with window size [13] |
| Classic RAG (retrieve–rerank–generate) | Hot updates; evidence traceability; mature tooling | Depends on chunking/coverage; generator-dominated latency [14,15] |
| On-demand/self-reflective RAG | Adaptive retrieval; better factuality with retrieval economy | Added control complexity; extra prompt budget [16] |
| Tool/function-calling pipeline | Deterministic access to DBs/APIs; strong guardrails | Integration/maintenance overhead; limited unstructured reasoning |
| KG-augmented LLM | Structural consistency; query interpretability | KG construction/curation cost; coverage/freshness gaps |
| Freshness-aware/online-search RAG | Up-to-date knowledge; resilience to world drift | Source volatility; caching/compliance; variable latency [17] |
Table 2. Comparison of DeepSeek-V2, Qwen2.5, LLaMA-3.1, and DeepSeek-V3 across benchmarks.
| Benchmark (Metric) | DeepSeek-V2 Base | Qwen2.5 72B | LLaMA-3.1 405B Base | DeepSeek-V3 Base |
|---|---|---|---|---|
| Architecture | MoE | Dense | Dense | MoE |
| Activated Params | 21B | 72B | 405B | 37B |
| Total Params | 236B | 72B | 405B | 671B |
| English | | | | |
| Pile-test (BPB) | 0.606 | 0.638 | 0.542 | 0.548 |
| BBH (EM, 3-shot) | 78.8 | 79.8 | 82.9 | 87.5 |
| MMLU (EM, 5-shot) | 78.4 | 85.0 | 84.4 | 87.1 |
| MMLU-Redux (EM, 5-shot) | 75.6 | 83.2 | 81.3 | 86.2 |
| MMLU-Pro (EM, 5-shot) | 51.4 | 58.3 | 52.8 | 64.4 |
| DROP (F1, 3-shot) | 80.4 | 80.6 | 86.0 | 89.0 |
| ARC-Easy (EM, 25-shot) | 97.6 | 98.4 | 98.4 | 98.9 |
| ARC-Challenge (EM, 25-shot) | 92.2 | 94.5 | 95.3 | 95.3 |
| HellaSwag (EM, 10-shot) | 87.1 | 84.8 | 89.2 | 88.9 |
| PIQA (EM, 0-shot) | 83.9 | 82.6 | 85.9 | 84.7 |
| WinoGrande (EM, 5-shot) | 86.3 | 82.3 | 85.2 | 84.9 |
| RACE-Middle (EM, 5-shot) | 73.1 | 68.1 | 74.2 | 67.1 |
| RACE-High (EM, 5-shot) | 52.6 | 50.3 | 56.8 | 51.3 |
| TriviaQA (EM, 5-shot) | 80.0 | 71.9 | 82.7 | 82.9 |
| NaturalQuestions (EM, 5-shot) | 38.6 | 33.2 | 41.5 | 40.0 |
| AGIEval (EM, 0-shot) | 57.5 | 75.8 | 60.6 | 79.6 |
| Code | | | | |
| HumanEval (Pass@1, 0-shot) | 43.3 | 53.0 | 54.9 | 65.2 |
| MBPP (Pass@1, 3-shot) | 65.0 | 72.6 | 68.4 | 75.4 |
| LiveCodeBench-Base (Pass@1, 3-shot) | 11.6 | 12.9 | 15.5 | 19.4 |
| CRUXEval-I (EM, 2-shot) | 52.5 | 59.1 | 58.5 | 67.3 |
| CRUXEval-O (EM, 2-shot) | 49.8 | 59.9 | 59.9 | 69.8 |
| Math | | | | |
| GSM8K (EM, 8-shot) | 81.6 | 88.3 | 83.5 | 89.3 |
| MATH (EM, 4-shot) | 43.4 | 54.4 | 49.0 | 61.6 |
| MGSM (EM, 8-shot) | 63.6 | 76.2 | 69.9 | 79.8 |
| CMath (EM, 3-shot) | 78.7 | 84.5 | 77.3 | 90.7 |
| Chinese | | | | |
| CLUEWSC (EM, 5-shot) | 82.0 | 82.5 | 83.0 | 82.7 |
| C-Eval (EM, 5-shot) | 81.4 | 89.2 | 72.5 | 90.1 |
| CMMLU (EM, 5-shot) | 84.0 | 89.5 | 73.7 | 88.8 |
| CMRC (EM, 1-shot) | 77.4 | 75.8 | 76.0 | 76.3 |
| C3 (EM, 0-shot) | 77.4 | 76.7 | 79.7 | 78.6 |
| CCPM (EM, 0-shot) | 93.0 | 88.5 | 78.6 | 92.0 |
| Multilingual | | | | |
| MMMLU-non-English (EM, 5-shot) | 64.0 | 74.8 | 73.8 | 79.4 |
Table 3. Text segments with stable_id, position, token count, and preview.
| stable_id | Position | Tokens | Preview |
|---|---|---|---|
| e76faa05#1 | 0 | 12 | Arabica coffee originated in Ethiopia’s high-mountain forests. After entering commercial cultivation, full-sun systems were adopted to pursue high yields. With advances in agronomy and greater attention to plant health, many producing countries began to value shaded cultivation. Whether shade is needed mainly depends on latitude, elevation, and local climate. Coffee is a high-yield, high-nutrient-demand crop; shaded cultivation is lower-cost, lower-risk, and offers good returns, whereas full-sun cultivation requires strong water and fertilizer inputs; if these are insufficient, plants may over-flower and over-fruit, leading to dieback before scaffold branches are established. |
| e76faa05#2 | 1 | 14 | Under global climate change, countries such as Brazil, Indonesia, Vietnam, and India have experienced disease outbreaks and yield loss after drought followed by heavy rainfall. Traditional non-rust-resistant cultivars widely planted in producing countries (e.g., Bourbon, Typica, Caturra, Catuai) are prone to severe coffee leaf-rust epidemics, seriously affecting production. Accordingly, some countries have emphasized shaded cultivation to create a more suitable micro-environment and stabilize yields. Examples with shade include India, Ethiopia, Colombia, Costa Rica, Mexico, Kenya, and Madagascar; regions commonly without shade include Brazil, Venezuela, Hawaii (USA), Indonesia’s Bali and Sumatra, as well as Malaysia and Uganda. |
| e76faa05#3 | 2 | 9 | Because Brazil predominantly grows coffee without shade, Catimor has historically seen limited planting due to concerns about excessive yield, over-fruiting, and early senescence. Recently, Brazil has paid greater attention to shade: studies in the south indicate that, under agroforestry systems, moderate shade can improve production efficiency. In India, coffee is almost universally shaded, often using a two-tier system: short-term lower shade trees intercropped with permanent upper shade trees, creating a shaded yet growth-friendly environment. |
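The stable_id values in Table 3 follow the [docid#cid] format used throughout the pipeline. The sketch below shows one plausible way to assign such identifiers, assuming the document ID is a short content-hash prefix and the chunk ID a 1-based index over semantically segmented chunks; the hashing scheme and the character-count stand-in for token counting are assumptions, not the released corpus-construction code.

```python
# Sketch of stable chunk identifiers in the "docid#cid" format of Table 3:
# a short hash prefix of the source document plus a 1-based chunk index.
import hashlib

def doc_id(text: str, n_hex: int = 8) -> str:
    """Derive a short, stable document ID from the document content (assumed scheme)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:n_hex]

def assign_stable_ids(doc_text: str, chunks: list[str]) -> list[dict]:
    did = doc_id(doc_text)
    return [
        {
            "stable_id": f"{did}#{i}",      # e.g., "e76faa05#1"
            "position": i - 1,              # 0-based position within the document
            "tokens": len(chunk),           # character count as a simple stand-in for a tokenizer
            "preview": chunk[:80],          # truncated preview for inspection
        }
        for i, chunk in enumerate(chunks, start=1)
    ]
```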
Table 4. Ablation results on the Yunnan agronomy QA set. Metrics are citation-based F1 (per-sample macro); precision and recall are shown for completeness. Values are means across runs; the Simple RAG baseline is reported for reference.
| Setting | Precision | Recall | F1 |
|---|---|---|---|
| Our Model (Full RAG) | 0.768 | 0.918 | 0.813 |
| B1: No Reranker | 0.706 | 0.816 | 0.739 |
| B2a: Dense-only | 0.638 | 0.793 | 0.685 |
| B2b: BM25-only | 0.765 | 0.908 | 0.806 |
| B3: Embed Fallback | 0.759 | 0.887 | 0.797 |
| B4: Replace Reranker | 0.638 | 0.773 | 0.679 |
| B5: No Query Rewrite and History (No HAR and IHI) | 0.763 | 0.912 | 0.808 |
| Simple RAG Baseline | 0.534 | 0.665 | 0.573 |
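The metric reported in Table 4 is citation-based and computed per sample: the set of citations in an answer is compared with the gold citation set for that question, and precision, recall, and F1 are then macro-averaged over questions. A minimal sketch over [docid#cid] strings; the handling of empty citation sets is an assumption:

```python
# Sketch of citation-level per-sample macro precision/recall/F1:
# each sample contributes one (P, R, F1) triple, which is then averaged.
def per_sample_prf(pred: set[str], gold: set[str]) -> tuple[float, float, float]:
    if not pred and not gold:
        return 1.0, 1.0, 1.0          # assumed convention for empty sets
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

def macro_scores(samples: list[tuple[set[str], set[str]]]) -> tuple[float, float, float]:
    """samples: list of (predicted_citations, gold_citations) per question."""
    triples = [per_sample_prf(pred, gold) for pred, gold in samples]
    n = len(triples)
    return tuple(sum(t[i] for t in triples) / n for i in range(3))
```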
Table 5. Comparison of chunking strategies on the Yunnan agronomy QA set. Sliding-window parameters: window = 500 characters, overlap = 100 characters; sentence-boundary snapping enabled (lookback = 80 characters). Metrics are citation-based F1 (per-sample macro); precision and recall are shown for completeness. Values are means across runs.
| Setting | Precision | Recall | F1 |
|---|---|---|---|
| Sliding Window (Fixed-length, Overlapped) | 0.750 | 0.847 | 0.780 |
| Our Model (Full RAG) | 0.785 | 0.893 | 0.817 |
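The sliding-window baseline in Table 5 cuts fixed 500-character windows with 100-character overlap and snaps each cut point back to the nearest sentence boundary within an 80-character lookback. A minimal sketch of such a chunker; the sentence-boundary character set is an assumption:

```python
# Sketch of the sliding-window baseline (Table 5): 500-character windows,
# 100-character overlap, sentence-boundary snapping with an 80-character lookback.
SENT_ENDS = "。！？!?."  # assumed sentence-ending characters (Chinese and Latin)

def sliding_window_chunks(text: str, window: int = 500, overlap: int = 100,
                          lookback: int = 80) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        end = min(start + window, len(text))
        if end < len(text):
            # Snap the cut point back to the last sentence end within `lookback` characters.
            snap = max(text.rfind(ch, end - lookback, end) for ch in SENT_ENDS)
            if snap != -1:
                end = snap + 1
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = end - overlap  # overlapped windows
    return chunks
```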
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
