1. Introduction
With the increasing demand for precision and intelligence in agricultural production, the agricultural sector faces increasingly complex challenges. As an important grain crop, corn requires cultivation management spanning professional knowledge of variety selection, cultivation techniques, and pest control, and agricultural practitioners' demand for relevant technical support continues to grow. Traditional information channels such as books, technical manuals, and Internet searches are widely used, but in a rapidly changing and specialized production environment they often suffer from inefficient information acquisition, insufficiently professional content, and a lack of personalized services, making it difficult to meet actual needs.
In recent years, large language models (LLMs) have made significant breakthroughs in the field of Natural Language Processing (NLP). Initially, BERT introduced a bidirectional Transformer architecture [
1], which effectively improved the model’s ability to understand semantics [
2]. Subsequently, the GPT series significantly optimized the coherence and creativity of text generation by adopting an autoregressive generation mechanism [
3]. Afterwards, the T5 model further innovatively formulated multiple natural language tasks as a “text to text” conversion problem, thereby enhancing the model’s task generalization ability [
4]. On this basis, large-scale models such as PaLM have significantly expanded their parameter scale, further improving their performance in context understanding and complex reasoning tasks [
5]. Researchers have begun to explore the application of LLMs in vertical fields: Yang et al. [
6] proposed a medical question-answering (MedQ&A) method that exploits the inherent medical knowledge and reasoning ability of the Llama large language model to improve classification performance for medical question answering under zero-shot learning, providing a new method and practical foundation for applying LLMs in vertical fields. Md. Salim et al. [
7] developed a web application called LLM Q&A Builder, which integrates data preparation and model fine-tuning so that both technical and non-technical users can easily build internal Q&A systems for their organizations, providing methodological and tool support for LLMs in fields such as enterprise information retrieval.
Although LLMs’ language comprehension and generation capabilities provide new opportunities for agricultural technology support, their application in the agricultural field remains limited. On the one hand, LLMs infer answers from their training data, and this closed knowledge system makes it difficult to ensure that information is current and authoritative, leading to lagging updates and incomplete coverage; on the other hand, general LLMs lack professional training on agricultural knowledge and are prone to “hallucinations” [
8], among which factual errors, hollow content, and logical confusion are the most typical problems, limiting their practicality in agricultural intelligent question answering.
To overcome the problems of knowledge update lag and an insufficient professional Q&A ability in LLMs, retrieval-augmented generation (RAG) [
9,
10] methods have been widely applied. RAG combines information retrieval and content generation technologies to achieve fast and efficient retrieval from massive amounts of information [
11,
12]. Importantly, RAG has been increasingly recognized as a fundamental baseline to beat in knowledge-intensive tasks [
13]. The core of this technique is to retrieve the text chunks closest to the user query from a relevant document knowledge base and then combine the retrieved chunks with the input prompt [
14] into a context that is fed to the large language model acting as the generation module, thereby enabling the model to reference the latest authoritative knowledge and improving the professionalism and credibility of its answers [
15]. At present, this method has been validated for effectiveness in knowledge-intensive fields such as healthcare, agriculture, cybersecurity, and food. Zakka et al. [
16] developed the Almanac framework, applying RAG to clinical decision support. By integrating medical guidelines, the framework significantly improves the accuracy and completeness of generated answers in clinical settings. However, it still has limitations in handling queries whose answers cannot be extracted directly from the guidelines, suggesting the need for cautious deployment and risk mitigation in practice. Malali et al. [
17] studied the application of RAG to financial document processing, proposed using RAG to automate compliance and regulatory reporting, and verified its advantages in improving data accuracy and decision quality. That work also noted RAG's shortcomings: imperfect result accuracy, dependence on the quality of external data, and limited transparency in the retrieval and generation process. Su et al. [
18] proposed parameterized RAG, which integrates external knowledge into LLMs through document parameterization to address the limitations of contextual knowledge-enhancement methods. Experiments show that parameterized RAG significantly improves the effectiveness of LLM knowledge enhancement, but the parameterization process is computationally expensive, and the parameterized representation of each document is far larger than the plain text, limiting scalability. Ding et al. [
19] proposed the RealGen framework, which addresses the difficulty traditional methods have in generating unseen scenes by retrieving existing traffic-scene examples and generating new scenes conditioned on them. Its limitation is that the retrieved examples and generative models may not fully capture complex behavioral contexts, and the retrieved and generated input features can be insufficient. Wang et al. [
20] proposed MADAM-RAG, a multi-turn question-answering framework based on retrieval-augmented generation (RAG), which combines retrieved multi-document evidence with LLMs and introduces self-reflection and critical-evaluation mechanisms to handle potentially misleading information robustly. The method improves answer accuracy, but its generation performance depends heavily on the quality and completeness of the retrieved documents; when the documents contain misleading information or are unevenly distributed, model performance may suffer. Patrice Béchard [
21] proposed a RAG-based system that generates structured outputs for enterprise workflows from natural-language requirements, addressing the problem of “hallucinations” in generative artificial intelligence. Although combining external retrieved information reduces hallucinations and improves the generalization ability of LLMs, limitations remain, including incomplete elimination of hallucinations, dependence on post-processing, and insufficient collaboration between the retriever and the LLM. Shao et al. [
22] proposed the ITER-RETGEN method for retrieval-augmented LLMs, which uses the model's initial response to a task input as context to guide retrieval of more relevant knowledge for the next round of generation. The method improves retrieval-generation performance but remains limited in prompt optimization and does not cover long-text generation tasks. Ovadia et al. [
23] compared the effectiveness of unsupervised fine-tuning and RAG on knowledge-intensive tasks and found that RAG outperforms unsupervised fine-tuning on both knowledge seen during training and new knowledge. They also pointed out that LLMs struggle to learn new factual information through unsupervised fine-tuning, and that exposing multiple variants of the same fact can alleviate this problem, demonstrating the advantage of RAG for expanding and updating LLM knowledge. Lameck et al. [
24] conducted a systematic review of RAG-based large language models in the medical field, dividing RAG methods into three paradigms (naive, advanced, and modular) and summarizing evaluation frameworks and metrics. This review provides a reference for understanding the advantages, limitations, and research gaps of RAG in vertical fields.
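The retrieve-then-generate loop at the core of RAG, described earlier in this section, can be sketched as follows. This is a minimal illustration only: a word-overlap (Jaccard) retriever stands in for the neural embedding model a real system would use, and the example chunks, query, and function names are all hypothetical.

```python
import re

def tokens(text):
    # Crude lexical tokenizer; a real RAG system would embed text
    # with a sentence-embedding model rather than compare words.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a, b):
    # Jaccard overlap between token sets, a stand-in for cosine
    # similarity between dense embedding vectors.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve(query, chunks, top_k=2):
    # Rank knowledge-base chunks by similarity to the user query.
    q = tokens(query)
    ranked = sorted(chunks, key=lambda c: similarity(q, tokens(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, context_chunks):
    # Fuse the retrieved chunks with the user query into one prompt,
    # which would then be passed to the generator LLM.
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

chunks = [
    "Corn should be sown when soil temperature stays above 10 degrees Celsius.",
    "Common corn pests include the corn borer and the armyworm.",
    "Nitrogen fertilizer is typically applied at the jointing stage.",
]
query = "When should corn be sown?"
prompt = build_prompt(query, retrieve(query, chunks))
```

The prompt produced this way grounds the generator in retrieved text, which is the mechanism by which RAG mitigates outdated or hallucinated answers.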
Despite its broad prospects, RAG technology still faces many challenges. Although the retrieval module greatly improves retrieval efficiency and accuracy, its effectiveness still needs improvement for fuzzy queries and niche knowledge domains [
25]. After retrieval, it must also be considered whether the retrieved text fits within the generator's context length and how the retrieved information should be integrated into the answer, so that the retriever and generator remain aligned; the computational cost of the models used in RAG is a further concern. In the specific field of corn cultivation, the knowledge base used for retrieval is large, and the semantic connections between pieces of contextual information are close. In line with recent advances in graph-oriented medical RAG that leverage structured, community-aware designs to mitigate hallucinations [
26], our approach further emphasizes community-level knowledge structuring to enhance reliability and reduce the risk of misleading responses. Traditional RAG retrieval methods struggle to fully express the hierarchical and logical relationships of agricultural knowledge, which limits the knowledge-reasoning ability of large language models in agricultural question answering.
In this study, we aim to address the following research question: how can we enhance retrieval accuracy and reduce hallucinations in large-language-model-based question-answering systems for knowledge-intensive agricultural domains such as corn cultivation? To extract as much rich semantic information as possible from external knowledge documents for corn-planting question-answering tasks, we propose Sem-RAG, a novel dual-store semantic retrieval framework that combines chunk-level embeddings with community-level thematic summaries to improve both semantic coverage and answer reliability.
Distinct from standard NaiveRAG, which only retrieves fixed-length text chunks, and GraphRAG, which mainly relies on graph connections, Sem-RAG introduces a dual-store retrieval design. It builds both a surface semantic store that preserves chunk-level embeddings and a fine-grained semantic store constructed from Leiden-based community summaries. The latter not only compresses text but also provides thematic-level semantic aggregation across chunks, which substantially improves answer quality by enhancing knowledge associations and reducing noise. By jointly retrieving from both stores, Sem-RAG achieves a balance between local semantic detail and higher-level contextual reasoning.
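A minimal sketch may clarify the dual-store lookup. The stores, identifiers, and word-overlap similarity below are illustrative stand-ins (a deployed system would use dense embeddings and LLM-generated community summaries); only the two-stage chunk-to-community flow reflects the design described above.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a, b):
    # Jaccard word overlap as a stand-in for embedding similarity.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dual_retrieve(query, chunk_store, community_of, summary_store, top_k=1):
    # Stage 1: retrieve the most similar chunks from the surface
    # semantic store (chunk-level representations).
    q = tokens(query)
    ranked = sorted(chunk_store,
                    key=lambda cid: similarity(q, tokens(chunk_store[cid])),
                    reverse=True)
    hits = ranked[:top_k]
    # Stage 2: follow the chunk -> community mapping to fetch the
    # thematic summaries from the fine-grained semantic store.
    communities = sorted({community_of[cid] for cid in hits})
    return ([chunk_store[cid] for cid in hits],
            [summary_store[g] for g in communities])

# Hypothetical toy stores.
chunk_store = {
    "c1": "Corn should be sown when soil temperature stays above 10 degrees Celsius.",
    "c2": "Common corn pests include the corn borer and the armyworm.",
}
community_of = {"c1": "g1", "c2": "g2"}
summary_store = {
    "g1": "Sowing community: soil temperature thresholds and sowing windows.",
    "g2": "Pest community: major corn pests and their management.",
}
hits, summaries = dual_retrieve("When should corn be sown?",
                                chunk_store, community_of, summary_store)
```

Both the matched chunks and the associated community summaries would then be placed into the generator's context, supplying local detail and thematic background together.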
The main contributions of this paper are as follows:
We developed a knowledge question-answering method, Sem-RAG, for corn-planting question-answering tasks in the agricultural domain. The method is based on fine-grained semantic capture and enhances the knowledge-reasoning performance of large language models by combining surface semantics with context-related semantics.
We designed a graph topic module, Graph-Content, to capture and process context-related semantic association information from knowledge documents. The module extracts semantically associated triples from knowledge documents and then applies the Leiden algorithm to partition the triples into communities and generate thematic community summaries, which serve as context-related semantics.
We independently constructed a knowledge question-answering dataset, CornData, suitable for evaluating the Sem-RAG method, and conducted experiments that demonstrate its superior performance.
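To illustrate the Graph-Content idea from the contributions above, the sketch below builds an entity graph from hand-written example triples and groups it into communities. Connected components stand in for the Leiden algorithm (which additionally optimizes modularity within dense components), and the per-community "summary" simply concatenates member triples, whereas Sem-RAG would prompt an LLM to write a thematic summary; all triples and names are hypothetical.

```python
from collections import defaultdict

# Hypothetical triples extracted from corn-planting documents.
triples = [
    ("corn borer", "damages", "corn stalk"),
    ("corn stalk", "part of", "corn plant"),
    ("nitrogen", "applied at", "jointing stage"),
    ("jointing stage", "stage of", "corn growth"),
]

# Undirected entity graph: each triple links its head and tail entities.
graph = defaultdict(set)
for head, _, tail in triples:
    graph[head].add(tail)
    graph[tail].add(head)

def communities(graph):
    # Connected components via depth-first search; a stand-in for
    # Leiden, which would further split dense components.
    seen, comps = set(), []
    for start in graph:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(graph[node] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def summarize(comps, triples):
    # Toy thematic summary: join the triples whose entities fall
    # inside each community (Sem-RAG would ask an LLM instead).
    return ["; ".join(f"{h} {r} {t}" for h, r, t in triples
                      if h in comp and t in comp)
            for comp in comps]

comps = communities(graph)
summaries = summarize(comps, triples)
```

Here the four triples split into two communities (a pest-damage theme and a fertilization-stage theme), each yielding one thematic summary for the fine-grained semantic store.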
4. Conclusions
This study addresses the hallucination and accuracy problems of LLM- and RAG-based retrieval generation in knowledge-intensive scenarios, particularly agriculture. To enhance the accuracy and reliability of corn planting knowledge question answering, we propose a fine-grained semantic retrieval-augmented algorithm named Sem-RAG. The algorithm first divides professional documents on corn planting into fixed-length chunks; it then extracts semantically associated triples from each chunk; next, the triples are organized into communities with the Leiden algorithm, and thematic community summaries are generated; the original chunk texts and the thematic summaries are then vectorized separately to build a surface semantic vector knowledge base and a fine-grained semantic vector knowledge base for retrieval; at query time, the user query vector first retrieves similar chunks from the surface semantic knowledge base, and these chunks are used to locate the corresponding graph community summaries in the fine-grained semantic knowledge base; finally, the similar chunks and community summaries serve as semantic references for the LLM when generating the response.
The algorithm was evaluated on CornData for corn knowledge question answering. Its Answer-C, Answer-R, and CR scores reached 94.6%, 84.6%, and 70.4%, respectively, 2.6, 1.7, and 1.6 percentage points higher than those of the traditional NaiveRAG method. These results indicate that Sem-RAG effectively combines surface semantic information with contextual semantic association information, significantly improving performance on agricultural knowledge question-answering tasks.
While Sem-RAG demonstrates strong retrieval and reasoning capabilities in our evaluation, several limitations remain. First, performance may degrade when input data are noisy, ambiguous, or incomplete, since the semantic retrieval module relies on high-quality representations; noise-robust embedding strategies or adaptive confidence mechanisms could mitigate this. Second, although the method scales reasonably well to medium-sized corpora, computational cost grows with corpus size because of the semantic similarity computations required during retrieval; approximate nearest neighbor (ANN) search, vector compression, or hierarchical retrieval could improve scalability and efficiency. Finally, our current evaluation focuses on static datasets; extending Sem-RAG to dynamic or streaming contexts remains an open challenge.
In the future, we plan to investigate hardware deployment issues that may arise in practical applications. In addition, improving corn planting knowledge question-answering performance through multi-agent collaboration is an important direction. We also aim to extend purely text-based knowledge question answering to image and video modalities, leveraging technologies such as diffusion models and Sora, to further enhance model performance.