Article

A Systematic Evaluation of Large Language Models and Retrieval-Augmented Generation for the Task of Kazakh Question Answering

1 Big Data and Blockchain Technologies Research and Innovation Center, Astana IT University, Astana 020000, Kazakhstan
2 School of Artificial Intelligence and Data Science, Astana IT University, Astana 020000, Kazakhstan
3 Computer Science Department, Nazarbayev University, Astana 020000, Kazakhstan
4 Department of Computer Engineering, Faculty of Computer and Information Technologies, Istanbul University, 34320 Istanbul, Turkey
* Authors to whom correspondence should be addressed.
Information 2025, 16(11), 943; https://doi.org/10.3390/info16110943
Submission received: 15 September 2025 / Revised: 20 October 2025 / Accepted: 23 October 2025 / Published: 30 October 2025

Abstract

This paper presents a systematic evaluation of large language models (LLMs) and retrieval-augmented generation (RAG) approaches for question answering (QA) in the low-resource Kazakh language. We assess the performance of existing proprietary (GPT-4o, Gemini 2.5-flash) and open-source Kazakh-oriented models (KazLLM-8B, Sherkala-8B, Irbis-7B) across closed-book and RAG settings. Within a three-stage evaluation framework, we benchmark retriever quality, examine LLM abilities such as knowledge-gap detection, external truth integration, and context grounding, and measure gains from realistic end-to-end RAG pipelines. Our results show a clear pattern: proprietary models lead in closed-book QA, but RAG narrows the gap substantially. Under the Ideal RAG setting, KazLLM-8B improves from its closed-book baseline of 0.427 to an answer correctness of 0.867, closely matching GPT-4o’s score of 0.869. In the end-to-end RAG setup, KazLLM-8B paired with the Snowflake retriever achieved answer correctness of up to 0.754, surpassing GPT-4o’s best score of 0.632. Despite these improvements, RAG outcomes show an inconsistency: high retrieval metrics do not guarantee high QA system accuracy. The findings highlight the importance of retrievers and context-grounding strategies in enabling open-source Kazakh models to deliver competitive QA performance in a low-resource setting.

Graphical Abstract

1. Introduction

Large language models (LLMs) are a cutting-edge artificial intelligence (AI) technology that has dramatically advanced question-answering (QA) and many natural language processing (NLP) tasks in high-resource languages [1,2], yielding significant productivity gains and economic benefits [3]. However, these gains are unevenly distributed, as LLM development remains concentrated in a few dominant languages (such as English and Chinese), thereby risking a widening digital divide between resource-rich and low-resourced language communities [4].
Training state-of-the-art LLMs requires massive text corpora and significant computational resources, which are extremely limited for languages like Kazakh [5]. As a result, Kazakh, a Turkic language spoken by around 15 million native speakers, remains underrepresented in NLP research and resources [6]. The problem arises from both data scarcity and linguistic features (agglutinative morphology and complex suffixation), which leave general LLMs poorly equipped to generate grammatically correct and factually accurate answers in Kazakh [5].
In the last few years, there has been a trend toward closing this gap. Several notable open-source models such as Sherkala [7], KazLLM [8], AlemLLM [9], and the model introduced by Kadyrbek et al. [10] have been developed specifically for Kazakh. These models reflect a broader shift from general-purpose multilingual LLMs to language-specific or regionally adapted systems, aiming to better capture linguistic nuance and improve performance in real-world Kazakh applications. However, their performance on QA tasks remains largely underexplored, for several reasons. For instance, AlemLLM [9] has no accompanying publication detailing its architecture or training methodology, and its deployment requires substantial hardware resources—approximately eight NVIDIA A100 GPUs. Similarly, the model introduced by Kadyrbek et al. [10] is not openly released, and no public access or implementation link is provided, which limits reproducibility and broader community adoption.
Question answering refers to the task of automatically generating a response to a user question phrased in natural language. It focuses on delivering concise answers rather than retrieving full documents [11]. Depending on the nature of the reasoning required, QA systems are categorized as either open-book or closed-book QA [12]. In closed-book QA, the model must generate answers based solely on its internal knowledge, without access to any external documents or supporting context at inference time [13].
A major limitation of LLMs in a closed-book QA setting is their tendency to hallucinate [14]. In contrast, open-book QA provides the model with access to external knowledge sources at query time. A particular approach in this category is Retrieval-Augmented Generation (RAG), which couples an LLM with an external information search engine [15]. By retrieving relevant documents at query time, RAG can provide the language model with up-to-date, factual context, thereby increasing accuracy and reducing hallucinations [16]. RAG and related embedding-based retrieval methods have demonstrated significant improvements in the accuracy of English QA systems [17,18], but their effectiveness on low-resource languages like Kazakh remains unexplored.
The gaps described above motivate this research. The study aims to systematically evaluate how state-of-the-art LLMs perform on Kazakh QA, and whether augmenting them with retrieval can significantly improve their effectiveness. Thus, the study addresses the following research questions:
  • How do proprietary and open-source LLMs perform on closed-book QA tasks in Kazakh?
  • Does retrieval-augmented generation (RAG) improve QA accuracy over closed-book generation for Kazakh?
By addressing these research questions, this study contributes to the ongoing efforts to extend the advantages of large language models to low-resource languages. Bridging the QA gap for the Kazakh language represents not only a technical challenge but also a crucial step toward developing more inclusive and equitable AI systems, ensuring that speakers of under-resourced languages can equally benefit from advancements in QA technologies.
The remainder of this paper is organized as follows: Section 2 summarizes current studies on RAG for enhancing QA performance and on QA in the Kazakh language. Section 3 outlines the experimental design used to compare closed-book and retrieval-augmented QA approaches. Section 4 describes the models and datasets employed in the study. Section 5 presents the experimental results, followed by Section 6, which provides an error analysis. Section 7 is dedicated to a discussion of the findings. Section 8 outlines the study’s limitations. Finally, Section 9 concludes the study by answering the research questions and outlining prospects for further research.

2. Literature Review

2.1. Addressing Hallucination in LLMs Through Retrieval-Augmented Generation

Large language models have a well-documented tendency to hallucinate—that is, to produce fluent but factually incorrect information [14,18]. Research has highlighted that even advanced models can generate plausible-sounding answers that are not grounded, undermining user trust in AI systems [19,20]. This issue is especially problematic in question-answering tasks where factual accuracy is critical.
A potential solution is Retrieval-Augmented Generation (RAG), which integrates an external information retrieval component into the answer generation process [21]. Lewis et al. [15] introduced the RAG concept, demonstrating that coupling a neural retriever with a generative model significantly improves performance on knowledge-intensive QA tasks. RAG-enhanced QA systems reduce hallucinations by first retrieving relevant evidence from external knowledge sources. They then condition the language model’s response on this retrieved context, ensuring that answers are factually grounded rather than solely dependent on the model’s internal memory. Subsequent studies have reinforced these benefits: for instance, Patel et al. [22] showed that using dense embedding-based retrieval to supply relevant passages yields significant accuracy gains in multilingual QA, as the model’s answers can be directly grounded in the retrieved text. Pingua et al. [23] reported that on the MedQuAD benchmark, even after domain-specific fine-tuning, adding a retrieval layer drove the largest boosts in factual accuracy.
Fine-tuning and RAG represent complementary yet distinct strategies for integrating knowledge into large language models [23]. Fine-tuning embeds information directly into model parameters, offering strong task or domain adaptation but at high computational cost and limited flexibility. Once trained, the model’s knowledge remains static, requiring retraining to reflect updates [24]. In contrast, RAG decouples knowledge from model weights by incorporating an external retrieval component, allowing models to dynamically access and integrate up-to-date information at inference time. This design enables continuous knowledge refresh without retraining, providing a more scalable and cost-effective approach for maintaining factual accuracy [25].
RAG approaches have become standard in high-resource language QA systems. A range of pipelines, such as multi-agent RAG frameworks [26,27] and Graph-based retrieval [28,29], have delivered state-of-the-art results on English benchmarks by combining IR (information retrieval) with generation. These systems outperform even fine-tuned parametric LLMs on knowledge-specific questions, confirming that integrating external information is essential for reducing hallucinations.
Despite these achievements, most existing RAG research still reflects the limitations of traditional information extraction validation practices. A central and rarely questioned assumption persists that accurate retrieval automatically guarantees correct answers. This assumption shapes both evaluation and design strategies, similar to conventional extraction frameworks that measure output accuracy without examining reasoning quality. Consequently, system improvements are overwhelmingly retrieval-centric: expanding context windows or long-context architectures [30,31], refining chunking, re-ranking, and fine-tuning retrievers [32,33,34], or employing multi-agent and multi-perspective strategies [35,36]. In these approaches, the generator passively consumes retrieved text, similar to traditional information extraction systems that assume validity once the input is relevant.
The proposed method differentiates itself by shifting the analytical focus from retrieval accuracy to the reasoning dynamics of the generator component. Unlike traditional RAG or extraction-based evaluation, the proposed approach explicitly examines whether the generator can (1) abstain from answering when retrieved evidence is insufficient, (2) reconcile conflicting or ambiguous information, and (3) attribute knowledge correctly between retrieved content and parametric memory. This perspective extends beyond conventional information extraction validation. Such an approach enables a deeper understanding of how generation interacts with evidence, paving the way for RAG systems that are not only more accurate but also interpretable, robust, and reliable in real-world QA scenarios.

2.2. State of Question Answering in the Kazakh Language

Kazakh is a low-resource language for which QA technology is only beginning to mature. Previous efforts to build Kazakh QA systems leveraged multilingual NLP models and focused on extractive QA. For example, Shymbayev and Alimzhanov [37] constructed a Kazakh QA dataset and fine-tuned transformer models to retrieve answers from provided paragraphs. Their system achieved respectable accuracy on test questions, underlining that multilingual transformers like XLM-R or mBERT can serve as effective readers for Kazakh given the right fine-tuning data. In parallel, efforts have been made to tackle domain-specific QA. Mukanova et al. [38] developed a geographical QA system in Kazakh. They compiled a corpus of 50,000 Q&A pairs about Kazakhstan’s geography and trained a BERT-based model to answer these questions. The resulting system could answer geography questions with high fidelity (BLEU score ≈ 0.95). Tleubayeva and Shomanov [39] performed a comprehensive comparison of multilingual QA models adapted to Kazakh. They reported that a fine-tuned mT5 model achieved the highest accuracy (F1 ≈ 75.7%) on a Kazakh QA test set, slightly outperforming even GPT-4’s few-shot performance (F1 ≈ 73.3%) on the same questions. Their experiments also showed that Kazakh-RoBERTa improved with fine-tuning but still lagged behind the larger multilingual models, highlighting the need for more Kazakh-specific data and larger model capacity. Nugumanova et al. [40] similarly highlight the importance of leveraging powerful multilingual transformers for developing a Kazakh QA system. By adapting a T5 model to Kazakh, they achieved improved results over baseline models, confirming that pretrained models can transfer to Kazakh after fine-tuning. However, fine-tuning leads to narrow specialization that fails to generalize beyond the training distribution, while still demanding costly annotation efforts [41].
Further highlighting the challenges in Kazakh QA, Maxutov et al. [42] conducted a comparative evaluation of seven proprietary and open-source LLMs across multiple Kazakh NLP tasks, including question answering. Their findings show that the overall quality of LLMs on Kazakh tasks is significantly lower than on parallel English tasks. GPT-4 performed best, followed by Gemini and AYA, but even these models struggled with open-ended generative QA. The authors noted that LLMs generally performed better on classification tasks and often failed to generate accurate or complete answers in Kazakh, with Gemini showing a high rate of empty outputs.
Recent applied studies also highlight the effectiveness of multilingual embeddings and LLM-based approaches in related tasks. Nugumanova et al. [43] demonstrated that global embedding models (BGE-M3, LaBSE), combined with local signals, can successfully address the problem of sentiment analysis of transport complaints in a zero-shot setting and with lightweight fine-tuning. Although this work does not focus directly on QA or the Kazakh language, it underlines the value of incorporating retrieval and embedding components to improve factual accuracy in low-resource scenarios. Similarly, Rakhimzhanov et al. [44] confirmed that the embedding models E5 and BGE-M3, applied in multilingual low-resource settings (in this case, the classification of transport complaints), achieve accuracies of up to 90% even without task-specific training, while being significantly faster and more computationally efficient than LLM-based approaches.
In response to these challenges, two notable open-source LLMs such as ISSAI’s KazLLM [8] and MBZUAI’s Sherkala [7] were released recently. KazLLM, unveiled in late 2024 by the Institute of Smart Systems and AI (Nazarbayev University), is a large-scale model (available in 8B and 70B parameters) trained on a massive multilingual text corpus including 150B tokens of Kazakh, Russian, English, and Turkish [8]. Similarly, Koto et al. [7] released Sherkala-8B-Chat, a LLaMA-based 8-billion-parameter model instruction-tuned for Kazakh. Sherkala was trained on over 45 billion tokens with an emphasis on Kazakh and other regional languages, and it significantly outperforms previous open models on Kazakh language tasks. These developments are indicative of broader changes as the community begins to create LLMs specifically designed for the Kazakh language to better handle the nuances of Kazakh text.
AlemLLM [9], developed by Astana Hub, represents another effort to adapt LLMs for Kazakh. This multilingual model with 247B parameters covers Kazakh, Russian, Turkish, and English, but requires substantial computational resources for deployment (approximately eight NVIDIA A100 GPUs with at least 40 GB VRAM each). Similarly, Kadyrbek et al. [10] introduced a 1.94B LLaMA-based model trained on a cleaned Kazakh corpus using a custom tokenizer and Direct Preference Optimization alignment, but it is not publicly released, making the results non-reproducible.
The existing literature reveals a clear research gap that this study aims to address. While substantial evidence shows that Retrieval-Augmented Generation (RAG) reduces hallucinations and improves accuracy in high-resource QA systems, Kazakh QA models—despite progress through fine-tuning and new releases—still struggle with factual coverage and generative quality. Yet no prior work has explored applying RAG to Kazakh QA. This study therefore investigates how a retrieval-assisted approach can enhance Kazakh QA performance.

3. Methodology

This study conducts a systematic evaluation of both proprietary and open-source LLMs for question answering in Kazakh, a linguistically rich yet low-resource language. Two experimental configurations are investigated:
  • Closed-book QA—the LLM generates answers exclusively from its internal parametric knowledge, without access to external information sources.
  • Retrieval-Augmented Generation (RAG)—the LLM receives additional context passages retrieved from an external knowledge base, enabling it to ground responses in relevant, verifiable evidence.
Figure 1 presents the RAG architecture. The system proceeds in four stages. First, source documents D are ingested and preprocessed (cleaning, normalization, and chunking), then encoded with a sentence-embedding model; the resulting passage vectors are stored in a vector index. Second, at inference time the user query q is embedded in the same space and a retriever executes similarity search to select the top-k passages D_r ⊆ D that are most relevant. Third, a structured prompt is assembled that concatenates the original question with the retrieved evidence and explicitly instructs the generator to ground its response only in this context. Fourth, a generator (large language model) produces an answer a. The example in Figure 1 (“Домбыра деген не?”/“What is a dombra?”) illustrates the end-to-end process: relevant descriptions of the dombra are retrieved from the corpus and synthesized by the generator into a context-grounded definition.
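As an illustration of this four-stage flow, the sketch below builds a minimal pipeline with a LangChain + FAISS stack of the kind used in this study. The chunk size, embedding model, corpus file name, and prompt wording are illustrative assumptions rather than the exact project configuration, and LangChain import paths vary across library versions.

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Stage 1: ingest, chunk, and embed the source documents D, then index the vectors.
raw_text = open("kazakh_corpus.txt", encoding="utf-8").read()      # hypothetical corpus file
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_text(raw_text)
embedder = HuggingFaceEmbeddings(model_name="BAAI/bge-m3")
index = FAISS.from_texts(chunks, embedder)

# Stage 2: embed the query in the same space and retrieve the top-k passages D_r.
query = "Домбыра деген не?"  # "What is a dombra?"
retrieved = index.similarity_search(query, k=5)

# Stage 3: assemble a prompt that grounds the generator only in the retrieved evidence.
context = "\n".join(doc.page_content for doc in retrieved)
prompt = (
    "Тек берілген мәтінге сүйеніп жауап беріңіз.\n"  # "Answer only from the given context."
    f"Мәтін:\n{context}\n\nСұрақ: {query}\nЖауап:"
)

# Stage 4: a generator LLM (e.g., KazLLM-8B or GPT-4o) produces the answer a from
# `prompt`; the generation call is omitted because it depends on the chosen model API.
```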
Figure 2 illustrates the three-stage evaluation framework used to assess RAG approach against a closed-book baseline. At stage 1 we evaluate multiple sentence embedding models to identify the retriever configuration that maximizes recall. Stage 2 evaluates generators under the controlled conditions—(2.1) knowledge-gap detection, (2.2) external-truth integration under conflict, and (2.3) ideal-RAG impact with gold passages provided. Stage 3 runs an end-to-end comparison of the retriever-generator pairing and the closed-book baseline on the same question set.
Table 1 summarizes the common configuration settings applied to both retrieval and generation stages, ensuring consistent and reproducible evaluation conditions. All experiments were conducted in Google Colab Pro+, implemented in Python using the LangChain [45] framework with FAISS [46] as the vector store. Each experiment was run three times, and the mean ± standard deviation (SD) along with 95% confidence intervals (CIs) are reported.
All experimental data, codes and configurations are publicly available in the GitHub repository listed in the Data Availability Statement.

3.1. Retriever Evaluation

Before integrating the retrieval module into the RAG framework, the systematic comparative analysis of multiple candidate sentence-embedding models was completed to determine the most effective dense retriever for Kazakh-language passages.
To rigorously assess the effectiveness of the retriever, three complementary metrics were employed, designed to capture both the system’s ability to identify relevant results and the quality of their ranking:
1. Recall@k measures the proportion of queries for which at least one relevant passage appears among the top-k (k = 1, 3, 5, 10) retrieved results.
Formally, Recall@k is defined as:
Recall@k = (1/|Q|) Σ_{q∈Q} 1(rank(q) ≤ k),
where
  • Q is the set of all queries,
  • rank(q) is the rank of the first relevant document for query q,
  • 1(⋅) is the indicator function, equal to 1 if its argument is true and 0 otherwise.
A higher Recall@k indicates better retrieval coverage, ensuring the LLM receives sufficient contextual information.
2. Mean Reciprocal Rank (MRR) reflects the average ranking position of the first relevant passage across all queries, with higher weight given to top-ranked results:
MRR = (1/|Q|) Σ_{q∈Q} 1/rank(q),
where rank(q) is the position of the first relevant passage for query q. MRR complements Recall@k by penalizing cases where relevant passages are retrieved but ranked low, as these may be excluded from the LLM’s input window.
3. Cosine Similarity Gap evaluates the semantic separability between relevant (P⁺) and non-relevant (P⁻) passages in the embedding space:
Δ_cos = (1/|P⁺|) Σ_{p∈P⁺} cos(v_q, v_p) − (1/|P⁻|) Σ_{p∈P⁻} cos(v_q, v_p),
where
  • P⁺ and P⁻ denote the sets of relevant and non-relevant passages, respectively,
  • v_q and v_p are the embedding vectors of query q and passage p,
  • cos(u, v) = (u · v)/(‖u‖ ‖v‖) is the cosine similarity.
By combining these metrics, we can evaluate not only the retriever’s ability to return relevant content but also the efficiency and prominence with which such content appears in the ranked list—factors that directly influence downstream QA accuracy within the RAG framework.
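For reference, the following sketch shows how these three metrics can be computed from per-query retrieval results; the input format (the rank of the first relevant passage per query and toy similarity values) is an assumption made for illustration.

```python
import numpy as np

def recall_at_k(first_relevant_ranks, k):
    # Fraction of queries whose first relevant passage appears within the top-k.
    ranks = np.asarray(first_relevant_ranks, dtype=float)
    return float(np.mean(ranks <= k))

def mean_reciprocal_rank(first_relevant_ranks):
    # Average of 1 / rank(q) over all queries.
    ranks = np.asarray(first_relevant_ranks, dtype=float)
    return float(np.mean(1.0 / ranks))

def cosine_similarity_gap(sims_relevant, sims_non_relevant):
    # Mean similarity to relevant passages minus mean similarity to non-relevant ones.
    return float(np.mean(sims_relevant) - np.mean(sims_non_relevant))

# Toy example: ranks of the first relevant passage for five queries.
ranks = [1, 3, 2, 11, 1]
print(recall_at_k(ranks, k=10))        # 0.8 — one query misses the top-10
print(mean_reciprocal_rank(ranks))     # ≈ 0.585
print(cosine_similarity_gap([0.71, 0.65], [0.48, 0.52]))  # ≈ 0.18
```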

3.2. LLM Evaluation

The aim is to systematically evaluate the generator component’s core capabilities under controlled conditions before the end-to-end system comparison. Building on our prior work [12], we first assess fundamental abilities, including knowledge-gap detection (2.1) and external-truth integration under conflicting knowledge (2.2). The third experiment (2.3) specifically quantifies the context dependence of question answering by comparing (a) a zero-context baseline, where the generator answers from parametric knowledge alone, with (b) an ideal RAG condition using gold-standard passages that contain the correct answers.

3.2.1. Knowledge Gap Detection

The model’s capacity to recognize, and properly abstain from answering, questions that cannot be resolved from the provided context is evaluated. The evaluation employs a carefully constructed set of 200 unanswerable question–context pairs.
Models are explicitly instructed to respond with the abstention token “Жауап табылмады” (“No answer found”) when the question cannot be answered from the given context. To assess the impact of instruction design on model behavior, three prompt styles were evaluated in English and Kazakh, differing in directive tone and constraints on hallucination. The corresponding templates for each style are shown in Table 2.
Performance is assessed through two complementary metrics:
  • Hallucination Rate (HR) measures the proportion of unanswerable items for which the model produces a substantive response instead of abstaining.
  • Abstention Rate (AR) captures the proportion of items for which the model outputs the designated abstention token, thereby indicating correct recognition of the knowledge gap.
This experiment measures the model’s self-awareness in rejecting questions beyond its knowledge scope, establishing a baseline for subsequent retrieval-augmented experiments.
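A minimal scoring sketch for these two metrics is shown below; it assumes the strictest matching rule, in which only an output equal to the exact abstention token counts as abstention, whereas the study’s scoring may tolerate minor whitespace or casing differences.

```python
ABSTAIN_TOKEN = "Жауап табылмады"

def abstention_metrics(model_outputs):
    """model_outputs: list of model responses to unanswerable question-context pairs."""
    abstained = sum(1 for out in model_outputs if out.strip() == ABSTAIN_TOKEN)
    total = len(model_outputs)
    ar = abstained / total                 # correct recognition of the knowledge gap
    hr = (total - abstained) / total       # substantive answer produced instead of abstaining
    return {"AR": ar, "HR": hr}

# Toy example with three responses.
print(abstention_metrics(["Жауап табылмады", "Астана — Қазақстанның астанасы.", "Жауап табылмады"]))
# {'AR': 0.67, 'HR': 0.33} (approximately)
```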

3.2.2. External Truth Integration

The model’s capacity to prioritize external evidence over internal parametric knowledge in the presence of factual contradictions is evaluated. The experiment employs 100 modified question-answer-context triples, where provided contexts deliberately contradict general facts stored in the model’s internal knowledge. Models are instructed to answer solely based on the given context. The instruction used for this evaluation is presented in Table 3.
To quantify the model’s performance, we employ the following metrics:
  • Contextual Agreement Rate (CAR): The primary metric, defined as the proportion of cases where the model’s output agrees with the fact presented in the external context (context_answer).
  • Parametric Override Rate (POR): The proportion of cases where the model ignores the provided context and instead generates an answer based on its internal knowledge (param_answer), thereby overriding the external evidence.
  • Other Error Rate (OER): The proportion of answers that do not match either the context_answer or the param_answer. This metric captures hallucinations, off-topic responses, or other types of errors.
A high CAR value indicates strong adherence to context and robustness against model’s parametric knowledge, which is desirable for context-based reasoning in RAG. Additionally, we report the average response time in milliseconds (ms) to assess the computational efficiency of the model during this task.
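The sketch below illustrates one way to score CAR, POR, and OER from model outputs; the substring-based matching and the `normalize` helper are hypothetical simplifications, and the study’s actual judgment procedure may be more lenient (e.g., semantic rather than exact matching).

```python
def normalize(text: str) -> str:
    # Hypothetical helper: lowercase and collapse whitespace before matching.
    return " ".join(text.lower().split())

def truth_integration_metrics(records):
    """records: list of dicts with keys 'output', 'context_answer', 'param_answer'."""
    car = por = oer = 0
    for r in records:
        out = normalize(r["output"])
        if normalize(r["context_answer"]) in out:
            car += 1          # model follows the (contradictory) external context
        elif normalize(r["param_answer"]) in out:
            por += 1          # model overrides context with parametric memory
        else:
            oer += 1          # hallucination, off-topic, or otherwise unmatched answer
    n = len(records)
    return {"CAR": car / n, "POR": por / n, "OER": oer / n}
```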

3.2.3. Ideal RAG vs. Zero-Shot Learning

Existing studies mainly examine retrieval or overall system quality, assuming perfect retrieval ensures correct generation. This assumption, however, warrants closer examination. A two-phase evaluation is implemented leveraging the manually curated dataset of context-question-answer triples:
  • Zero-Shot Learning [47]: The model answers questions without access to any training examples or external context, establishing a closed-book baseline reflecting its intrinsic parametric knowledge.
  • Ideal RAG: The same questions are presented with verified ground-truth passages, simulating perfect retrieval.
By contrasting the results of these two configurations, the study quantifies the effect of providing perfect retrieval context and determines whether access to the correct supporting passage improves the generator’s accuracy. For the RAG setting, the standardized prompt template illustrated in Table 4 was used with Kazakh and English prompt-language variants.
To assess performance, both syntactic and semantic evaluation metrics were adopted. For syntactic evaluation, ROUGE-L [48] and METEOR [49] are used, which measure n-gram overlap between generated and reference answers. While these metrics provide insights into textual alignment, they do not fully capture semantic correctness. For deeper semantic assessment, Ragas [50], an LLM-based evaluation framework, is utilized, which makes it possible to recognize paraphrased or lexically divergent but semantically equivalent answers. Within the framework, the Answer Correctness metric is computed by combining factual and semantic similarity between the generated response and the ground-truth answer. Specifically, factual correctness considers True Positives (TP) as facts present in both the ground truth and the generated answer, False Positives (FP) as facts present only in the generated answer, and False Negatives (FN) as facts missing from the generated answer but present in the ground truth. The F1 score, which ensures a balanced evaluation of precision and recall, is then calculated as:
F1 = |TP| / (|TP| + 0.5 × (|FP| + |FN|))
Using embedding-based cosine similarity, this metric captures conceptual equivalence, rewarding correct paraphrased answers. The final answer correctness varying from 0 to 1 is obtained as a weighted average of the factual F1 score and semantic similarity, providing a comprehensive measure of both factual alignment and linguistic variation.
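As a worked example of this computation, the snippet below evaluates the factual F1 for hypothetical fact counts and combines it with an assumed semantic-similarity score; the 0.75/0.25 weighting is illustrative, and the weighting applied internally by Ragas may differ.

```python
def factual_f1(tp: int, fp: int, fn: int) -> float:
    # F1 = |TP| / (|TP| + 0.5 * (|FP| + |FN|)) as defined above.
    return tp / (tp + 0.5 * (fp + fn))

# Example: 3 facts shared with the ground truth, 1 spurious fact, 1 missing fact.
f1 = factual_f1(tp=3, fp=1, fn=1)
print(round(f1, 3))  # 0.75

# Illustrative final score: weighted average of factual F1 and an embedding-based
# semantic similarity of 0.90, using an assumed 0.75/0.25 weighting.
answer_correctness = 0.75 * f1 + 0.25 * 0.90
print(round(answer_correctness, 3))  # 0.788
```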
In addition to factual accuracy, linguistic robustness is evaluated, with particular attention to Kazakh-specific morphological characteristics. Some models produce outputs that contain the correct fact but are noisy, overly verbose, or exhibit code-switching (i.e., insertion of non-Kazakh tokens). Although such issues do not necessarily compromise factual correctness, they significantly impair practical usability and linguistic fluency. To capture this dimension, we introduce the Generation Quality (GQ) metric, which penalizes linguistic noise and excessive length. Specifically, it is computed as:
GQ = 1 − (0.6 × NoiseRate + 0.4 × LengthPenalty),
where NoiseRate is the proportion of non-Kazakh characters in the output, and LengthPenalty is a normalized measure of verbosity.
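A possible implementation of GQ is sketched below; the allowed Kazakh/Cyrillic character set and the length-normalization scheme are assumptions made for illustration, since NoiseRate and LengthPenalty are defined here only at the conceptual level.

```python
import re

# Characters treated as "Kazakh" output: Cyrillic letters (including Kazakh-specific
# ones), digits, and basic punctuation; everything else counts as noise.
ALLOWED = re.compile(r"[А-Яа-яЁёӘәҒғҚқҢңӨөҰұҮүҺһІі0-9.,!?;:()-]")

def generation_quality(answer: str, reference_len: int, max_ratio: float = 3.0) -> float:
    chars = [c for c in answer if not c.isspace()]
    noise = sum(1 for c in chars if not ALLOWED.match(c))
    noise_rate = noise / max(len(chars), 1)                 # proportion of non-Kazakh characters
    ratio = len(answer.split()) / max(reference_len, 1)     # verbosity relative to the reference
    length_penalty = min(max(ratio - 1.0, 0.0) / (max_ratio - 1.0), 1.0)
    return 1.0 - (0.6 * noise_rate + 0.4 * length_penalty)

# Example: a verbose answer with Latin-script insertions receives a lower GQ (≈ 0.40).
print(round(generation_quality("Абай Құнанбайұлы is a famous Kazakh poet", reference_len=3), 2))
```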
In addition to automatic scoring, human-in-the-loop analysis also serves to evaluate the applicability of the RAGAS [50] framework.

3.3. RAG vs. Closed-Book QA Evaluation

The objective of this experiment is to evaluate the effectiveness of RAG for Kazakh question-answering by focusing on the best-performing open-source model identified in prior evaluations. We examine whether a Kazakh-adapted open-source model can achieve performance levels comparable to state-of-the-art proprietary systems when equipped with retrieval support. Performance gains attributable to retrieval augmentation are quantified under realistic conditions, in which retrieved passages may be incomplete, noisy, or only partially relevant.
A total of 72,315 text chunks were indexed to construct the retrieval database. All vector representations were L2-normalized, which allows cosine similarity to be computed as a simple dot product, improving both efficiency and stability. Indexing was performed with FAISS, applying a similarity threshold of 0.35 to filter low-relevance results. The prompting strategy enforced a strict grounding policy: the model was instructed to generate a concise, one-sentence answer in Kazakh exclusively based on the provided passages. If sufficient evidence was not present, the system was required to output the fallback phrase “No answer found”.
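The indexing and thresholded search described above can be expressed roughly as follows; the embedding step is abstracted away, and since the exact FAISS index type is not specified, a flat inner-product index is assumed here.

```python
import numpy as np
import faiss

def build_index(chunk_vectors: np.ndarray) -> faiss.IndexFlatIP:
    vectors = np.asarray(chunk_vectors, dtype="float32")
    faiss.normalize_L2(vectors)                  # L2-normalize so inner product equals cosine
    index = faiss.IndexFlatIP(vectors.shape[1])  # flat inner-product index (assumed type)
    index.add(vectors)
    return index

def retrieve(index: faiss.IndexFlatIP, query_vector: np.ndarray, k: int = 5, threshold: float = 0.35):
    q = np.asarray(query_vector, dtype="float32").reshape(1, -1)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    # Keep only passages whose cosine similarity clears the 0.35 relevance threshold.
    return [(int(i), float(s)) for s, i in zip(scores[0], ids[0]) if i != -1 and s >= threshold]
```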

4. Models and Datasets

Two distinct model groups were evaluated corresponding to the retriever and generator components, as illustrated in Figure 3.

4.1. Models

4.1.1. Embedding Models Evaluated

Six embedding models (see Table 5) were evaluated and compared against the BM25 baseline [51].

4.1.2. LLMs Evaluated

Five LLMs were evaluated in both closed-book and RAG configurations to assess their ability to answer questions in Kazakh, as summarized in Table 6.

4.2. Datasets

Retriever performance was evaluated using the ISSAI/kazqad-retrieval dataset, a curated Kazakh question–passage benchmark derived from the Kazakh Question Answering Dataset [42]. The kazqad-retrieval split contains queries paired with one or more gold-standard relevant passages, enabling precise measurement of retrieval quality. This dataset was chosen for its alignment with the distribution, linguistic features, and complexity of questions in the QA experiments, ensuring methodological consistency between retrieval benchmarking and generation evaluation.
To simulate realistic retrieval conditions and evaluate the effectiveness of a real-world end-to-end RAG system under imperfect inputs (as discussed in Section 3.3), we constructed a comprehensive context dataset consisting of 72,315 text chunks, which were subsequently indexed into FAISS.
To provide retrieval context, passages from multiple sources were incorporated:
  • KazQAD [60]—a large-scale Kazakh QA dataset consisting of more than 68,000 question–context–answer triples; the Natural Questions subset translated into Kazakh is additionally used.
  • Arailym-tleubayeva/small_kazakh_corpus [61]—a collection of short Kazakh texts designed for semantic search, which enriches the retriever with varied linguistic structures.
  • MBZUAI/KazMMLU (Kazakh_History subcorpus) [62]—a dataset of factual questions in Kazakh, targeting historical and culturally specific domains.
  • Kyrmasch/sKQuAD [63]—a Kazakh QA dataset with 1000 annotated records.

4.2.1. Testbeds

To enable a comprehensive and linguistically grounded evaluation of large language models (LLMs) for Kazakh question answering (QA), three complementary testbeds were developed (Figure 4). Each targets a specific reasoning dimension within the RAG evaluation framework.
Testbed 1: Knowledge Gap Detection. This subset evaluates the model’s ability to identify questions that cannot be answered from the provided context and to appropriately abstain. It comprises 200 unanswerable question–context pairs.
Testbed 2: External Truth Integration. This testbed assesses whether models prioritize external evidence over parametric (internal) knowledge when contradictions occur. It includes 100 question–context–answer triples where the supplied context deliberately conflicts with general knowledge (e.g., stating that the capital of Ireland is Paris). Correct model behavior requires rejecting its internal memory and grounding the answer on the retrieved text.
Testbed 3: Context–Question–Answer. This is the main evaluation corpus that consists of 1254 context–question–answer triples (1129 factoid and 125 definitional questions [64]) paired with supporting contexts. This dataset is used for the ideal RAG vs. Zero-shot experiments and downstream closed-book comparisons. Final dataset statistics by the main domain distribution are presented below in Figure 5.

4.2.2. Testbeds Creation Process

The overall dataset creation process followed a structured four-stage workflow designed for transparency and reproducibility (Figure 6). The process integrates multi-source data collection, human annotation, and consensus-based validation.
The scope, domains, and objectives of the experiments were first defined, guiding the design of testbeds and informing subsequent data collection and labeling strategies. Data was drawn from both raw Kazakh corpora and publicly available question–answer datasets. Following data acquisition, a detailed annotation manual was developed. Two native-speaker annotators were then trained and provided with calibration examples to ensure consistency in labeling. Subsequently, the annotators independently labeled each item; inter-annotator agreement (IAA) was computed using weighted Cohen’s κ for ordinal ratings and Krippendorff’s α for nominal categories (Table 7). In cases of disagreement or contradictory content, items were discarded. Finally, consensus was reached on all remaining data. Further information on the IAA is available in our GitHub repository (see Data Availability Statement). IAA values were interpreted according to established reliability standards: scores of ≥0.80 were considered strong agreement, 0.67–0.79 substantial agreement, and values below 0.67 moderate or lower reliability.
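For completeness, the agreement statistics named above can be computed roughly as follows, assuming the scikit-learn and krippendorff Python packages; the label values shown are toy examples, not the study’s annotations.

```python
from sklearn.metrics import cohen_kappa_score
import krippendorff

# Ordinal ratings (e.g., 1-5 quality scores) from two annotators.
ann1 = [5, 4, 4, 3, 5, 2]
ann2 = [5, 4, 3, 3, 5, 3]
kappa = cohen_kappa_score(ann1, ann2, weights="quadratic")  # weighted Cohen's kappa

# Nominal categories (e.g., answerable / unanswerable) as a reliability matrix:
# one row per annotator, one column per item.
alpha = krippendorff.alpha(
    reliability_data=[[0, 1, 1, 0, 1], [0, 1, 0, 0, 1]],
    level_of_measurement="nominal",
)
print(round(kappa, 3), round(alpha, 3))
```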
Following the inter-annotator evaluation, the datasets underwent an additional refinement stage to resolve residual discrepancies, clarify ambiguous instances, and harmonize label definitions across all testbeds prior to final release.

5. Results

5.1. Results of Retriever Evaluation

Multiple sparse and dense retrievers were evaluated on the ISSAI/kazqad-retrieval dataset to identify the most effective model for Kazakh question answering [42].
As shown in Table 8, dense multilingual encoders markedly outperform BM25 across all recall levels. BGE-M3 achieves the highest R@1 = 0.641, R@10 = 0.929, and MRR = 0.746, representing a sizable gain over BM25 (R@1 = 0.489, R@10 = 0.807, MRR = 0.591). Snowflake Arctic-Embed ranks a close second (R@1 = 0.591, R@10 = 0.907) and shows the largest cosine-margin separation (ΔCos ≈ 0.099), indicating stronger discrimination between relevant and non-relevant passages. E5-base, encoded with the recommended query:/passage: prefixes, improves upon BM25 (R@1 = 0.562, R@10 = 0.859) but lags behind BGE and Snowflake. LaBSE and OpenAI/text-embedding-3-large perform weakest on this Kazakh-centric benchmark (R@1 = 0.365 and 0.323, respectively).
Overall, BGE-M3 emerges as the most effective retriever, with Snowflake Arctic-Embed providing a strong alternative when a wider relevance margin (ΔCos) is desirable.
Figure 7 illustrates these trends visually, emphasizing the performance gap between modern dense retrieval approaches and older embedding or sparse methods.

5.2. Closed-Book LLM Evaluation and Model Selection for RAG

5.2.1. Results of Knowledge Gap Detection

In this experiment, models were evaluated for their capacity to detect when a question was unanswerable from the supplied evidence and to abstain by emitting the exact token “Жауап табылмады” (“No answer found”). On Kazakh prompts (see Table 9), GPT-4o demonstrated pronounced gap sensitivity (AR ≈ 0.90; HR ≈ 0.10) with negligible over-explanation and sub-second latency, whereas KazLLM-8B showed partial compliance (AR ≈ 0.38) but frequent token-plus-verbiage violations; Gemini-2.5-Flash, Sherkala-8B, and Irbis-7B largely failed to abstain.
These patterns across AR/HR and token exactness are visualized in Figure 8, while the corresponding latency profile—critical for interactive QA—is summarized in Figure 9, where GPT-4o is fastest, Gemini-2.5-Flash is moderate, KazLLM-8B is slower, and Irbis-7B and especially Sherkala-8B incur the largest delays.
English prompts (see Table 10) replicated this ranking (GPT-4o AR ≈ 0.92; KazLLM-8B AR ≈ 0.46). Aggregated over Base/AbstainFirst/Strict styles, stricter wording benefited only those models already exhibiting gap awareness, indicating that abstention fidelity is predominantly model-intrinsic rather than prompt-induced.
The metric profiles are visualized in Figure 10, and the corresponding latency characteristics—relevant for interactive QA—are summarized in Figure 11. Collectively, these findings establish disciplined abstention as a necessary condition for closed-book reliability and a prerequisite for meaningful RAG gains.

5.2.2. Results of External Truth Integration

The results presented in Table 11 reveal a consistent inability of open-source Kazakh fine-tuned LLMs to suppress their parametric knowledge when confronted with contradictory contextual information. Across all runs, these models achieved an almost negligible Contextual Agreement Rate (CAR), indicating a failure to prioritize external evidence over memorized facts. Instead, they predominantly relied on internal representations, resulting in uniformly high Other Error Rates (OER ≥ 0.83) that reflect irrelevant, incomplete, or hallucinatory outputs.
Among domestic models, KazLLM-8B achieved a minimal yet non-zero CAR (0.003 ± 0.006) together with a moderate Parametric Override Rate (POR = 0.160 ± 0.020). These values suggest partial activation of internal factual memory even when the model is explicitly instructed to rely solely on the provided context. Irbis-7B-v0.1 exhibited a lower POR (0.010 ± 0.014); however, its overall accuracy remained negligible, as evidenced by an exceptionally high OER = 0.990 ± 0.014, indicating a near-complete absence of context-sensitive reasoning. Sherkala-8B-Chat showed a similar behavioral pattern (CAR = 0.000; POR = 0.030 ± 0.010; OER = 0.970 ± 0.010) while maintaining an average latency of approximately 2.3 s ± 0.04 s per query, which, although faster than earlier iterations, still limits its suitability for real-time applications.
In contrast, GPT-4o was the only system to achieve a clearly non-zero CAR (0.010 ± 0.000), demonstrating a measurable—though still limited—capacity to prioritize contextual truth over internal bias. Nevertheless, its relatively high POR = 0.243 ± 0.015 indicates that it still frequently defaults to stored parametric knowledge in the presence of factual contradictions. Gemini-2.5-Flash, meanwhile, produced zero CAR and POR values, coupled with a highly variable OER = 0.333 ± 0.471 and moderately variable response times (1 370 ± 66 ms), reflecting an overly cautious adherence to task instructions rather than inconsistent performance.
Overall, these findings highlight a persistent context-alignment gap between frontier-scale and locally fine-tuned Kazakh models. None of the open-source systems demonstrated robust contextual obedience, emphasizing that the mitigation of parametric bias remains a central unsolved challenge for context-grounded reasoning in low-resource language settings.
Figure 12 compares the performance of the five evaluated models under deliberately contradictory contextual evidence. The visualization highlights the structural limitations of Kazakh-centric open-source LLMs, which consistently revert to their internal parametric knowledge, yielding near-zero CAR values and uniformly high OER (≥0.83). Among them, KazLLM-8B demonstrates the highest relative contextual sensitivity (CAR ≈ 0.003; POR ≈ 0.16), outperforming Irbis-7B and Sherkala-8B, yet still far below the threshold required for reliable context-based reasoning. GPT-4o achieves the best overall balance, with the highest CAR (≈0.01) and lowest OER (≈0.75), reflecting partial but measurable alignment with contextual truth. Gemini-2.5-Flash, though computationally efficient, shows large variance in OER and latency, indicating instability across runs. Taken together, these results illustrate the continuing challenge of achieving consistent and trustworthy context grounding in multilingual, low-resource LLMs.
In addition to accuracy-based metrics, the computational effectiveness of each model was evaluated by measuring the average response latency (Figure 13). KazLLM-8B and GPT-4o demonstrated the fastest generation times, averaging approximately 0.7–0.8 s per query, confirming their suitability for interactive or near-real-time RAG pipelines. Gemini-2.5-Flash and Irbis-7B-v0.1 showed moderate delays (≈0.9 s and 2.5 s, respectively), while Sherkala-8B-Chat exhibited the slowest performance (≈2.3 s ± 0.04 s), making it less suitable for time-sensitive applications.

5.2.3. Results of Ideal RAG vs. Zero-Shot Learning

Table 12 presents the Zero-shot learning performance, which reflects the intrinsic parametric knowledge of the evaluated models without access to any external context. The results show that large proprietary models, GPT-4o and Gemini-2.5-flash, substantially outperform all Kazakh-centric open models, achieving average Answer correctness scores of 0.60 and 0.59 under Kazakh prompts, and 0.59 and 0.55 under English prompts, respectively. In contrast, Kazakh-specific LLMs such as KazLLM-8B, Irbis-7B-v0.1, and Sherkala-8B-Chat exhibit considerably lower correctness scores (ranging between 0.31 and 0.43 across both prompt languages), highlighting their limited factual coverage and reasoning ability when relying solely on internal knowledge. Prompt language does not consistently affect performance, although Sherkala-8B-Chat shows slightly better results under Kazakh prompts (0.38) compared to English (0.31).
Table 13 reports the results under the Ideal RAG setting, where all models are provided with the gold-standard retrieved passages. This setting reveals the upper bound of performance when models are given complete and relevant external context. All models benefit substantially from retrieval, with KazLLM-8B achieving an average correctness of 0.86–0.87 across languages, outperforming its zero-shot baseline by over 0.40 points. GPT-4o and Gemini-2.5-flash maintain their leading positions (0.87–0.89), but the performance gap between proprietary and Kazakh-centric models narrows considerably, indicating that retrieval can compensate for limited internal knowledge. Interestingly, Sherkala-8B-Chat shows a pronounced improvement under Kazakh prompts (0.73) compared to English (0.63), suggesting its retrieval-augmented reasoning is more effective in the Kazakh language. For each metric, the mean, standard deviation, and 95% confidence intervals were reported, estimated using non-parametric bootstrapping (n = 10,000). All pairwise comparisons between models were statistically significant at p < 0.05 (see Appendix A.1).
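The bootstrap procedure used for these confidence intervals can be sketched as follows; the per-question correctness scores passed in are hypothetical placeholders rather than the study’s data.

```python
import numpy as np

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Non-parametric bootstrap of the mean with a percentile 95% CI."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (lo, hi)

# Toy usage with hypothetical per-question answer-correctness scores.
mean, (ci_lo, ci_hi) = bootstrap_ci([0.9, 0.8, 0.95, 0.7, 0.85])
```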
Figure 14 plots Generation Quality (GQ) against Answer Correctness for all models in both Zero-shot (Figure 14a) and Ideal RAG (Figure 14b) configurations. In the Zero-shot setting, GPT-4o and Gemini-2.5-flash achieve the highest performance, with near-perfect GQ (0.95–0.98) and correctness around 0.59–0.60, clearly leading all other models. Among the Kazakh-centric models, KazLLM-8B performs best, reaching correctness of 0.42–0.43 with relatively strong GQ (0.82–0.84). Sherkala-8B-Chat shows moderate correctness under Kazakh prompts (0.38) but lower GQ (0.68), while its English performance is weaker on both dimensions. Irbis-7B-v0.1 exhibits slightly lower correctness (0.34–0.35) but maintains decent GQ (≈0.80).
Under the Ideal RAG configuration, proprietary models remain on top, with GPT-4o and Gemini achieving correctness of 0.87–0.89 and GQ above 0.95. However, KazLLM-8B closes much of the gap, reaching correctness of 0.86–0.87 with strong GQ (0.94–0.96). Sherkala-8B-Chat shows a notable jump in Kazakh (0.73 correctness), though its English performance remains lower (0.63). Irbis-7B-v0.1 also benefits from retrieval, increasing to 0.71–0.73 correctness with moderate GQ gains. These trends highlight the strong impact of providing relevant evidence, especially for Kazakh-centric models.
Table 14 highlights typical answer quality failure modes such as verbosity and language switching, which are common in Zero-shot outputs of open models but substantially reduced with applied retrieval augmentation.

5.3. End-to-End RAG Evaluation

This experiment evaluates the full retrieval-augmented generation pipeline by measuring how different retriever–generator combinations affect question answering accuracy in Kazakh. The goal is to determine how retrieval quality affects downstream performance and whether higher relevance yields better answers.
To explore this, two distinct RAG configurations were analyzed:
  • A baseline setup using GPT-4o as the generator with text-embedding-3-large as the retriever. Although both GPT-4o and Gemini-2.5-flash demonstrated strong reliability in controlled experiments, GPT-4o was selected as the primary generator for the end-to-end RAG pipeline due to its favorable balance of cost, stability, and integration. GPT-4o provided predictable token-based pricing, low latency, and consistently cleaner outputs, which minimized evaluation bias.
  • A pipeline using open-source KazLLM-8B as the generator paired with other dense-embedding models.
Results are reported for four retrieval settings with varying numbers of candidate documents (top-1, 3, 5, and 10 retrieved passages). Table 15 summarizes the results for the GPT-based QA system. The model consistently produced stable, accurate, and grounded answers in Experiment 2. However, its retriever (text-embedding-3-large) performs significantly worse than open-source alternatives.
Even with a high-performing generator, the pipeline achieved a maximum answer correctness of only ≈0.65 at best (whereas its Ideal RAG score was ≈0.88). This confirms that a weak retriever can severely limit RAG performance, regardless of generation quality.
Table 16 shows results for KazLLM-8B combined with various retrievers. Overall, strong retrievers significantly improve downstream QA, although the correlation is not strictly monotonic. The BAAI/bge-m3 retriever demonstrates solid performance across all ranks, reaching 0.71 answer correctness at Top-5, with stable recall values increasing from 0.58 at Top-1 to 0.65 at Top-10. Interestingly, Snowflake-arctic achieves the highest overall QA accuracy, despite having slightly lower recall scores. It records 0.745 correctness at Top-1, 0.76 at Top-5, and maintains competitive performance at Top-10 (0.725). This demonstrates that higher retrieval scores do not necessarily guarantee superior QA performance, and that embedding quality can play a critical role in how retrieved passages are leveraged by the generator.
In contrast, intfloat/multilingual-e5-base performs consistently below the top two retrievers, with Top-5 answer correctness reaching only 0.56. The LaBSE model exhibits the weakest performance overall, with Top-5 correctness below 0.40, reflecting its limited retrieval and alignment capabilities. As summarized in Appendix A.2, all pairwise comparisons between retrievers at the Top-5 setting were statistically significant (paired t-test, p < 0.05).
Across all models, a non-linear trend emerges: increasing the number of retrieved passages improves performance up to Top-5, but further expansion to Top-10 tends to introduce noise and reduce answer correctness. For example, bge-m3 increases from 0.70 at Top-1 to 0.71 at Top-5, but then declines to 0.69 at Top-10. Snowflake follows a similar trajectory, peaking at 0.76 at Top-5 before slightly dropping at Top-10. This pattern indicates that retrieving more passages beyond an optimal threshold can dilute relevance signals, underscoring the importance of balanced retrieval strategies.

6. Error Analysis

Error analysis was performed on 100 randomly sampled failure cases drawn from the top-3, top-5, and top-10 retrieval configurations to better understand why retrieval metrics did not consistently predict end-to-end QA performance. The analysis focuses on questions where the correct supporting passage was retrieved, yet the final answer remained incorrect, thereby isolating errors arising from the generation stage rather than retrieval. Most errors fall into four core categories. Table 17 summarizes representative examples for each error type.
Figure 15 compares the relative frequency of the four error types across the 100 analyzed answers. The most common failure type was Generation Errors despite correct retrieval, where the correct answer was present in the retrieved passages but the model either produced unrelated text or abstained from answering. This indicates weaknesses in evidence selection and grounding rather than retrieval coverage. Code-switching errors were the second most frequent category. The model often mixed Cyrillic and Latin scripts or switched to English, particularly in scientific and geographic contexts. Although factually correct, such answers were penalized by evaluation metrics sensitive to language and script mismatches. Semantic drift errors occurred when retrieved passages were thematically related but did not contain the required fact, leading to contextually plausible yet incorrect answers. Finally, granularity issues arose when the level of detail in retrieved content did not align with the question’s scope, either being too coarse or overly specific.
Moreover, the results indicate that morphological variation is not the primary source of error. Cases where the gold and predicted answers differ only in surface form (e.g., suffixation or slight morphological variants) are generally handled well by the Answer Correctness metric, which assigns high correctness scores even when the answers are not string-identical (as presented in Table 18).
Overall, these findings indicate that generation errors and language inconsistencies dominate failure cases, rather than deficiencies in retrieval or morphological complexity. Addressing code-switching behavior and improving the model’s grounding to retrieved passages may therefore yield the most significant gains in low-resource language QA performance.

7. Discussion

The results underline three key insights into RAG performance for Kazakh question answering: (1) retrieval can greatly enhance QA in low-resource settings; (2) there is a clear trade-off between added evidence and noise; and (3) high retrieval metrics do not always translate into higher QA accuracy. For example, the KazLLM-8B model in a closed-book setting achieved answer correctness of only 0.427, highlighting the limitations of relying solely on parametric knowledge. However, when provided with gold-standard passages (ideal RAG scenario), performance jumped to 0.86. This dramatic improvement confirms that, while internal knowledge is insufficient, the model can effectively use external evidence when it is highly relevant.
Expanding the number of retrieved passages (top-k parameter) improves performance up to a point but is subject to diminishing returns and can introduce distracting content. For example, Snowflake-arctic improved from 0.745 answer correctness at top-1 to 0.76 at top-5, then dropped to 0.725 at top-10. This pattern holds across models. The added passages often contain noise (semantically similar but irrelevant information) that dilutes true grounding.
Moreover, although retrieval metrics provide useful indicators of performance, they are not sufficient predictors of downstream QA accuracy. For example, Snowflake-arctic achieved lower retrieval scores than BAAI/bge-m3 yet matched or exceeded its QA performance under certain configurations. This finding suggests that highly ranked passages may be semantically relevant but still lack the specific information required for accurate answer generation. Such a disconnect underscores the need for retrieval systems to optimize not only for relevance but also for answer utility. These observations highlight critical considerations for the design of effective RAG pipelines.
Error analysis further revealed that common reasons for incorrect answers, despite the correct passage being retrieved, were evidence selection failures, code-switching between scripts or languages, semantic drift, and mismatched answer granularity. Morphological variation was not a primary source of error, suggesting that embedding-based evaluation metrics handle surface-form differences robustly.
Evaluating RAG rigorously is challenging because traditional retrieval metrics emphasize surface relevance rather than contribution to the final answer. RAG systems require retrieval that maximizes utility—the degree to which a passage actually helps generate a correct answer. Further, emerging practices such as LLM-based evaluation introduce issues around calibration, consistency, and alignment with human judgment. These issues, combined with high computational cost of full RAG evaluation, demand more efficient and task-specific evaluation methodologies that focus on grounding and relevance rather than traditional IR metrics alone.
The evaluation metrics considered in this work are not the only ones. There are other aspects by which the quality of QA systems can be judged, such as grammatical correctness [65,66,67], appropriateness of tone [68], or the model’s tendency to agree with the user [69,70]. In this sense, it is useful to view the evaluation framework as a set of subsystems, each assessing a specific aspect, but not necessarily directly connected to the others through a single overall quality metric. From a system-theoretic standpoint, a RAG system can be regarded as a collection of interacting subsystems, each performing its own function but linked through shared information and control flows. This framework provides a formal way to define and extend the RAG architecture.
From a system-theoretic perspective, retrieval, generation, and verification function as interconnected subsystems within an adaptive information-processing pipeline. Retrieval reduces input entropy by selecting relevant evidence; generation converts this structured input into linguistic output; and verification provides feedback regulating their interaction. Hallucinations arise from insufficient feedback, causing uncontrolled amplification of internal representations, whereas over-grounding reflects excessive damping from external noise. Within this framework, RAG performance metrics serve as indicators of system stability, underscoring the importance of feedback and control mechanisms for maintaining reliable, balanced QA performance.

8. Limitations

Several limitations must be acknowledged. First, the scope of this study is intentionally narrow, focusing primarily on factoid and definitional QA within curated datasets. As a result, the findings may not generalize to broader Kazakh NLP tasks or more interactive, dialogue-based scenarios. Second, limitations arise from the composition of the evaluation dataset, which includes a mix of native and translated Kazakh content. While this design reflects the current realities of low-resource language research, it may introduce confounding effects related to translation quality and linguistic naturalness. This limitation highlights the need for future datasets that isolate native-only content to enable more controlled comparisons. Third, this study does not directly assess grammatical, morphological, or morphosyntactic characteristics. Our focus was on factual accuracy and retrieval-augmented reasoning rather than linguistic generation quality. As such, we cannot make definitive claims regarding the grammatical correctness or structural adequacy of model outputs. Fourth, the retrievers used in this work were not fine-tuned for Kazakh, primarily due to the lack of large-scale, high-quality retrieval data. This limitation may have constrained their ability to capture language-specific patterns and optimize retrieval relevance. In addition, the heavy skew in domain distribution represents a further limitation that may affect the generalizability of the findings across more diverse topics and application contexts. Finally, although our broad evaluation across multiple proprietary and open-source models provided a comprehensive overview, it came at the expense of in-depth system-level analysis. Taken together, these factors indicate that while retrieval augmentation allows Kazakh-centric models to approach state-of-the-art performance under controlled conditions, the real-world accuracy and linguistic robustness of current systems remain insufficient for practical deployment. Furthermore, the findings should not be assumed to generalize to other Turkic or low-resource languages without further empirical validation.

9. Conclusions

Our findings provide important insights into the strengths and limitations of current LLMs for Kazakh QA. Proprietary models (GPT-4o, Gemini-2.5-flash) demonstrate consistent reliability in both abstaining from unanswerable queries and adhering to contextual truth, achieving near-optimal performance under RAG. In contrast, open-source Kazakh-focused models, despite being explicitly adapted for the language, show mixed results: KazLLM-8B benefits greatly from retrieval but remains inconsistent in knowledge gap detection, while Sherkala and Irbis display limited ability to abstain or privilege external evidence over parametric knowledge.
This study set out to evaluate how proprietary and Kazakh-oriented open-source LLMs perform on Kazakh QA, and whether retrieval augmentation can bridge the gap between them. Three main findings emerge. First, proprietary state-of-the-art models such as GPT and Gemini clearly outperform open-source Kazakh LLMs in closed-book QA. Their advantage is most visible in knowledge-gap detection, where abstention rates above 0.9 show disciplined handling of unanswerable questions, while KazLLM, Sherkala, and Irbis struggle with hallucinations. These results answer our first research question (RQ1) by confirming that closed-book performance of Kazakh-specific models still lags behind proprietary LLMs due to limited data and smaller parameter scale.
Second, our results show that retrieval augmentation substantially improves performance across all models. Under ideal RAG, even a Kazakh-specific model such as KazLLM-8B approaches frontier-level correctness (0.867 vs. 0.869 for GPT-4o). This addresses RQ2, demonstrating that RAG can compensate for limited parametric knowledge by grounding answers in external context. Beyond improving accuracy, retrieval also enhances answer fluency and reduces hallucinations, suggesting that it serves as a stabilizing mechanism in morphologically complex languages. Third, in the realistic end-to-end setting, the choice of retriever strongly shapes final answer quality: KazLLM-8B paired with the Snowflake retriever yields the highest end-to-end answer correctness in our experiments, while retrievers with the best recall do not always produce the most correct answers, showing that strong retrieval metrics alone do not guarantee accurate generation.
Future research should focus on developing larger and more diverse Kazakh datasets to better capture the complexity of real-world information needs. It should also extend beyond factual QA to encompass interactive, multi-turn, and multimodal scenarios, ensuring that QA systems are not only accurate on benchmarks but also robust, adaptable, and practically useful in real-world applications. Fine-tuning retrievers on Kazakh corpora and exploring more advanced prompting strategies represent important next steps for improving relevance ranking and the overall quality of retrieved evidence. Furthermore, extending the evaluation framework to other Turkic languages would help assess how well the proposed approaches transfer across related linguistic contexts. Strengthening context-grounding mechanisms and mitigating common generation issues, such as code-switching and inconsistent language use, remain essential for enhancing output reliability. Further studies should also conduct more fine-grained, diagnostic evaluations on a smaller number of systems to better understand the linguistic and architectural factors underlying model success or failure. Finally, an important complementary direction is to strengthen the theoretical underpinnings of retrieval-augmented QA. Formalizing RAG pipelines as cybernetic systems, using entropy-based measures and feedback-control mechanisms, could enable the development of self-regulating, adaptive QA architectures that maintain stability between internal model knowledge and external evidence.

Author Contributions

Conceptualization, A.M.; Data curation, A.T.; Formal analysis, A.T.; Funding acquisition, A.S.; Investigation, A.T.; Methodology, A.M.; Project administration, A.S.; Supervision, A.N., A.S. and S.E.S.; Validation, A.M. and A.T.; Visualization, A.M. and A.T.; Writing—original draft, A.M. and A.T.; Writing—review, A.M. and A.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by the Science Committee of the Ministry of Science and Higher Education of the Republic of Kazakhstan under the program for grant financing of young scientists for scientific and/or scientific-technical projects for the years 2024–2026 No. AP22787410.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data, code, and configuration files presented in this study are available in GitHub at https://github.com/Arailym-ray/KAZ-QA-RAG (accessed on 20 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI    Artificial Intelligence
API    Application Programming Interface
AR    Abstention Rate
CAR    Contextual Agreement Rate
CI    Confidence Interval
ETI    External Truth Integration
FAISS    Facebook AI Similarity Search
FP    False Positive
FN    False Negative
GQ    Generation Quality
HR    Hallucination Rate
IAA    Inter-Annotator Agreement
IR    Information Retrieval
ISSAI    Institute of Smart Systems and Artificial Intelligence (Nazarbayev University)
LLM    Large Language Model
MBZUAI    Mohamed bin Zayed University of Artificial Intelligence
METEOR    Metric for Evaluation of Translation with Explicit ORdering
MRR    Mean Reciprocal Rank
NLP    Natural Language Processing
OER    Other Error Rate
POR    Parametric Override Rate
QA    Question Answering
RAG    Retrieval-Augmented Generation
RAGAS    Retrieval-Augmented Generation Assessment
ROUGE-L    Recall-Oriented Understudy for Gisting Evaluation (Longest Common Subsequence)
SD    Standard Deviation
TP    True Positive

Appendix A

Appendix A.1

The table below summarizes the statistical significance results (paired t-test p-values) on the Answer Correctness metric for the Zero-shot and Ideal RAG settings.
Table A1. Pairwise statistical significance of Zero-shot and Ideal RAG settings (p-values, paired t-test on Answer Correctness metric).
Model 1 | Model 2 | Zero-Shot (EN) | Zero-Shot (KK) | Ideal RAG (EN) | Ideal RAG (KK)
GPT-4o | KazLLM-8B | 0.000 | 0.028 | 0.03 | 0.000
GPT-4o | Sherkala-8B | 0.000 | 0.000 | 0.000 | 0.000
GPT-4o | Irbis-7B | 0.000 | 0.000 | 0.000 | 0.0016
GPT-4o | Gemini-2.5 | 0.004 | 0.0195 | 0.0017 | 0.0002
KazLLM-8B | Sherkala-8B | 0.0001 | 0.000 | 0.000 | 0.000
KazLLM-8B | Irbis-7B | 0.000 | 0.000 | 0.000 | 0.000
KazLLM-8B | Gemini-2.5 | 0.000 | 0.000 | 0.0013 | 0.047
Sherkala-8B | Irbis-7B | 0.0034 | 0.0064 | 0.003 | 0.004
Sherkala-8B | Gemini-2.5 | 0.000 | 0.000 | 0.000 | 0.000
Irbis-7B | Gemini-2.5 | 0.000 | 0.000 | 0.000 | 0.000
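For reproducibility, the p-values in Table A1 follow from a standard paired t-test over per-question Answer Correctness scores. The snippet below is a minimal sketch assuming the per-question scores of two models are available as equal-length lists; the score values shown are illustrative, not taken from the experiments.

from scipy.stats import ttest_rel

# Per-question Answer Correctness for two models on the same questions (toy values).
scores_model_1 = [0.91, 0.40, 0.77, 0.85]
scores_model_2 = [0.88, 0.15, 0.60, 0.83]

t_stat, p_value = ttest_rel(scores_model_1, scores_model_2)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.4f}")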

Appendix A.2

The table below shows comparative RAG results with KazLLM-8B as the generator: Top-5 answer-correctness comparisons and the corresponding paired t-test p-values across retrievers.
Table A2. Pairwise statistical significance across retrievers (p-values, paired t-test on Answer Correctness metric).
Retriever A | Retriever B | Paired p-Value
BGE-m3 | Snowflake-Arctic | 0.000
BGE-m3 | E5-base | 0.000
BGE-m3 | LaBSE | 0.000
Snowflake-Arctic | E5-base | 0.0001
Snowflake-Arctic | LaBSE | 0.000
E5-base | LaBSE | 0.0002

References

  1. Jiang, S.; Xie, X.; Tang, R.; Wang, X.; Sun, K.; Li, G.; Xu, Z.; Xue, P.; Li, Z.; Fu, X. ARGUS: Retrieval-Augmented QA System for Government Services. Electronics 2025, 14, 2445. [Google Scholar] [CrossRef]
  2. Jiang, F.; Qin, C.; Yao, K.; Fang, C.; Zhuang, F.; Zhu, H.; Xiong, H. Enhancing Question Answering for Enterprise Knowledge Bases Using Large Language Models. In Proceedings of the International Conference on Database Systems for Advanced Applications, Singapore, 8–11 July 2024; Springer Nature: Singapore, 2024; pp. 273–290. [Google Scholar]
  3. Noy, S.; Zhang, W. Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence. Science 2023, 381, 187–192. [Google Scholar] [CrossRef] [PubMed]
  4. Khowaja, S.A.; Khuwaja, P.; Dev, K.; Wang, W.; Nkenyereye, L. ChatGPT Needs SPADE (Sustainability, Privacy, Digital Divide, and Ethics) Evaluation: A Review. Cogn. Comput. 2024, 16, 2528–2550. [Google Scholar] [CrossRef]
  5. Veitsman, Y.; Hartmann, M. Recent Advancements and Challenges of Turkic Central Asian Language Processing. arXiv 2024, arXiv:2407.05006. [Google Scholar] [CrossRef]
  6. WorldData.info. Spread of the Kazakh Language. Total Native Speakers: Approximately 15.3 Million, Including 13.3 Million in Kazakhstan. Available online: https://www.worlddata.info/languages/kazakh.php (accessed on 12 August 2025).
  7. Koto, F.; Joshi, R.; Mukhituly, N.; Wang, Y.; Xie, Z.; Pal, R.; Orel, D.; Mullah, P.; Turmakhan, D.; Goloburda, M.; et al. Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh. arXiv 2025, arXiv:2503.01493. [Google Scholar]
  8. Institute of Smart Systems and Artificial Intelligence (ISSAI), Nazarbayev University. LLama-3.1-KazLLM-1.0-8B [Large Language Model]. Hugging Face. Available online: https://huggingface.co/issai/LLama-3.1-KazLLM-1.0-8B (accessed on 10 August 2025).
  9. Astana Hub. AlemLLM [Large Language Model]. Hugging Face. Available online: https://huggingface.co/astanahub/alemllm (accessed on 10 August 2025).
  10. Kadyrbek, N.; Tuimebayev, Z.; Mansurova, M.; Viegas, V. The Development of Small-Scale Language Models for Low-Resource Languages, with a Focus on Kazakh and Direct Preference Optimization. Big Data Cogn. Comput. 2025, 9, 137. [Google Scholar] [CrossRef]
  11. Zaib, M.; Zhang, W.E.; Sheng, Q.Z.; Mahmood, A.; Zhang, Y. Conversational question answering: A survey. Knowl. Inf. Syst. 2022, 64, 3151–3195. [Google Scholar] [CrossRef]
  12. Alkhaldi, T.Y.S. Studies on Question Answering in Open-Book and Closed-Book Settings. Ph.D. Thesis, Kyoto University, Kyoto, Japan, 2023. [Google Scholar]
  13. Wang, C.; Liu, P.; Zhang, Y. Can generative pre-trained language models serve as knowledge bases for closed-book QA? arXiv 2021, arXiv:2106.01561. [Google Scholar]
  14. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 1–38. [Google Scholar] [CrossRef]
  15. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Kiela, D.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
  16. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H.; et al. Retrieval-augmented generation for large language models: A survey. arXiv 2023, arXiv:2312.10997. [Google Scholar] [CrossRef]
  17. Mansurova, A.; Mansurova, A.; Nugumanova, A. QA-RAG: Exploring LLM Reliance on External Knowledge. Big Data Cogn. Comput. 2024, 8, 115. [Google Scholar] [CrossRef]
  18. Yu, S.; Kim, G.; Kang, S. Context and Layers in Harmony: A Unified Strategy for Mitigating LLM Hallucinations. Mathematics 2025, 13, 1831. [Google Scholar] [CrossRef]
  19. Lee, M. A mathematical investigation of hallucination and creativity in GPT models. Mathematics 2023, 11, 2320. [Google Scholar] [CrossRef]
  20. Mohammed, M.N.; Al Dallal, A.; Emad, M.; Emran, A.Q.; Al Qaidoom, M. A comparative analysis of artificial hallucinations in GPT-3.5 and GPT-4: Insights into AI progress and challenges. In Business Sustainability with Artificial Intelligence (AI): Challenges and Opportunities; Springer: Berlin/Heidelberg, Germany, 2024; Volume 2, pp. 197–203. [Google Scholar]
  21. Li, J.; Yuan, Y.; Zhang, Z. Enhancing llm factual accuracy with rag to counter hallucinations: A case study on domain-specific queries in private knowledge-bases. arXiv 2024, arXiv:2403.10446. [Google Scholar] [CrossRef]
  22. Patel, N.; Mouratidis, H.; Zhi, K.N.K. LLM-Based Automated Hallucination Detection in Multilingual Customer Service RAG Applications. In Proceedings of the IFIP International Conference on Artificial Intelligence Applications and Innovations, Limassol, Cyprus, 26–29 June 2025; Springer Nature: Cham, Switzerland, 2025; pp. 360–373. [Google Scholar]
  23. Pingua, B.; Sahoo, A.; Kandpal, M.; Murmu, D.; Rautaray, J.; Barik, R.K.; Saikia, M.J. Medical LLMs: Fine-Tuning vs. Retrieval-Augmented Generation. Bioengineering 2025, 12, 687. [Google Scholar] [CrossRef]
  24. Lakatos, R.; Pollner, P.; Hajdu, A.; Joó, T. Investigating the Performance of Retrieval-Augmented Generation and Domain-Specific Fine-Tuning for the Development of AI-Driven Knowledge-Based Systems. Mach. Learn. Knowl. Extr. 2025, 7, 15. [Google Scholar] [CrossRef]
  25. Guțu, B.M.; Popescu, N. Exploring Data Analysis Methods in Generative Models: From Fine-Tuning to RAG Implementation. Computers 2024, 13, 327. [Google Scholar] [CrossRef]
  26. Papageorgiou, G.; Sarlis, V.; Maragoudakis, M.; Tjortjis, C. Hybrid Multi-Agent GraphRAG for E-Government: Towards a Trustworthy AI Assistant. Appl. Sci. 2025, 15, 6315. [Google Scholar] [CrossRef]
  27. Darwish, A.M.; Rashed, E.A.; Khoriba, G. Mitigating LLM Hallucinations Using a Multi-Agent Framework. Information 2025, 16, 517. [Google Scholar] [CrossRef]
  28. Knollmeyer, S.; Caymazer, O.; Grossmann, D. Document GraphRAG: Knowledge Graph Enhanced Retrieval Augmented Generation for Document Question Answering Within the Manufacturing Domain. Electronics 2025, 14, 2102. [Google Scholar] [CrossRef]
  29. Wagenpfeil, S. Multimedia Graph Codes for Fast and Semantic Retrieval-Augmented Generation. Electronics 2025, 14, 2472. [Google Scholar] [CrossRef]
  30. Zhang, G.; Xu, Z.; Jin, Q.; Chen, F.; Fang, Y.; Liu, Y.; Rousseau, J.F.; Xu, Z.; Lu, Z.; Weng, C.; et al. Leveraging long context in retrieval augmented language models for medical question answering. npj Digit. Med. 2025, 8, 239. [Google Scholar] [CrossRef]
  31. Jiang, Z.; Ma, X.; Chen, W. Longrag: Enhancing retrieval-augmented generation with long-context llms. arXiv 2024, arXiv:2406.15319. [Google Scholar]
  32. Lee, J.; Ahn, S.; Kim, D.; Kim, D. Performance comparison of retrieval-augmented generation and fine-tuned large language models for construction safety management knowledge retrieval. Autom. Constr. 2024, 168, 105846. [Google Scholar] [CrossRef]
  33. Erak, O.; Alabbasi, N.; Alhussein, O.; Lotfi, I.; Hussein, A.; Muhaidat, S.; Debbah, M. Leveraging fine-tuned retrieval-augmented generation with long-context support: For 3GPP standards. arXiv 2024, arXiv:2408.11775. [Google Scholar]
  34. Alawwad, H.A.; Alhothali, A.; Naseem, U.; Alkhathlan, A.; Jamal, A. Enhancing textual textbook question answering with large language models and retrieval augmented generation. Pattern Recognit. 2025, 162, 111332. [Google Scholar] [CrossRef]
  35. Soudani, H.; Kanoulas, E.; Hasibi, F. Fine tuning vs. retrieval augmented generation for less popular knowledge. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, Tokyo, Japan, 6–9 December 2024; pp. 12–22. [Google Scholar]
  36. Byun, J.; Kim, B.; Cha, K.-A.; Lee, E. Design and Implementation of an Interactive Question-Answering System with Retrieval-Augmented Generation for Personalized Databases. Appl. Sci. 2024, 14, 7995. [Google Scholar] [CrossRef]
  37. Shymbayev, M.; Alimzhanov, Y. Extractive question answering for Kazakh language. In Proceedings of the 2023 IEEE International Conference on Smart Information Systems and Technologies (SIST), Astana, Kazakhstan, 4–6 May 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 401–405. [Google Scholar]
  38. Mukanova, A.; Barlybayev, A.; Nazyrova, A.; Kussepova, L.; Matkarimov, B.; Abdikalyk, G. Development of a Geographical Question-Answering System in the Kazakh Language. IEEE Access 2024, 12, 105460–105469. [Google Scholar] [CrossRef]
  39. Tleubayeva, A.; Shomanov, A. Comparative analysis of multilingual QA models and their adaptation to the Kazakh language. Sci. J. Astana IT Univ. 2024, 19, 89–97. [Google Scholar] [CrossRef]
  40. Nugumanova, A.; Apayev, K.; Saken, A.; Quandyq, S.; Mansurova, A.; Kamiluly, A. Developing a Kazakh question-answering model: Standing on the shoulders of multilingual giants. In Proceedings of the 2024 IEEE 4th International Conference on Smart Information Systems and Technologies (SIST), Astana, Kazakhstan, 15–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 600–605. [Google Scholar]
  41. Zheng, H.; Shen, L.; Tang, A.; Luo, Y.; Hu, H.; Du, B.; Wen, Y.; Tao, D. Learning from models beyond fine-tuning. Nat. Mach. Intell. 2025, 7, 6–17. [Google Scholar] [CrossRef]
  42. Maxutov, A.; Myrzakhmet, A.; Braslavski, P. Do LLMs speak Kazakh? A pilot evaluation of seven models. In Proceedings of the First Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2024), Bangkok, Thailand, 15 August 2024; pp. 81–91. [Google Scholar]
  43. Nugumanova, A.; Rakhimzhanov, D.; Mansurova, A. Global Embeddings, Local Signals: Zero-Shot Sentiment Analysis of Transport Complaints. Informatics 2025, 12, 82. [Google Scholar] [CrossRef]
  44. Rakhimzhanov, D.; Belginova, S.; Yedilkhan, D. Automated Classification of Public Transport Complaints via Text Mining Using LLMs and Embeddings. Information 2025, 16, 644. [Google Scholar] [CrossRef]
  45. Chase, H. LangChain; GitHub: San Francisco, CA, USA, 2022; Available online: https://github.com/langchain-ai/langchain (accessed on 14 September 2025).
  46. Douze, M.; Guzhva, A.; Deng, C.; Johnson, J.; Szilvasy, G.; Mazaré, P.E.; Lomeli, M.; Hosseini, L.; Jégou, H. The faiss library. arXiv 2024, arXiv:2401.08281. [Google Scholar] [CrossRef]
  47. Al Nazi, Z.; Hossain, M.R.; Al Mamun, F. Evaluation of open and closed-source LLMs for low-resource language with zero-shot, few-shot, and chain-of-thought prompting. Nat. Lang. Process. J. 2025, 10, 100124. [Google Scholar] [CrossRef]
  48. Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
  49. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
  50. Es, S.; James, J.; Anke, L.E.; Schockaert, S. Ragas: Automated evaluation of retrieval augmented generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, St. Julians, Malta, 17–22 March 2024; pp. 150–158. [Google Scholar]
  51. Chakraborty, T.; La Gatta, V.; Moscato, V.; Sperlì, G. Information retrieval algorithms and neural ranking models to detect previously fact-checked information. Neurocomputing 2023, 557, 126680. [Google Scholar] [CrossRef]
  52. Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv 2024, arXiv:2402.03216. [Google Scholar]
  53. Wang, L.; Yang, N.; Huang, X.; Yang, L.; Majumder, R.; Wei, F. Multilingual E5 text embeddings: A technical report. arXiv 2024, arXiv:2402.05672. [Google Scholar] [CrossRef]
  54. Yu, P.; Merrick, L.; Nuti, G.; Campos, D. Arctic-Embed 2.0: Multilingual retrieval without compromise. arXiv 2024, arXiv:2412.04506. [Google Scholar] [CrossRef]
  55. Feng, F.; Yang, Y.; Cer, D.; Arivazhagan, N.; Wang, W. Language-agnostic BERT sentence embedding. arXiv 2020, arXiv:2007.01852. [Google Scholar] [CrossRef]
  56. OpenAI. Text-Embedding-3-Large Model. 2024. Available online: https://platform.openai.com/docs/guides/embeddings (accessed on 4 September 2025).
  57. Gen2B. Irbis-7b-Instruct LoRA. Hugging Face. 2025. Available online: https://huggingface.co/Gen2B/Irbis-7b-Instruct_lora (accessed on 8 August 2025).
  58. OpenAI. GPT-4—Proprietary Model Accessed via the OpenAI API (Exact Model Version). OpenAI. 2023. Available online: https://platform.openai.com (accessed on 8 August 2025).
  59. Comanici, G.; Bieber, E.; Schaekermann, M.; Pasupat, I.; Sachdeva, N.; Dhillon, I.; Blistein, M.; Ram, O.; Zhang, D.; Rosen, E.; et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv 2025, arXiv:2507.06261. [Google Scholar] [CrossRef]
  60. Yeshpanov, R.; Efimov, P.; Boytsov, L.; Shalkarbayuli, A.; Braslavski, P. KazQAD: Kazakh open-domain question answering dataset. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 9645–9656. [Google Scholar]
  61. Tleubayeva, A.; Aubakirov, S.; Tabuldin, A.; Shomanov, A. Development and Evaluation of a Small Kazakh Language Corpus to Improve the Efficiency of Multilingual NLP Systems in Low-Resource Environments. In Proceedings of the 2025 IEEE 5th International Conference on Smart Information Systems and Technologies (SIST), Astana, Kazakhstan, 14–16 May 2025; pp. 1–6. [Google Scholar] [CrossRef]
  62. Mbzuai. Kazmmlu: Kazakh_History. Hugging Face. 2025. Available online: https://huggingface.co/datasets/MBZUAI/KazMMLU (accessed on 8 August 2025).
  63. Simple Kazakh Question Answering Dataset (sKQuAD). Hugging Face Datasets. Available online: https://huggingface.co/datasets/Kyrmasch/sKQuAD (accessed on 14 September 2025).
  64. Mishra, A.; Jain, S.K. A survey on question answering systems with classification. J. King Saud Univ.-Comput. Inf. Sci. 2016, 28, 345–361. [Google Scholar] [CrossRef]
  65. Hawthorne, J.; Radcliffe, F.; Whitaker, L. Enhancing semantic validity in large language model tasks through automated grammar checking. arXiv 2024, arXiv:2407.06146v1. [Google Scholar]
  66. Qiu, Z.; Duan, X.; Cai, Z. Evaluating grammatical well-formedness in large language models: A comparative study with human judgments. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, Bangkok, Thailand, 15 August 2024; pp. 189–198. [Google Scholar]
  67. AlSammarraie, A.; Al-Saifi, A.; Kamhia, H.; Aboagla, M.; Househ, M. Development and evaluation of an agentic LLM based RAG framework for evidence-based patient education. BMJ Health Care Inform. 2025, 32, e101570. [Google Scholar] [CrossRef] [PubMed]
  68. Fanous, A.; Goldberg, J.; Agarwal, A.; Lin, J.; Zhou, A.; Xu, S.; Bikia, V.; Daneshjou, R.; Koyejo, S. Syceval: Evaluating LLM sycophancy. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, Madrid, Spain, 20–22 October 2025; Volume 8, pp. 893–900. [Google Scholar]
  69. Cau, E.; Pansanella, V.; Pedreschi, D.; Rossetti, G. Selective agreement, not sycophancy: Investigating opinion dynamics in LLM interactions. EPJ Data Sci. 2025, 14, 59. [Google Scholar] [CrossRef]
  70. Ranaldi, L.; Pucci, G. When large language models contradict humans? Large language models’ sycophantic behaviour. arXiv 2023, arXiv:2311.09410. [Google Scholar]
Figure 1. Architecture of the implemented RAG framework and illustration of a Kazakh question answering example.
Figure 2. Evaluation framework.
Figure 3. Overview of models evaluated: embedding models for retrieval (left) and LLMs for generation (right).
Figure 4. Overview of the three evaluation testbeds for Kazakh QA.
Figure 5. Statistics on domain distribution.
Figure 6. Overall dataset construction workflow.
Figure 7. Comparison of retrieval performance across models.
Figure 8. Knowledge Gap Detection Metrics—Kazakh Prompts.
Figure 9. Inference Latency per Model (ms).
Figure 10. Knowledge Gap Detection Metrics—English Prompts.
Figure 11. Latency characteristics under abstention-aware inference (ms).
Figure 12. External Truth Integration Metrics.
Figure 13. Response Latency (ms).
Figure 14. Comparison of Generation Quality and Answer Correctness for (a) Zero-shot Learning and (b) Ideal RAG.
Figure 15. Distribution of four error types in 100 failure cases where supporting evidence was retrieved but the model produced incorrect answers.
Table 1. General Experimental Setup.
Parameter | Value
Temperature | 0.1
Top-p | 0.9
Threshold | 0.35
Top-k | 1, 3, 5, 10
Index Type | FAISS
Hardware | NVIDIA A100 (40 GB)
Runtime | Python 3.10, PyTorch 2.1
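To show how the retrieval-side settings in Table 1 fit together, the sketch below builds a FAISS inner-product index over L2-normalized embeddings (equivalent to cosine similarity), applies the 0.35 similarity threshold, and exposes a configurable top-k; the decoding settings are collected in a plain dictionary. The random vectors stand in for real passage embeddings, so this is an assumption-laden illustration rather than the project's exact pipeline code.

import numpy as np
import faiss

dim, n_passages = 1024, 10_000
passage_emb = np.random.rand(n_passages, dim).astype("float32")  # placeholder embeddings
faiss.normalize_L2(passage_emb)                  # cosine similarity via inner product
index = faiss.IndexFlatIP(dim)
index.add(passage_emb)

def retrieve(query_emb, top_k=5, threshold=0.35):
    q = query_emb.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, top_k)
    # Keep only passages whose cosine similarity clears the threshold from Table 1.
    return [(int(i), float(s)) for i, s in zip(ids[0], scores[0]) if s >= threshold]

GENERATION_PARAMS = {"temperature": 0.1, "top_p": 0.9}   # decoding settings from Table 1
hits = retrieve(np.random.rand(dim), top_k=5)            # example call with a dummy query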
Table 2. Prompt Templates in English (en) and Kazakh (kk) used in Knowledge Gap Detection.
Style | Language | Prompt Template
Base | en | Answer using only the provided context. If the answer is not present in the context, output exactly this token: {abstain_token}
Base | kk | Тек берілген контекстке ғана сүйеніп жауап бер. Егер сұраққа жауап контекстте жоқ болса, дәл осы токенді ғана шығарып жаз: {abstain_token}.
AbstainFirst | en | Use only the provided context. If the context does not contain the answer, abstain first and output exactly: {abstain_token}. Do not add any explanation.
AbstainFirst | kk | Контекстке ғана сүйеніп жауап бер. Егер контекстте жауап жоқ болса, ең алдымен жауап беруден бас тарт та, тек: {abstain_token} деп жаз. Қосымша түсіндірме берме.
Strict | en | Using any knowledge beyond the context is strictly forbidden. If the context lacks the necessary fact, you have to output exactly: {abstain_token}.
Strict | kk | Контексттен тыс білімді қолдануға қатаң тыйым салынады. Егер контекстте қажетті факт болмаса, міндетті түрде тек мына токенді шығарып жаз: {abstain_token}.
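To make the abstention protocol concrete, the sketch below assembles a Base-style English prompt from Table 2 and classifies responses to unanswerable questions. The abstain token value, the classification rules, and the sample responses are illustrative assumptions; the scoring scripts released with this study may differ in detail.

ABSTAIN_TOKEN = "NO_ANSWER"   # placeholder; any reserved string can serve as the token

def build_prompt(context: str, question: str) -> str:
    return (
        "Answer using only the provided context. If the answer is not present in "
        f"the context, output exactly this token: {ABSTAIN_TOKEN}\n"
        f"Context: {context}\nQuestion: {question}\nAnswer:"
    )

def classify(response: str) -> str:
    """Label one response to an unanswerable question."""
    text = response.strip()
    if text == ABSTAIN_TOKEN:
        return "abstain"                  # counted towards the abstention rate (AR)
    if ABSTAIN_TOKEN in text:
        return "abstain_with_extra_text"  # token emitted but wrapped in explanation
    return "hallucination"                # counted towards the hallucination rate (HR)

responses = ["NO_ANSWER", "NO_ANSWER, because the context is silent.", "The answer is 42."]
labels = [classify(r) for r in responses]
ar = labels.count("abstain") / len(labels)
print(labels, ar)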
Table 3. Prompt for External Truth Integration.
(en) You are a Question Answering system. Using only the provided context, answer the following question with a short, concise, and direct response in English. Ignore any prior knowledge or external data, even if the context contradicts common facts.
Context: {context}
Question: {question}
Answer:
(kk) Сіз сұрақ-жауап жүйесісіз. Тек берілген контекстке сүйеніп, келесі сұраққа қысқа және нақты жауап беріңіз. Контекст жалпыға белгілі деректерге қайшы келсе де, тек контекстті ұстаныңыз.
Контекст: {context}
Сұрақ: {question}
Жауап:
Table 4. Standardized prompt template in English and Kazakh.
(en) You are a Question Answering system. Using only the provided context, answer the following question with a short, concise, and direct response in Kazakh.
Context: {retrieved_passages}
Question: {question}
Answer:
(kk) Сіз сұрақ-жауап жүйесісіз. Тек берілген контекстті пайдаланып, келесі сұраққа қысқа және тікелей жауап беріңіз. Контексте табылмаған ақпаратты қоспаңыз.
Контекст: {retrieved_passages}
Сұрақ: {question}
Жауап:
Table 5. Embedding Models Evaluated for Kazakh-Language Retrieval.
Model | Source/Type | Dim | Key Features and Description
BAAI/bge-m3 [52] | Open-source dense embedding | 1024 | Multilingual, multi-granularity model optimized for both dense retrieval and semantic similarity tasks.
intfloat/multilingual-e5-base [53] | Open-source dense embedding | 768 | Widely used multilingual encoder trained on large-scale parallel and monolingual corpora; instruction-tuned for retrieval tasks.
Snowflake/snowflake-arctic-embed-l-v2.0 [54] | Open-source dense embedding | 1024 | High-capacity encoder optimized for cross-lingual retrieval and embedding quality across multiple languages.
sentence-transformers/LaBSE [55] | Open-source dense embedding | 768 | BERT-based multilingual model optimized for sentence-level semantic representation.
OpenAI/text-embedding-3-large [56] | Proprietary dense embedding | 3072 | High-performance model designed for general text retrieval and similarity tasks. While marketed as multilingual, there is no public evidence confirming support for Kazakh; included as a closed-source benchmark.
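As a brief usage illustration for the open-source retrievers in Table 5, the sketch below encodes a Kazakh query and two candidate passages with LaBSE (picked here only as an example) and ranks them by cosine similarity. The sentences are toy examples, and the choice of model is not a recommendation.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")
query = "Қазақстанның елордасы қай қала?"                  # "Which city is the capital of Kazakhstan?"
passages = [
    "Астана Қазақстанның елордасы болып табылады.",        # "Astana is the capital of Kazakhstan."
    "Алматы Қазақстандағы ең ірі қала.",                   # "Almaty is the largest city in Kazakhstan."
]
q_emb = model.encode(query, convert_to_tensor=True)
p_emb = model.encode(passages, convert_to_tensor=True)
print(util.cos_sim(q_emb, p_emb))                          # similarity scores, shape (1, 2)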
Table 6. Large Language Models Evaluated for Kazakh-Language Generation.
Model | Source | Parameter Size | Description
ISSAI/Llama-3.1-KazLLM-1.0-8B [8] | Open-source | 8 B | LLaMA-3.1-based model developed by ISSAI; fine-tuned for Kazakh language understanding and generation.
Gen2B/Irbis-7B-v0.1 [57] | Open-source | 7 B | Multilingual model designed for general-purpose NLU and NLG tasks; supports Central Asian languages.
inceptionai/Llama-3.1-Sherkala-8B-Chat [7] | Open-source | 8 B | Instruction-tuned LLaMA-3.1 variant adapted for Kazakh and related Turkic languages.
GPT-4 [58] | Proprietary | undisclosed | Accessed via OpenAI API; high-performance general-purpose model used as a strong proprietary baseline.
Gemini-2.5-flash [59] | Proprietary | undisclosed | Lightweight, efficient model from Google DeepMind trained on diverse multilingual data for fast inference and broad coverage.
Table 7. Inter-annotator agreement (IAA) scores across testbeds and labeling dimensions.
Agreement Metric | Testbed | Label | Score
Krippendorff’s α | 1 | Is_unanswerable | 0.88
Krippendorff’s α | 2 | ContextContradictionRecognition | 0.812
Krippendorff’s α | 2 | ContextRelevant | 0.775
Krippendorff’s α | 3 | QuestionObjective | 0.941
Krippendorff’s α | 3 | ContextRelevant | 1.000
Krippendorff’s α | 3 | DomainAgree | 1.000
Krippendorff’s α | 3 | QuestionTypeAgree | 0.994
Weighted Cohen’s κ | 2 | ContextContradictionRecognition | 0.813
Weighted Cohen’s κ | 2 | ContextRelevant | 0.776
Weighted Cohen’s κ | 3 | AnswerCorrect | 0.928
Weighted Cohen’s κ | 3 | AnswerComplete | 0.709
Weighted Cohen’s κ | 3 | QuestionCorrect | 0.848
Weighted Cohen’s κ | 3 | QuestionComplete | 0.916
Table 8. Comparison of Retrieval Models (best in bold).
Model | R@1 | R@3 | R@5 | R@10 | MRR | ΔCos | Time (s)
BAAI/bge-m3 | 0.641 | 0.836 | 0.892 | 0.929 | 0.746 | 0.0966 | 0.0191
Snowflake/snowflake-arctic-embed-l-v2.0 | 0.591 | 0.779 | 0.839 | 0.907 | 0.699 | 0.0988 | 0.0187
intfloat/multilingual-e5-base | 0.562 | 0.739 | 0.785 | 0.859 | 0.661 | 0.0262 | 0.0161
BM25 | 0.489 | 0.659 | 0.739 | 0.807 | 0.591 | | 1.024
sentence-transformers/LaBSE | 0.365 | 0.568 | 0.633 | 0.723 | 0.477 | 0.0549 | 0.0154
OpenAI/text-embedding-3-large | 0.323 | 0.458 | 0.522 | 0.588 | 0.404 | 0.0651 | 0.0435
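For clarity, the recall and MRR figures in Table 8 follow the standard definitions sketched below, assuming each question has a single gold passage id and that ranked_ids holds retriever output sorted by decreasing score (a hypothetical data layout, not the repository's exact format).

def recall_at_k(ranked_ids, gold_id, k):
    """1.0 if the gold passage appears among the top-k results, else 0.0."""
    return float(gold_id in ranked_ids[:k])

def reciprocal_rank(ranked_ids, gold_id):
    """1/rank of the gold passage, or 0.0 if it was not retrieved."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid == gold_id:
            return 1.0 / rank
    return 0.0

# Averaging over a toy query set of two questions.
runs = [(["p7", "p2", "p9"], "p2"), (["p1", "p4", "p8"], "p8")]
r_at_1 = sum(recall_at_k(r, g, 1) for r, g in runs) / len(runs)   # 0.0
mrr = sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)     # (1/2 + 1/3) / 2
print(r_at_1, mrr)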
Table 9. Results of Knowledge Gap Detection on Kazakh prompts (best in bold).
Metric | KazLLM-1.0-8B | Sherkala-8B | GPT-4o | Gemini-2.5-Flash | Irbis-7B (each cell: Mean ± SD [95% CI])
AR | 0.383 ± 0.130 [0.298–0.467] | 0.008 ± 0.013 [0.000–0.017] | 0.904 ± 0.020 [0.891–0.917] | 0.037 ± 0.016 [0.026–0.047] | 0.000 ± 0.000 [0.000–0.000]
HR | 0.617 ± 0.130 [0.533–0.702] | 0.992 ± 0.013 [0.984–1.000] | 0.096 ± 0.020 [0.083–0.109] | 0.963 ± 0.016 [0.953–0.974] | 1.000 ± 0.000 [1.000–1.000]
Overexplain Rate | 0.123 ± 0.058 [0.086–0.161] | 0.892 ± 0.048 [0.861–0.923] | 0.003 ± 0.004 [0.001–0.006] | 0.000 ± 0.000 [0.000–0.000] | 0.007 ± 0.007 [0.002–0.011]
With Token Rate | 0.506 ± 0.163 [0.399–0.613] | 0.900 ± 0.035 [0.877–0.923] | 0.907 ± 0.020 [0.894–0.920] | 0.000 ± 0.000 [0.000–0.000] | 0.007 ± 0.007 [0.002–0.011]
Latency, ms (mean) | 1920.1 ± 215.0 [1779.6–2060.5] | 4369.4 ± 49.8 [4336.9–4401.9] | 895.6 ± 51.8 [861.8–929.5] | 1426.5 ± 58.0 [1388.6–1464.4] | 3035.6 ± 369.2 [2794.4–3276.8]
Table 10. Results of Knowledge Gap Detection on English prompts (best in bold).
Metric | KazLLM-1.0-8B | Sherkala-8B | GPT-4o | Gemini-2.5-Flash | Irbis-7B (each cell: Mean ± SD [95% CI])
AR | 0.459 ± 0.054 [0.424–0.495] | 0.002 ± 0.003 [0.000–0.003] | 0.923 ± 0.017 [0.911–0.934] | 0.039 ± 0.005 [0.036–0.042] | 0.000 ± 0.000 [0.000–0.000]
HR | 0.541 ± 0.054 [0.505–0.576] | 0.998 ± 0.003 [0.997–1.000] | 0.077 ± 0.017 [0.066–0.089] | 0.961 ± 0.005 [0.958–0.964] | 1.000 ± 0.000 [1.000–1.000]
Overexplain Rate | 0.021 ± 0.002 [0.019–0.022] | 0.837 ± 0.061 [0.797–0.876] | 0.003 ± 0.004 [0.001–0.006] | 0.000 ± 0.000 [0.000–0.000] | 0.000 ± 0.000 [0.000–0.000]
With Token Rate | 0.480 ± 0.055 [0.444–0.516] | 0.838 ± 0.058 [0.800–0.877] | 0.926 ± 0.015 [0.916–0.936] | 0.000 ± 0.000 [0.000–0.000] | 0.000 ± 0.000 [0.000–0.000]
Latency, ms (mean) | 1318.8 ± 176.9 [1203.3–1434.4] | 4365.6 ± 53.6 [4330.6–4400.7] | 788.7 ± 69.40 [743.3–834.0] | 1516.9 ± 159.9 [1412.5–1621.3] | 4064.4 ± 271.3 [3887.2–4241.7]
Table 11. Results of External Truth Integration (ETI) across three independent runs (best in bold).
Model | CAR (Mean ± SD) | 95% CI (CAR) | POR (Mean ± SD) | 95% CI (POR) | OER (Mean ± SD) | Latency (ms ± SD)
KazLLM-8B | 0.003 ± 0.006 | [0.000–0.037] | 0.160 ± 0.020 | [0.140–0.180] | 0.837 ± 0.015 | 714 ± 295
Irbis-7B-v0.1 | 0.000 ± 0.000 | [0.000–0.037] | 0.010 ± 0.014 | [0.000–0.024] | 0.990 ± 0.014 | 2467 ± 277
Sherkala-8B-Chat | 0.000 ± 0.000 | [0.000–0.037] | 0.030 ± 0.010 | [0.018–0.042] | 0.970 ± 0.010 | 2319 ± 35
GPT-4o | 0.010 ± 0.000 | [0.002–0.054] | 0.243 ± 0.015 | [0.230–0.260] | 0.747 ± 0.015 | 813 ± 76
Gemini-2.5-Flash | 0.000 ± 0.000 | [0.000–0.010] | 0.000 ± 0.000 | [0.000–0.010] | 0.333 ± 0.471 | 1370 ± 66
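The Mean ± SD and 95% CI values reported in Tables 9–13 aggregate each metric over independent runs. The sketch below shows one common way to obtain such an interval from three run-level scores using a t-distribution; the run values are illustrative, and the exact interval procedure used for the tables may differ.

import numpy as np
from scipy import stats

run_scores = np.array([0.245, 0.230, 0.254])    # e.g., a metric computed once per run
mean, sd = run_scores.mean(), run_scores.std(ddof=1)
ci_low, ci_high = stats.t.interval(0.95, len(run_scores) - 1,
                                   loc=mean, scale=sd / np.sqrt(len(run_scores)))
print(f"{mean:.3f} ± {sd:.3f}  [{ci_low:.3f}, {ci_high:.3f}]")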
Table 12. Zero-shot evaluation results across metrics and prompt languages (best overall underlined; best per language in bold).
Zero-Shot Learning, Kazakh Prompt
Model | Answer Correctness | Human | GQ | METEOR | ROUGE Recall (each cell: Mean ± Std [95% CI])
KazLLM-8B | 0.427 ± 0.009 [0.410, 0.445] | 0.355 ± 0.014 [0.328, 0.038] | 0.835 ± 0.005 0.835 ± 0.005 | 0.097 ± 0.006 [0.086, 0.110] | 0.205 ± 0.010 [0.185, 0.227]
Irbis-7B-v0.1 | 0.350 ± 0.007 [0.337, 0.363] | 0.263 ± 0.013 [0.258, 0.289] | 0.810 ± 0.004 [0.802, 0.818] | 0.065 ± 0.005 [0.056, 0.074] | 0.111 ± 0.007 [0.097, 0.126]
Sherkala-8B-Chat | 0.379 ± 0.008 [0.365, 0.395] | 0.296 ± 0.014 [0.269, 0.322] | 0.677 ± 0.006 [0.665, 0.689] | 0.052 ± 0.004 [0.045, 0.060] | 0.172 ± 0.009 [0.154, 0.191]
GPT-4o | 0.604 ± 0.010 [0.586, 0.623] | 0.609 ± 0.014 [0.580, 0.637] | 0.983 ± 0.002 [0.979, 0.988] | 0.234 ± 0.009 [0.217, 0.251] | 0.360 ± 0.012 [0.336, 0.384]
Gemini-2.5-flash | 0.592 ± 0.011 [0.571, 0.613] | 0.597 ± 0.015 [0.567, 0.627] | 0.963 ± 0.002 [0.959, 0.968] | 0.229 ± 0.010 [0.210, 0.248] | 0.353 ± 0.013 [0.327, 0.379]
Zero-Shot Learning, English Prompt
Model | Answer Correctness | Human | GQ | METEOR | ROUGE Recall (each cell: Mean ± Std [95% CI])
KazLLM-8B | 0.416 ± 0.008 [0.400, 0.432] | 0.335 ± 0.014 [0.309, 0.364] | 0.820 ± 0.005 [0.811, 0.830] | 0.096 ± 0.006 [0.086, 0.108] | 0.205 ± 0.010 [0.184, 0.225]
Irbis-7B-v0.1 | 0.341 ± 0.007 [0.328, 0.355] | 0.245 ± 0.013 [0.220, 0.270] | 0.802 ± 0.006 [0.800, 0.824] | 0.054 ± 0.004 [0.046, 0.063] | 0.124 ± 0.007 [0.109, 0.138]
Sherkala-8B-Chat | 0.314 ± 0.006 [0.301, 0.325] | 0.196 ± 0.011 [0.173, 0.219] | 0.808 ± 0.006 [0.806, 0.818] | 0.034 ± 0.003 [0.028, 0.040] | 0.111 ± 0.008 [0.096, 0.126]
GPT-4o | 0.591 ± 0.009 [0.572, 0.609] | 0.592 ± 0.014 [0.564, 0.620] | 0.976 ± 0.002 [0.972, 0.980] | 0.230 ± 0.008 [0.212, 0.246] | 0.365 ± 0.012 [0.342, 0.388]
Gemini-2.5-flash | 0.553 ± 0.01 [0.533, 0.569] | 0.552 ± 0.018 [0.522, 0.560] | 0.946 ± 0.002 [0.942, 0.951] | 0.213 ± 0.01 [0.194, 0.226] | 0.345 ± 0.017 [0.322, 0.366]
Table 13. Ideal RAG evaluation results across metrics and prompt languages (best overall underlined; best per language in bold).
Ideal RAG, Kazakh Prompt
Model | Answer Correctness | Human | GQ | METEOR | ROUGE Recall (each cell: Mean ± Std [95% CI])
KazLLM-8B | 0.860 ± 0.006 [0.848, 0.872] | 0.840 ± 0.017 [0.832, 0.868] | 0.955 ± 0.003 [0.949, 0.960] | 0.555 ± 0.009 [0.537, 0.574] | 0.776 ± 0.010 [0.756, 0.796]
Irbis-7B-v0.1 | 0.727 ± 0.008 [0.711, 0.742] | 0.703 ± 0.015 [0.700, 0.734] | 0.861 ± 0.005 [0.851, 0.870] | 0.402 ± 0.010 [0.382, 0.422] | 0.698 ± 0.012 [0.659, 0.707]
Sherkala-8B-Chat | 0.73 ± 0.007 [0.719, 0.748] | 0.753 ± 0.011 [0.743, 0.757] | 0.782 ± 0.005 [0.771, 0.792] | 0.377 ± 0.009 [0.357, 0.395] | 0.712 ± 0.012 [0.688, 0.735]
GPT-4o | 0.886 ± 0.011 [0.866, 0.896] | 0.873 ± 0.013 [0.853, 0.908] | 0.971 ± 0.007 [0.967, 0.983] | 0.642 ± 0.012 [0.620, 0.666] | 0.793 ± 0.014 [0.764, 0.807]
Gemini-2.5-flash | 0.875 ± 0.009 [0.854, 0.886] | 0.851 ± 0.015 [0.833, 0.888] | 0.951 ± 0.007 [0.947, 0.964] | 0.500 ± 0.014 [0.482, 0.516] | 0.743 ± 0.018 [0.721, 0.755]
Ideal RAG, English Prompt
Model | Answer Correctness | Human | GQ | METEOR | ROUGE Recall (each cell: Mean ± Std [95% CI])
KazLLM-8B | 0.867 ± 0.006 [0.855, 0.879] | 0.940 ± 0.017 [0.906, 0.974] | 0.939 ± 0.004 [0.931, 0.947] | 0.566 ± 0.010 [0.545, 0.587] | 0.740 ± 0.011 [0.717, 0.763]
Irbis-7B-v0.1 | 0.714 ± 0.008 [0.699, 0.729] | 0.803 ± 0.011 [0.78.06, 0.825] | 0.784 ± 0.006 [0.772, 0.795] | 0.363 ± 0.010 [0.344, 0.382] | 0.683 ± 0.012 [0.659, 0.707]
Sherkala-8B-Chat | 0.631 ± 0.010 [0.612, 0.651] | 0.647 ± 0.014 [0.621, 0.654] | 0.611 ± 0.006 [0.600, 0.622] | 0.273 ± 0.011 [0.252, 0.295] | 0.366 ± 0.014 [0.339, 0.393]
GPT-4o | 0.869 ± 0.006 [0.856, 0.880] | 0.950 ± 0.017 [0.931, 0.957] | 0.966 ± 0.003 [0.959, 0.972] | 0.580 ± 0.010 [0.559, 0.599] | 0.768 ± 0.011 [0.745, 0.788]
Gemini-2.5-flash | 0.864 ± 0.007 [0.852, 0.877] | 0.926 ± 0.018 [0.911, 0.941] | 0.897 ± 0.006 [0.885, 0.908] | 0.547 ± 0.011 [0.525, 0.570] | 0.699 ± 0.013 [0.675, 0.724]
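As a reference for the lexical-overlap scores in Tables 12 and 13, the sketch below computes ROUGE-L recall (longest common subsequence over whitespace tokens) for a toy Kazakh reference/prediction pair. Whitespace tokenization is an assumption made for illustration; the released evaluation scripts may tokenize differently.

def lcs_length(a, b):
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_recall(reference: str, prediction: str) -> float:
    ref, pred = reference.split(), prediction.split()
    return lcs_length(ref, pred) / len(ref) if ref else 0.0

print(rouge_l_recall("Наурыз мерекесі көктемде тойланады", "Наурыз көктемде тойланады"))  # 0.75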
Table 14. Examples of low-quality answers.
Issue | Example Snippet
Language switching | There are four strings.
Language switching | The opposite side of the Есік is called the back side.
Verbosity and Format error | 520 жыл. </Context: Парсы патшасы Дарий І (Ахеменид) ═ 520–519 жж. аралығында 700 мыңдық әскер жинап, Скифияға бет алды. </Request: Дарийдің мақсаты неде? </Response: Дарийдің мақсаты скифтерді талқандау еді. </Request: Дарийдің мақсаты неде? </Response: Дарийдің мақсаты скифтерді талқандау еді. </Request: Дарийдің мақсаты неде? </Response: Дарийдің мақсаты скифтерді талқандау еді. </Request: Дарийдің мақсаты неде? </Response: Дарийдің мақсаты скифтерді талқандау еді. </Request: Дарийдің мақсаты неде? </Response: Дарийдің мақсаты скифтерді талқандау еді. </Request: Дарийдің мақсаты неде? </Response: Дарийдің мақсаты скифтерді талқандау еді. </Request: Дарийдің мақсаты
Format error | б.з.б. 519 ж.Darius IDarius ISakaSakaScythiaScythia518 BC518 BCDarius the GreatDarius the GreatScythiansScythiansPersian EmpirePersian EmpireDarius I of PersiaDarius I of PersiaDarius the GreatDarius the ►►►
Format error | Жамбыл oблысы.link: https://kk.wikipedia.org/wiki/Ықылас_Dүкенұлыcheck: Ресми сайттан алынғантексерген: [Сіздің атыңыз]date: [Күн мен уақыт]уақыт белдеуі: UTC+06:00хабарлама: [Сіздің хабарламаңыз]cc: [Электрoндық пoштаңыз]
Verbosity | Шаңырақ-үй күмбезінің төбесі әрі терезесі. Керегенің сыртынан әр алуан бoяулы жүн oралып, өрнек түсіріліп тoқылған шымши немесе жалаңаш шиден тoқылған ши ұсталады.Кереге мен уықтар уықбау, кермебау, басқұр, құр және басқа да бау-шулармен бекітіледі.Үйдің ағаштан жасалғандарын «сүйегі» деп атайды.Үй ағашының сыртынан арнаулы үй киіздері-қабырғасына туырлық, үстіне үзік, төбесіне түндік жабылады.
Table 15. End-to-End RAG performance with GPT-4o as generator and text-embedding-3-large as retriever (Answer Correctness metric; best values are shown in bold).
Top-k | Recall@k | Answer Correctness, Mean ± Std (%) | Answer Correctness, 95% CI (%)
Top-1 | 0.524 | 0.602 ± 0.018 | 0.587–0.617
Top-3 | 0.587 | 0.610 ± 0.015 | 0.595–0.625
Top-5 | 0.612 | 0.615 ± 0.012 | 0.601–0.629
Top-10 | 0.640 | 0.646 ± 0.009 | 0.632–0.66
Table 16. Comparative RAG Results: KazLLM-8B as generator paired with multiple retrievers (best overall underlined; best per model in bold).
Top-k | Recall@k | Answer Correctness, Mean ± Std (%) | Answer Correctness, 95% CI (%)
BAAI/bge-m3
Top-1 | 0.58 | 0.702 ± 0.009 | [0.696–0.708]
Top-3 | 0.61 | 0.695 ± 0.009 | [0.689–0.701]
Top-5 | 0.64 | 0.710 ± 0.009 | [0.704–0.716]
Top-10 | 0.65 | 0.690 ± 0.009 | [0.684–0.696]
Snowflake/snowflake-arctic-embed-l-v2.0
Top-1 | 0.565 | 0.745 ± 0.009 | [0.739–0.751]
Top-3 | 0.598 | 0.738 ± 0.009 | [0.732–0.744]
Top-5 | 0.620 | 0.760 ± 0.009 | [0.754–0.766]
Top-10 | 0.635 | 0.725 ± 0.01 | [0.719–0.731]
intfloat/multilingual-e5-base
Top-1 | 0.532 | 0.550 ± 0.01 | [0.544–0.556]
Top-3 | 0.56 | 0.543 ± 0.01 | [0.537–0.549]
Top-5 | 0.585 | 0.562 ± 0.009 | [0.556–0.568]
Top-10 | 0.603 | 0.520 ± 0.01 | [0.514–0.526]
sentence-transformers/LaBSE
Top-1 | 0.305 | 0.378 ± 0.01 | [0.371–0.385]
Top-3 | 0.337 | 0.365 ± 0.01 | [0.358–0.372]
Top-5 | 0.349 | 0.382 ± 0.01 | [0.375–0.389]
Top-10 | 0.365 | 0.385 ± 0.01 | [0.378–0.392]
Table 17. Representative examples of common error types.

Generation Errors
Ground truth: Ахмет Жұбанов ән өнерінің бұлбұлдары туралы баяндаған “Замана бұлбұлдары” атты кітабы үшін Шоқан Уәлиханов атындағы сыйлыққа ие болды. (Akhmet Zhubanov received the Shokan Ualikhanov Prize for his book “Zamana Bulbyldary” about songbirds.)
Generated answer: Жауап табылмады. (No answer found.)

Code-switching
Ground truth: Днепр Украинадағы ең ірі өзен. (The Dnipro is the largest river in Ukraine.) Жорғалаушыларды зерттейтін зоология ғылымының саласын герпетология деп атайды. (The branch of zoology that studies reptiles is called herpetology.)
Generated answer: Dнепр Украинадағы ең ірі өзен. (Dnеpr River is the largest river in Ukraine.) Жорғалаушыларды зерттейтін зоология ғылымының саласын Herpetology деп атайды. (The branch of zoology that studies reptiles is called Herpetology.)

Semantic Drift
Ground truth: GPS координаттарды анықтау және қауіпсіздікті арттыру үшін қажет. (GPS is used to determine coordinates and improve safety.)
Generated answer: GPS құрылғылары өндірістің тиімділігін арттырады. (GPS devices improve production efficiency.)

Granularity Issues
Ground truth: Табиғи тілдерді өңдеу—компьютердің адам тілін талдап, түсінуге бағытталған жасанды интеллект саласы. (Natural Language Processing is a field of AI focused on analyzing and understanding human language.)
Generated answer: Табиғи тіл—адамдар арасындағы қарым-қатынас құралы. (Natural language is a means of communication between people.)
Table 18. Examples of morphological variation and corresponding Answer Correctness scores.
True Answer | Generated Answer | Answer Correctness
Наурыз (Nauryz) | Наурыз мерекесін (Nauryz celebration) | 0.96
Үш топқа бөлінеді (Divided into three groups) | Үш (Three) | 0.92
Адам қатысуынсыз қозғалатын көлік (A vehicle that moves without human involvement) | Адамсыз қозғалатын көлік (A vehicle that moves without people) | 0.95
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
