Web Search-Enhanced Small Language Models: A Case Study for a Kazakh-Centric Language Model

Maxutov, Akylbek; Medeu, Nūrali; Varol, Huseyin Atakan

doi:10.3390/make8050128


by Akylbek Maxutov ^1,2,*, Nūrali Medeu ^1 and Huseyin Atakan Varol ^1,2

^1 Institute of Smart Systems and Artificial Intelligence (ISSAI), Nazarbayev University, Astana 010000, Kazakhstan

^2 Department of AI & Big Data, Faculty of Information Technologies and Artificial Intelligence, Al-Farabi Kazakh National University, Almaty 050040, Kazakhstan

^* Author to whom correspondence should be addressed.

Mach. Learn. Knowl. Extr. 2026, 8(5), 128; https://doi.org/10.3390/make8050128

Submission received: 6 March 2026 / Revised: 24 April 2026 / Accepted: 8 May 2026 / Published: 12 May 2026


Abstract

Small language models (SLMs) are valued for their computational efficiency and suitability for edge deployment, but often underperform in localized linguistic and cultural contexts due to their limited parameter count. This study explores integrating real-time web search into Qolda, a 4B-parameter Kazakh-centric SLM, to close the performance gap with larger models. We assess two strategies: Naïve Retrieval-Augmented Generation (RAG), which uses raw benchmark questions as search queries, and Query-Refined RAG, which applies various refiner models, including a supervised distillation-tuned Qolda, to optimize queries. On the KazCulture and KazMMLU benchmarks, the Naïve RAG approach in reasoning-enabled mode achieved an average accuracy of 76.00%, improving on the Zero-Shot evaluation result of 60.37%, and, in this system-level comparison, exceeding the Zero-Shot accuracy of larger open-source models such as Qwen3-32B (64.72%) and Gemma-3-27b-it (60.24%), which were evaluated without retrieval augmentation. Query refinement improved the accuracy by about 3.5 percentage points, from 76.00% to 79.46%, but nearly doubled the computational cost. Inference time analysis shows that Naïve RAG adds approximately 1 s of retrieval latency per question, while query refiners introduce up to 4 s of additional overhead. However, the retrieved context reduces the time required for model reasoning in think mode. The most notable gains were observed in localized cultural knowledge, where web search integration correctly answered 32.9% of KazCulture questions that the Zero-Shot baseline failed on, while losing only 16.9% in return. These results suggest that retrieval-augmented SLMs can offer a cost-effective and competitive alternative to larger models for tasks in the domains of Kazakh language and Kazakh culture.

Keywords: small language models; retrieval-augmented generation; web search; benchmarking


1. Introduction

Natural Language Processing (NLP) is moving toward small language models (SLMs) to reduce computational demands and enable deployment on edge devices such as mobile phones and laptops [1,2,3]. Although the definition of “small” is evolving, models with fewer than 10 billion (10B) parameters are typically considered SLMs [4]. Alongside computational and memory efficiency, another motivation to adopt these models is their application in Agentic Artificial Intelligence (Agentic AI) tasks [4], where low latency is essential for autonomous workflows. Despite their smaller size, these models must maintain performance comparable to that of larger models. One way to achieve this is to integrate Retrieval-Augmented Generation (RAG). SLMs equipped with RAG can match or, in some domain-specific cases, even surpass the performance of Large Language Models (LLMs) [5]. Furthermore, the information source for RAG can be data scraped directly from the web [5,6]. Although the idea of integrating web search with language models is not new, recent research highlights a mutual benefit. Search engines offer real-time grounding to reduce hallucinations, while language models provide the reasoning needed to synthesize complex search results [7,8].

In addition, AI development in Kazakhstan has accelerated, driven by the creation of open-source Kazakh-centric models, training corpora, and evaluation benchmarks. Initiatives such as KazLLM (8B and 70B) [9] and Sherkala [10], both based on the Llama 3.1 architecture, have advanced language understanding adapted to the region’s agglutinative structure [11] and distinct cultural context. More recently, the 4B-parameter Qolda [12] model was introduced as a multimodal SLM with reasoning capabilities optimized for Kazakh-centric tasks, further advancing Generative AI (GenAI), particularly SLMs. These models have been released alongside extensive evaluation datasets, including KazMMLU [13], KazBench-kk [14], KazQAD [15], Qorgau [16], and KazCulture [17], which measure LLMs and SLMs across several Kazakh-language tasks.

Although Kazakh-centric models and benchmarks exist, effective integration of real-time web search into these SLM inference pipelines remains underexplored. This study investigates how a web-enhanced prompt–model–output cycle can improve the performance of a localized SLM. We aim to determine the best architectural approach for this integration, comparing direct prompt submission to search engines with advanced query refinement strategies. Gaining this understanding is crucial for building efficient, localized AI systems that deliver accurate, contextually relevant responses without incurring high computational costs from increased parameter counts. The primary contributions of this paper are summarized as follows:

We present the first systematic empirical study of real-time web search integration for a Kazakh-centric SLM. Our evaluation covers three benchmark families: KazMMLU, KazCulture, and MMLU-Pro. We compare multiple query-handling strategies, namely sending raw questions to the search engine and rewriting queries using specialized refinement models.
We distinguish the impact of retrieval from that of query reshaping through controlled comparisons, and conduct a retrieval-quality analysis of the direct-question pipeline to explain why retrieval itself, rather than query reshaping, drives most of the observed accuracy gains.
We benchmark the retrieval-augmented SLM against larger, State-of-the-Art (SOTA) open-source models to assess whether external knowledge retrieval can bridge the parameter gap.

The rest of this paper is structured as follows. A review of the relevant literature on SLM development and RAG architectures is provided in Section 2. Section 3 details the methodology, including the design of the retrieval pipelines and the evaluation framework. The experimental findings and a quantitative analysis of the results are presented in Section 4, followed by a discussion of the implications and limitations of our findings in Section 5. Finally, Section 6 concludes the study and proposes directions for future work.

2. Related Works

The literature surrounding this study includes three related areas: the integration of web search and retrieval mechanisms into language models, the demonstrated effectiveness of RAG for enhancing SLMs, and the emerging body of work applying such techniques to low-resource and Kazakh-language settings.

2.1. Web Search Integration with Language Models

Search engines like Google, Bing or Baidu and language models such as Gemini, ChatGPT or Claude benefit each other. Search engines supply high-quality pretraining data, real-time retrieval context, and ranking signals to language models. In turn, language models enhance query formulation, summarization, and relevance ranking for search systems [7]. Recent information retrieval surveys show that components such as query rewriters, retrievers, rerankers, and readers have improved with large-scale language understanding [8].

Integrating web search with language models primarily addresses the staleness problem [18]. Because most LLMs are trained only once, they struggle with questions that require up-to-date information. For instance, FreshQA, a dynamic benchmark created to evaluate LLM responses to rapidly changing factual questions, is paired with FreshPrompt, a few-shot prompting method that adds live search results to the model’s context. This approach allows open-source models to compete with closed, commercial systems when search is included [18]. The modular LLM-Augmenter system further grounds outputs in external knowledge bases and revises prompts using automated factuality feedback, reducing hallucinations while maintaining fluency [19]. This strategy also improves fact-checking, as supplementing instruction-tuned models with web evidence achieves SOTA results on established benchmarks [20].

More advanced approaches go beyond simple retrieval. One effective example combines web search with structured knowledge graphs through multi-stage sparse and dense retrieval. Adding self-assessment mechanisms for evaluating response trustworthiness further improves performance on both factual and complex reasoning queries [21]. Web-enhanced question-answering (QA) systems built on moderately sized language models can also achieve results competitive with much larger models. This is made possible through carefully designed retrieval pipelines and human-preference-aware scoring strategies [22]. Collectively, these studies suggest that effective web integration requires deliberate architectural choices regarding retrieval quality, context organization, and query formulation, rather than simply appending search results to a prompt. This view is further supported by research on federated search within RAG frameworks, which shows that aggregating results from multiple heterogeneous sources significantly improves response quality compared to single-source retrieval [23].

Furthermore, recent research has shifted from static retrieval pipelines to adaptive architectures that determine when and how to retrieve information. Self-RAG [24] trains language models to emit reflection tokens that trigger on-demand retrieval and critique retrieved passages before generation, resulting in significant improvements over indiscriminate retrieval. Adaptive-RAG [25] builds on this by training a lightweight classifier to route queries among zero-retrieval, single-step, and multi-step strategies based on predicted complexity. Corrective RAG (CRAG) [26] introduces a retrieval evaluator that assesses document relevance and, if retrieval is incorrect, defaults to large-scale web search as a corrective knowledge source. This approach motivates our use of web search as the primary retrieval channel for low-resource languages with limited local corpora. Other work explores hybrid retrieval, combining dense semantic search with sparse lexical matching using fusion methods such as Reciprocal Rank Fusion, which consistently outperform single-paradigm retrievers [27,28]. These developments position retrieval as a dynamic, quality-aware component of the inference pipeline, rather than a fixed preprocessing step.

2.2. Retrieval-Augmented Generation for Small Language Models

Although early RAG research focused on large proprietary models, recent evidence highlights its value for smaller, resource-constrained models. For instance, in a CS1 programming course, a fine-tuned Mistral-7B variant was benchmarked against GPT-3.5-Turbo and GPT-4-32k across nine RAG configurations. Results show that SLM-RAG combinations can match or exceed the performance of larger models while preserving data privacy, which is increasingly important for localized deployments [6]. Similarly, integrating RAG into Llama-3.1-8B and Llama-3.2-3B running locally improves accuracy on the MMLU-Pro benchmark across domains such as biology, law, and computer security, leveraging domain-specific knowledge bases built from synthetic data and targeted web scraping [5]. Collectively, these studies demonstrate that RAG is not just a supplement to large models but an equalizer, enabling smaller architectures to achieve competitive performance.

2.3. Retrieval-Augmented Generation in Kazakh-Language Contexts

Although Kazakh-centric language models such as KazLLM, Sherkala, and the multimodal Qolda model have advanced rapidly, few studies have focused on RAG in this context. A systematic evaluation compared proprietary models (GPT-4o and Gemini-2.5-Flash) with open-source Kazakh-oriented models (KazLLM-8B and Sherkala-8B) in both RAG and closed-book settings for Kazakh question answering. Under optimal RAG conditions, KazLLM improved its answer correctness from 0.427 to 0.867, closely matching GPT-4o. In an end-to-end RAG setup, the open-source model even surpassed GPT-4o’s peak score [29]. These results strongly support RAG as a way to close the performance gap between smaller open-source models and large proprietary systems in low-resource language settings. Studies of LLM behavior under RAG in both English and Kazakh have also examined how models handle factual contradictions across prompt styles. Findings show that models are more sensitive to prompt variations in low-resource multilingual settings, highlighting the importance of careful query design in retrieval-augmented pipelines for Kazakh speakers [30]. In summary, despite the clear conclusion from the literature that RAG and web search integration improve both general and Kazakh-language model performance, the optimal method for integrating live web retrieval into a localized SLM inference pipeline for Kazakh-centric tasks remains underexplored. This study directly addresses this research gap.

3. Methodology

To analyze the impact of web search integration on SLM performance in the Kazakh language, this study uses a framework that highlights three main components: the search service provider, the benchmark datasets, and the architectural selection of an open-source SLM. Incorporating a search service into the evaluation pipeline enables measuring its impact in an isolated, reproducible environment. Additionally, modifying the input search query demonstrates how this action affects final downstream results (see Figure 1).

3.1. Web Search Service Comparison and Selection

In order to find a search service that is sufficiently reliable, low-latency and cost-efficient for integration of real-time web information into the SLM inference pipeline, we compared several leading Search Engine Results Page (SERP) Application Programming Interfaces (APIs). The primary evaluation criteria were cost per one thousand search queries, response time in seconds, and compatibility with the Kazakh language, some of which are shown in Table 1.

As seen in Table 1, general-purpose search providers such as Google and Brave offer first-party proprietary search APIs (Google Vertex AI Search [36] and Brave API [33], respectively). However, at $4.00–$5.00 per 1000 queries, these services are not cost-efficient for large-scale benchmarking. By contrast, high-throughput third-party SERP scrapers such as Serper API [31] and SearchAPI [32] offer greater cost-efficiency ($0.50–$4.00 per 1000 queries), which was one of the reasons we selected Serper API as the search service for this study. Another reason is Serper’s documented response latency of about 2 s per query, which matters because retrieval latency adds directly to the overall inference time of a RAG pipeline. Although this is slightly longer than the approximately 1 s latency of the other providers (see Table 1), it is acceptable for offline benchmarking workloads that prioritize throughput over real-time responsiveness. Moreover, this value is taken from the API’s documentation and does not necessarily match the latency observed in practice; as shown in later sections of this paper, the measured average latency of Serper API is typically lower. In addition, Serper API leverages the Google Search Engine output, which ensures broad coverage of Kazakh-language web content. In support of this, the sample API output illustrated in Figure 2 demonstrates the API’s ability to retrieve relevant results from the .kz domain. As seen in this sample, the service provides structured JavaScript Object Notation (JSON) outputs containing indexed pages ranked by relevance. For each retrieved document, the output includes a title, a Uniform Resource Locator (URL), and a content snippet, which are essential for augmenting the SLM’s context window with up-to-date, localized information.
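To make the structure of such an output concrete, the following minimal sketch extracts the title, URL, and snippet of each result from a SERP-style JSON payload and flags hits from the .kz domain. The payload is a made-up example, and the field names `organic`, `title`, `link`, and `snippet` are assumptions used for illustration rather than a confirmed description of the provider's schema.

```python
from urllib.parse import urlparse

# Hypothetical SERP-style payload; field names and contents are illustrative
# assumptions, not output captured from a real search service.
sample_response = {
    "organic": [
        {"title": "Бешбармақ", "link": "https://example.kz/beshbarmaq",
         "snippet": "Қазақтың ұлттық тағамы."},
        {"title": "Beshbarmak", "link": "https://example.org/wiki/Beshbarmak",
         "snippet": "A traditional dish."},
        {"title": "No snippet", "link": "https://example.kz/empty"},
    ]
}


def extract_results(response, max_results=10):
    """Return (title, url, snippet) tuples for entries that carry a snippet."""
    results = []
    for item in response.get("organic", [])[:max_results]:
        snippet = item.get("snippet")
        if snippet:
            results.append((item.get("title", ""), item.get("link", ""), snippet))
    return results


def is_kz_domain(url):
    """True when the URL's host ends with the .kz top-level domain."""
    host = urlparse(url).hostname or ""
    return host.endswith(".kz")


results = extract_results(sample_response)
kz_hits = [r for r in results if is_kz_domain(r[1])]
print(len(results), "results with snippets,", len(kz_hits), "from .kz")
```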

3.2. The Benchmarking Datasets

To evaluate model performance across domain-specific and culturally relevant tasks, two primary open-source benchmarks were selected: KazMMLU [13] and KazCulture [17]. These datasets provide comprehensive coverage of both general educational knowledge and localized cultural nuances in the Kazakh language.

KazMMLU is a benchmark consisting of 22,889 questions in Kazakh and Russian across 37 categories including STEM, social sciences, and the humanities. In contrast, KazCulture is a culture-specific benchmark with a test set of 1334 questions that is used to evaluate language models’ cultural understanding of the Kazakh language. It covers categories such as traditional cuisine, the “Household” domain, and clothing. Both datasets are structured as multiple-choice questions (MCQs) with one correct answer. Crucially, both datasets underwent extensive expert linguistic validation by native speakers, ensuring high linguistic fidelity and delivering a reliable “ground truth” for measuring the impact of web search integration.

Additionally, MMLU-Pro [37] was included to monitor the performance of the SLM on challenging, “hard-to-search” reasoning questions. MMLU-Pro comprises approximately 12,000 difficult questions spanning various disciplines. An open-source machine-translated version of MMLU-Pro is available in Kazakh and Russian [38]. Evaluating the SLM on MMLU-Pro in English (EN), Kazakh (KK), and Russian (RU) helps clarify the impact of web search integration across these languages. A summary of these benchmarks is presented in Table 2 and the example prompts are given in Table 3.

3.3. Model Selection and Preliminary Benchmarking

The primary model selection focused on SOTA open-source Kazakh-centric SLMs: Llama-3.1-8B-KazLLM-1.0 (KazLLM) [9], Llama-3.1-8B-Sherkala-Chat (Sherkala) [10], and Qolda [12]. KazLLM and Sherkala are 8-billion-parameter models, whereas Qolda has 4 billion parameters. Qolda is also a vision–language model with reasoning capabilities. It operates in two modes: think mode, which uses Chain-of-Thought (CoT) reasoning by generating <think> tokens before providing an answer, and nothink mode, which bypasses this process and responds directly. To provide quantitative justification for model selection, these three models were evaluated on the KazMMLU, KazCulture, and MMLU-Pro benchmarks.

The evaluation was conducted in generation-based mode using an NVIDIA RTX 5090 Graphics Processing Unit (GPU) equipped with the vLLM inference engine to enhance evaluation speed. For each sample in every benchmark, the question and available options were presented to the model, which was instructed to return a JSON output containing the answer key indicating the correct letter of the selected option. The answer key with its value was then extracted from the JSON response and compared with the ground-truth label. A response was considered correct only in the case of an exact match. The total number of correctly answered questions was divided by the total sample size to calculate the accuracy.
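The scoring rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the JSON key is taken to be `answer`, the first brace-delimited object in the output is parsed, and unparseable outputs count as incorrect. The actual extraction logic used in the study may differ.

```python
import json
import re


def extract_answer(raw_output):
    """Return the value of the "answer" key from the first JSON object, or None."""
    match = re.search(r"\{.*?\}", raw_output, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    answer = payload.get("answer")
    return answer if isinstance(answer, str) else None


def accuracy(outputs, labels):
    """Share of outputs whose extracted letter exactly matches the label."""
    correct = sum(extract_answer(o) == y for o, y in zip(outputs, labels))
    return correct / len(labels)


# Toy outputs: two exact matches, one output without JSON, one case mismatch.
outputs = ['{"answer": "B"}', 'Sure! {"answer": "C"}', "I think it is A", '{"answer": "d"}']
labels = ["B", "C", "A", "D"]
score = accuracy(outputs, labels)
print(f"accuracy = {score:.2f}")
```

Because only exact matches count, the lowercase `d` in the last toy output is scored as incorrect.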

The benchmarking results are presented in Table 4. In standard inference (nothink) mode, Qolda outperforms KazLLM and Sherkala across all three benchmarks, with an average lead of approximately 6%. In reasoning-enabled (think) mode, the gap widens further, with Qolda reaching an average accuracy of 61.24%. Based on these results, as well as Qolda’s reasoning capabilities and smaller parameter count, Qolda was selected as the primary model for evaluating web search integration.

3.4. Web Search Integration and Search Query Optimization

In a standard SLM interaction, the user input is directly provided to the model, which then generates an output (see Figure 1, Pipeline A). In this study, a web search component is added before the user prompt reaches the model. In the Naïve RAG approach, the web search service receives the same input prompt as the model, and the retrieved information is passed to the model together with the original prompt, with the aim of improving response accuracy. For further comparison, an alternative approach is introduced: before the input prompt is submitted to the web search service, a model refines it into a searchable query, simulating human search behavior. Three models are evaluated as potential query refiners: GPT-5-Nano, Gemini-3-Flash (both proprietary models), and Qolda. In the case of Qolda, using the question-answering model additionally as a refiner avoids reliance on third-party models. The complete pipeline is depicted in Figure 1.

Serper API functions as a web search tool. For each submitted prompt, it returns up to ten relevant search results. The snippets from these results are concatenated into a single block of text, which is then provided to the model along with the initial prompt. The refined approach (see Figure 1, Pipeline C) addresses a key challenge identified in preliminary experiments: the linguistic mismatch between formal academic benchmark questions and effective search engine queries. Standard samples from datasets such as KazMMLU often include lengthy prose and complex syntax, which generate lexical noise when used as search queries. Introducing a specialized preprocessing layer converts these academic prompts into concise, search-optimized keywords, clarifying the question’s intent and reducing irrelevant retrievals.

An algorithmic description of the pipeline is provided in Algorithm 1. Let $\mathrm{Model}_{\mathrm{SLM}}$ represent the evaluator language model, $\mathrm{SearchAPI}(\cdot)$ the retrieval function that returns web snippets for a given query, and $\mathrm{Format}(\cdot, \cdot)$ the function that combines a prompt with retrieved snippets into an augmented prompt. $\mathrm{QueryOptimizer}(\cdot)$ denotes the optional query-refinement model. Given an input prompt $P$, the three pipelines generate a response $R$ as follows:

Zero-Shot: $R = \mathrm{Model}_{\mathrm{SLM}}(P)$

Naïve RAG: $R = \mathrm{Model}_{\mathrm{SLM}}(\mathrm{Format}(P, \mathrm{SearchAPI}(P)))$

Query-Refined RAG: $R = \mathrm{Model}_{\mathrm{SLM}}(\mathrm{Format}(P, \mathrm{SearchAPI}(\mathrm{QueryOptimizer}(P))))$

Algorithm 1 Web search-enhanced inference for Kazakh-centric SLM: Zero-Shot, Naïve RAG, and Query-Refined RAG pipelines

Require: Input prompt $P$, workflow strategy $S_{mode}$

Ensure: Final model response $R$

{Determine the retrieval strategy}
if $S_{mode} = \text{Zero-Shot}$ then
    {Standard inference without external context}
    $R \leftarrow \mathrm{Model}_{\mathrm{SLM}}(P)$
else if $S_{mode} = \text{Naïve RAG}$ then
    {Direct retrieval using the raw user prompt}
    $C_{web} \leftarrow \mathrm{SearchAPI}(P)$
    $P_{augmented} \leftarrow \mathrm{Format}(P, C_{web})$
    $R \leftarrow \mathrm{Model}_{\mathrm{SLM}}(P_{augmented})$
else if $S_{mode} = \text{Query-Refined RAG}$ then
    {Synthesize an optimized search query from the prompt}
    $Q_{opt} \leftarrow \mathrm{QueryOptimizer}(P)$
    $C_{web} \leftarrow \mathrm{SearchAPI}(Q_{opt})$
    $P_{augmented} \leftarrow \mathrm{Format}(P, C_{web})$
    $R \leftarrow \mathrm{Model}_{\mathrm{SLM}}(P_{augmented})$
end if
return $R$

The retrieval function SearchAPI remains the same across all configurations; only its input changes. In Naïve RAG, the raw prompt serves as the query. In Query-Refined RAG, a language model rewrites P into a concise, keyword-focused query. Comparing Zero-Shot and Naïve RAG isolates the combined effect of retrieval and context assembly, while comparing Naïve RAG and Query-Refined RAG isolates the effect of query refinement.
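The three pipelines can be sketched in a few lines of Python, with the model, the search function, and the query optimizer passed in as plain callables. The stub functions and the prompt layout in `format_prompt` are hypothetical placeholders; they only illustrate that the retrieval call is shared and that the pipelines differ in the query it receives.

```python
from typing import Callable, List, Optional


def format_prompt(prompt: str, snippets: List[str]) -> str:
    """Format(P, C_web): hypothetical layout placing snippets before the prompt."""
    context = "\n".join(snippets)
    return f"Context:\n{context}\n\nQuestion:\n{prompt}"


def run_pipeline(
    prompt: str,
    mode: str,
    model: Callable[[str], str],
    search_api: Callable[[str], List[str]],
    query_optimizer: Optional[Callable[[str], str]] = None,
) -> str:
    """Dispatch a prompt through the Zero-Shot, Naive RAG, or Query-Refined RAG path."""
    if mode == "zero_shot":
        return model(prompt)
    if mode == "naive_rag":
        query = prompt  # the raw prompt doubles as the search query
    elif mode == "query_refined_rag":
        if query_optimizer is None:
            raise ValueError("query_refined_rag requires a query_optimizer")
        query = query_optimizer(prompt)
    else:
        raise ValueError(f"unknown mode: {mode}")
    snippets = search_api(query)
    return model(format_prompt(prompt, snippets))


# Stubs for illustration only: they record calls instead of doing real work.
seen_queries = []


def stub_search(query):
    seen_queries.append(query)
    return [f"snippet about {query}"]


def stub_model(text):
    return text  # echo the prompt the model would receive


def stub_optimizer(prompt):
    return "short query"


zero = run_pipeline("Long question?", "zero_shot", stub_model, stub_search)
naive = run_pipeline("Long question?", "naive_rag", stub_model, stub_search)
refined = run_pipeline("Long question?", "query_refined_rag", stub_model,
                       stub_search, stub_optimizer)
```

In both retrieval modes the model still receives the original prompt; only the text sent to the search function changes.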

Model selection for query refinement uses a tiered evaluation strategy. GPT-5-Nano and Gemini-3-Flash represent advanced proprietary models that deliver high semantic extraction accuracy with low latency for real-time, web-integrated pipelines. Including Qolda as its own optimizer allows the assessment of a self-sufficient pipeline. This comparison determines whether a localized, Kazakh-centric small language model can independently refine external knowledge retrieval without third-party APIs. The framework enables quantitative evaluation of performance gains from high-level reasoning models during retrieval compared to the localized inference model’s capabilities. The prompt used to refine the raw benchmark question into a search query is presented in Figure 3.

During query optimization, Qolda was used as a refiner only in nothink mode because think mode caused computational overhead. In a random sample of 100 benchmark questions, the model generated an average of 68.27 tokens in nothink mode, compared to 613.55 tokens in reasoning-enabled mode. Since this substantial increase in token generation occurred during the initial refinement step, before final evaluator inference, think mode was considered impractical for query refinement, as it increased inference time.
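The size of this overhead follows directly from the two averages reported above. The short calculation below derives the ratio; treating decoding time as roughly proportional to the number of generated tokens is an assumption made for illustration.

```python
nothink_tokens = 68.27   # average generated tokens per refined query, nothink mode
think_tokens = 613.55    # average generated tokens per refined query, think mode

ratio = think_tokens / nothink_tokens
extra_tokens = think_tokens - nothink_tokens

print(f"think / nothink token ratio: {ratio:.2f}x")
print(f"extra tokens per refined query: {extra_tokens:.2f}")
```

Under that assumption, refining a query in think mode would take roughly nine times as long as in nothink mode, before the evaluator model has started answering.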

For each prompt, the Serper API returned up to ten results. Snippets were used as received, without filtering, reranking, or deduplication, and concatenated in the order provided by Serper into a single passage block. This block was inserted into the Context field of the evaluator prompt (see Figure 4). For Qolda inference, we used the decoding parameters recommended by the Qolda paper [12]. Specifically, we used a temperature of 0.6, a top-p of 0.95, a top-k of 20, and a min-p of 0 in think mode, and a temperature of 0.7, a top-p of 0.8, a top-k of 20, and a min-p of 0 in nothink mode. These parameters were applied consistently across all the Qolda configurations (Zero-Shot, Naïve RAG, and Query-Refined RAG) and benchmarks, ensuring that any accuracy differences reflect the pipeline design rather than decoding choices. No Serper responses were cached between the pipeline configurations. The retrieval runs for each configuration were executed independently within the same evaluation window.
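A minimal sketch of this context assembly and of the two decoding presets is given below. The decoding values are those stated above; the newline separator, the `PROMPT_TEMPLATE` layout, and the request dictionary are hypothetical, since the actual evaluator prompt is the one shown in Figure 4.

```python
# Decoding presets as stated in the text, keyed by reasoning mode.
DECODING = {
    "think": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "nothink": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}

# Hypothetical layout; the study's real template is given in Figure 4.
PROMPT_TEMPLATE = "Context:\n{context}\n\nQuestion:\n{question}"


def build_context(snippets, max_results=10):
    """Join up to max_results snippets in the order received, without
    filtering, reranking, or deduplication."""
    return "\n".join(snippets[:max_results])


def build_request(question, snippets, mode):
    """Bundle the augmented prompt with the decoding preset for the given mode."""
    return {
        "prompt": PROMPT_TEMPLATE.format(context=build_context(snippets),
                                         question=question),
        "sampling": dict(DECODING[mode]),
    }


request = build_request("Which dish is served first?", ["b", "a", "a"], "think")
print(request["prompt"])
```

Duplicate snippets are deliberately kept in this sketch, mirroring the statement that no deduplication was applied.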

3.5. Fine-Tuning

To improve the localized model’s query refinement performance, we fine-tuned Qolda using the Low-Rank Adaptation (LoRA) method with traditional supervised fine-tuning (SFT). The refined output of Gemini-3-Flash based on the KazMMLU dataset was used as training data for this LoRA. This dataset, consisting of 22,889 samples, was then randomly shuffled and split, with 90% of it becoming training data and 10% becoming validation data. We selected the KazMMLU benchmark as the primary training source for its bilingual coverage (Kazakh and Russian) and broad subject range. Other datasets were excluded for methodological reasons: MMLU-Pro was omitted because its reasoning-heavy questions are less suitable for keyword distillation, and KazCulture was excluded due to its small sample size, which limited effective model adaptation. The SFT training data includes only question–short search query pairs. The training procedure does not include ground-truth answer labels. It is important to note, however, that Qolda-SFT encounters KazMMLU questions during SFT and is later evaluated on KazMMLU as a downstream QA task. While no ground-truth answer labels are used during training, this still constitutes benchmark-specific exposure at the question level. Qolda-SFT may therefore learn surface-level regularities unique to KazMMLU, such as stylistic patterns, topic distribution, or typical phrasing, that improve its query reformulation on KazMMLU relative to unseen benchmarks. We accordingly interpret Qolda-SFT’s KazMMLU results as in-distribution evidence, closer to training-set validation than to held-out evaluation, and rely primarily on its performance on KazCulture and MMLU-Pro (EN, KK, and RU), which were not included in the SFT training data, to demonstrate that the learned query-compression strategy generalizes beyond KazMMLU-specific patterns.
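The shuffle-and-split step can be illustrated as follows. The seed value and the rounding rule (truncating the training share to an integer) are assumptions, and the question and query strings are placeholders rather than real KazMMLU items.

```python
import random


def shuffle_and_split(samples, train_fraction=0.9, seed=0):
    """Shuffle a copy of the samples and split it into train and validation parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seed is a hypothetical choice
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder (question, short search query) pairs, one per KazMMLU sample.
pairs = [(f"question {i}", f"query {i}") for i in range(22889)]
train, val = shuffle_and_split(pairs)
print(len(train), len(val))
```

Note that the pairs contain only a question and its short search query, consistent with the statement that no answer labels enter the training data.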

The SFT LoRA was trained for three epochs with a Learning Rate (LR) of $10^{-4}$, the cosine LR scheduler, a rank of 32, an $\alpha$ of 32, and the adamw_torch optimizer. Training used an effective batch size of 64 on a workstation with an RTX 5090 GPU with 32 GB of video memory. In addition, the training script was run using DeepSpeed with stage 2 optimization and Flash Attention 2 enabled. The maximum sequence length during training was 1024 tokens, and the weights were stored in bfloat16. Based on the training and validation error metrics, the LoRA checkpoint at the end of the third epoch was chosen, as its error decreased consistently on both the training and validation sets and it showed no signs of overfitting or catastrophic forgetting. The finalized LoRA was merged with the base Qolda model, and the merged model was then tested. This fine-tuned version of Qolda was designed to reduce the end user’s dependency on proprietary APIs, enabling a fully locally deployable pipeline for Kazakh-language information retrieval.
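For reference, the reported hyperparameters can be collected in a small configuration object, as sketched below. The class and its field names are hypothetical and not tied to any training library. The scaling factor assumes the standard LoRA formulation, in which the low-rank update is scaled by $\alpha / r$, and the per-device batch size used in the example is an assumption, since only the effective batch size is reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSftConfig:
    """Reported SFT settings; field names are illustrative, not a library API."""
    epochs: int = 3
    learning_rate: float = 1e-4
    lr_scheduler: str = "cosine"
    rank: int = 32
    alpha: int = 32
    optimizer: str = "adamw_torch"
    effective_batch_size: int = 64
    max_seq_length: int = 1024
    dtype: str = "bfloat16"

    @property
    def lora_scaling(self) -> float:
        """Scaling applied to the low-rank update in standard LoRA (alpha / rank)."""
        return self.alpha / self.rank

    def grad_accum_steps(self, per_device_batch: int, n_devices: int = 1) -> int:
        """Accumulation steps needed to reach the effective batch size."""
        per_step = per_device_batch * n_devices
        if self.effective_batch_size % per_step:
            raise ValueError("effective batch size must be divisible by per-step batch")
        return self.effective_batch_size // per_step


cfg = LoraSftConfig()
print(cfg.lora_scaling)          # alpha equals rank, so the update is unscaled
print(cfg.grad_accum_steps(4))   # assuming a hypothetical per-device batch of 4
```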

3.6. Evaluation

We assessed the Naïve RAG and Query-Refined RAG pipelines using the same standardized benchmarking protocol as in preliminary model selection (see Section 3.3). Each configuration was evaluated on the KazMMLU, KazCulture, and MMLU-Pro datasets. Accuracy, defined as the ratio of correct answers to total questions, was the primary metric. By maintaining a consistent evaluation environment with the vLLM inference engine and identical system prompts (see Figure 4), we ensured a controlled comparison between unrefined search integration and LLM-based query optimization strategies.

To determine whether the observed gains of the Naïve RAG and Query-Refined RAG pipelines over the base Qolda model represent improvement rather than sampling variation, we computed 95% bootstrap confidence intervals for each configuration’s accuracy. We extracted binary correctness indicators for MMLU-Pro, KazMMLU, and KazCulture, and then applied a stratified vectorized bootstrap with 10,000 resamples with replacement to each array. We stratified the samples by benchmark and, within each benchmark, by subject category. Each category was resampled independently, preserving its original size. The stratum-level correctness arrays were then aggregated to compute a resampled overall accuracy. This approach ensures the bootstrap distribution reflects sampling variability within each category, rather than random reallocation of questions across categories. For MMLU-Pro, the three language subsets (EN, KK, and RU) were bootstrapped independently, with category-level stratification applied within each subset. For each dataset and pipeline configuration, we report the empirical mean accuracy and the corresponding 95% confidence interval.
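The resampling scheme can be sketched with the standard library alone. This version loops over resamples instead of vectorizing them, uses the percentile method for the interval bounds, and takes a hypothetical seed; all three are simplifying assumptions, and the correctness flags below are toy data.

```python
import random


def stratified_bootstrap_ci(strata, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for overall accuracy.

    strata maps a (benchmark, category) key to a list of 0/1 correctness flags.
    Each stratum is resampled with replacement at its original size.
    """
    rng = random.Random(seed)
    total = sum(len(flags) for flags in strata.values())
    point = sum(sum(flags) for flags in strata.values()) / total
    accuracies = []
    for _ in range(n_resamples):
        correct = 0
        for flags in strata.values():
            correct += sum(rng.choices(flags, k=len(flags)))
        accuracies.append(correct / total)
    accuracies.sort()
    tail = (1 - level) / 2
    lower = accuracies[int(tail * n_resamples)]
    upper = accuracies[min(n_resamples - 1, int((1 - tail) * n_resamples))]
    return point, lower, upper


# Toy strata: 80/100 correct in one category, 30/50 in another.
toy = {
    ("KazMMLU", "history"): [1] * 80 + [0] * 20,
    ("KazMMLU", "biology"): [1] * 30 + [0] * 20,
}
point, lower, upper = stratified_bootstrap_ci(toy, n_resamples=2000)
print(f"accuracy = {point:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```

Because each category is resampled at its own size, a category answered entirely correctly stays entirely correct in every resample, which is the property the stratification is meant to preserve.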

Although accuracy remained the primary metric, web search integration increased the model’s context size. Hence, we also conducted a quantitative analysis of input size by calculating the average number of words processed by the evaluator model to assess the computational overhead. We further evaluated prompt-to-query transformation efficiency by comparing the average word counts of the original benchmark questions with those of the refined search queries generated by the optimizer models.
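A minimal sketch of the word-count comparison is shown below. Splitting on whitespace is an assumed tokenization rule, and the questions and queries are invented examples rather than benchmark items.

```python
def mean_word_count(texts):
    """Average number of whitespace-separated words per text."""
    return sum(len(text.split()) for text in texts) / len(texts)


# Invented examples of raw questions and their refined search queries.
questions = [
    "Which traditional Kazakh dish is usually served to honored guests at a feast?",
    "In which year did Kazakhstan declare its independence?",
]
queries = [
    "Kazakh dish honored guests",
    "Kazakhstan independence year",
]

question_words = mean_word_count(questions)
query_words = mean_word_count(queries)
print(f"questions: {question_words} words, queries: {query_words} words")
print(f"query length as a share of question length: {query_words / question_words:.2f}")
```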

Inference time, accuracy, and context size were measured to provide a comprehensive comparison across configurations. For the timing analysis, 100 benchmark questions were selected using a fixed-seed (seed = 19) pseudorandom sampler stratified across the three benchmarks. The same question set was used for all the pipeline configurations to ensure that latency differences reflect the pipeline design, not sample variation. The mean and standard deviation of the per-sample generation time, in seconds, were then calculated on a single RTX 5090 workstation. Qolda-SFT was excluded from direct timing and assumed to have the same runtime as Qolda based on their identical model size and architecture. A warm-up request was sent to the local inference server before measurement to avoid a cold-start slowdown. For each evaluation, execution times were recorded separately for up to three sequential stages: search query generation, Serper-mediated web retrieval, and final multiple-choice inference. Per-stage and total latencies were then aggregated into mean and standard deviation estimates for each pipeline component. This approach allowed direct assessment of the latency overhead introduced by each retrieval strategy compared to the Zero-Shot baseline.
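The per-stage bookkeeping can be sketched as follows. The stage names, the `timed` helper, and the use of the sample standard deviation are assumptions, and the latency values are invented to keep the example deterministic.

```python
import statistics
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def summarize(latencies):
    """Aggregate per-question stage timings into (mean, std) per stage and in total.

    latencies is a list of dicts mapping a stage name to seconds for one question;
    stages that a pipeline skips are simply absent from the dict.
    """
    summary = {}
    stages = sorted({stage for row in latencies for stage in row})
    for stage in stages:
        values = [row[stage] for row in latencies if stage in row]
        spread = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[stage] = (statistics.mean(values), spread)
    totals = [sum(row.values()) for row in latencies]
    total_spread = statistics.stdev(totals) if len(totals) > 1 else 0.0
    summary["total"] = (statistics.mean(totals), total_spread)
    return summary


# Invented timings for two questions in a query-refined pipeline.
rows = [
    {"refine": 1.0, "search": 1.0, "inference": 4.0},
    {"refine": 3.0, "search": 1.0, "inference": 6.0},
]
stats = summarize(rows)
for stage, (mean_s, std_s) in stats.items():
    print(f"{stage}: {mean_s:.2f} s (std {std_s:.2f} s)")
```

In an actual run, each stage call would be wrapped with `timed` and the elapsed values stored in one dictionary per question before calling `summarize`.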

Additionally, we assessed the quality of retrieved search result snippets for Naïve RAG using an LLM-as-a-judge approach, with Gemini-3.1-Flash-Lite as the evaluator. Snippets were classified into four groups based on their relevance to the question and options: Explicit, Supportive, Irrelevant, and Misleading. Explicit indicates the answer is directly present in the search results. Supportive means the results do not provide a direct answer but can inform the LLM’s reasoning. Irrelevant refers to results unrelated to the question. Misleading results include those that are self-contradictory or provide an incorrect answer. The prompt used to evaluate the quality of the retrieved snippet is given in Figure 5. To further evaluate retrieval quality, we counted correct and incorrect answers in each group to determine how each snippet type contributed to the pipeline’s QA performance.

We selected the best-performing RAG pipeline within our evaluated set based on the accuracy–latency trade-off, and then compared it to the Zero-Shot baseline using McNemar’s paired test. Predictions were paired by question across all three datasets and both reasoning modes. The test statistic was calculated from discordant pairs in the resulting 2 × 2 contingency tables. When there were enough discordant pairs, we used a χ² approximation with continuity correction; otherwise, we applied an exact binomial test. The resulting p-values show whether the accuracy difference between the optimal pipeline and the Zero-Shot baseline represents a true improvement rather than sampling noise.
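For reference, the standard textbook form of the test described above is given below. The symbols are our own notation rather than the paper's: b is the number of questions answered correctly by the baseline only, c the number answered correctly by the RAG pipeline only, and n = b + c.

```latex
% Chi-square approximation with continuity correction (1 degree of freedom)
\chi^{2} = \frac{\left(\lvert b - c \rvert - 1\right)^{2}}{b + c}

% Exact two-sided binomial alternative for small n = b + c
p = \min\!\left(1,\; 2 \sum_{k=0}^{\min(b,c)} \binom{n}{k} \left(\tfrac{1}{2}\right)^{n}\right)
```

Only the discordant pairs enter either form; questions that both systems answer correctly, or both answer incorrectly, carry no information about which system is better.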

To summarize the trade-offs among cost, accuracy, and latency across all configurations, we plotted the results as a Pareto chart. Performance metrics for the six model configurations were plotted across the two reasoning modes (nothink and think), with latency on the x-axis and accuracy on the y-axis.
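As an illustration of how a Pareto frontier over latency and accuracy can be identified, the following sketch marks a configuration as dominated when another one is at least as fast and at least as accurate, and strictly better on one of the two. The function name and the four example points are hypothetical placeholders, not values from the study.

```python
def pareto_frontier(points):
    """Return names of configurations not dominated on (latency, accuracy).

    points: dict mapping name -> (latency_seconds, accuracy_percent).
    Lower latency and higher accuracy are better.
    """
    frontier = []
    for name, (lat, acc) in points.items():
        dominated = any(
            (l2 <= lat and a2 >= acc) and (l2 < lat or a2 > acc)
            for other, (l2, a2) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    # Order the frontier from fastest to slowest for plotting.
    return sorted(frontier, key=lambda n: points[n][0])

# Hypothetical configurations: C is slower AND less accurate than B.
example = {"A": (0.1, 60.0), "B": (1.3, 76.0), "C": (2.6, 74.0), "D": (5.0, 79.5)}
print(pareto_frontier(example))  # ['A', 'B', 'D']
```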

Finally, we benchmarked the optimal retrieval method against larger SOTA models, including Qwen3-32B and Gemma-3-27b-it, to determine if retrieval-augmented SLMs are viable alternatives to high-parameter architectures. Qwen3-32B and Gemma-3-27b-it were evaluated using the same Zero-Shot configuration as the Qolda baseline, with identical prompt templates, answer-extraction logic, and benchmark partitions, and without retrieval augmentation. This ensures the results are directly comparable to the Qolda Zero-Shot row. We used an RTX 5090 GPU for web search RAG approaches and an NVIDIA H100 GPU for Qwen3-32B and Gemma-3-27b-it.

4. Results

4.1. Comparative Performance of RAG Pipelines

Table 5 compares the accuracy percentages for the Zero-Shot baseline, Naïve RAG, and various Query-Refined RAG configurations. The Zero-Shot baseline results are the same as those reported for Qolda in Table 4. In these experiments, Qolda is the primary evaluator, operating in either nothink or think mode. Integrating web search consistently improved accuracy over the Zero-Shot baseline. As shown in Table 5, entering benchmark questions directly into web search (Naïve RAG) was competitive with, and in several cases better than, query refinement. For example, optimizing queries with GPT-5-Nano resulted in lower performance than Naïve RAG, although it still outperformed the baseline.

Query optimization with Gemini-3-Flash achieved the highest overall performance, reaching 79.46% average accuracy in think mode of the Qolda evaluator. This improvement motivated the development of our localized Qolda-SFT model through distillation (see Section 3.5). Before evaluating the fine-tuned variants, we tested the unmodified Qolda as its own query refiner. Although this setup outperformed the Zero-Shot baseline and matched the GPT-optimized pipeline, it did not exceed the Naïve RAG results, making it impractical for general use. Qolda-SFT was the highest-performing localized query optimizer. As shown in Table 5, it outperformed the Zero-Shot baseline, GPT-5-Nano refiner, and base Qolda optimizer. Qolda-SFT’s performance was nearly identical to that of Naïve RAG, averaging 76.19% in think mode of the Qolda evaluator. It is especially effective in culture-specific scenarios. For example, on the KazCulture benchmark, Qolda-SFT achieved 66.34% in nothink mode, surpassing Naïve RAG (65.66%). However, since the average difference between SFT and Naïve RAG is marginal across all benchmarks, Naïve RAG remains the most efficient baseline for general use. Qolda-SFT was fine-tuned on KazMMLU-derived query reformulation examples and evaluated on the same benchmark. However, it did not outperform Naïve RAG on KazMMLU, suggesting that the distilled query reformulation strategy offers limited advantage over direct question-based retrieval. Out-of-distribution results are similarly informative. On English MMLU-Pro, Qolda-SFT achieves 49.94% in nothink mode (compared to 51.44% for Naïve RAG) and 67.28% in think mode (compared to 68.01% for Naïve RAG), staying within about one point of the best non-proprietary configuration, despite not being exposed to English reasoning questions during SFT training. 
Along with the earlier KazCulture results, this cross-benchmark consistency indicates that Qolda-SFT has learned a general query-compression approach, extracting keyword-dense search strings from academic MCQs, rather than memorizing KazMMLU-specific patterns.

While Gemini-3-Flash offers the best refinement, its average improvement over the simpler Naïve RAG approach is about 3.5 percentage points, from 76.00% to 79.46%. As a proprietary external service, it may also be impractical for privacy-sensitive or cost-constrained applications. The impact of the evaluator’s reasoning mode varied across datasets. As shown in Table 5, think mode generally improved accuracy on KazMMLU. However, performance declined on the KazCulture benchmark across several configurations. For example, the Gemini-3-Flash pipeline dropped from 70.61% to 69.64% when the Qolda evaluator was switched to think mode. This suggests that extended reasoning chains may sometimes introduce noise or “over-thinking” when addressing localized cultural nuances.

4.2. Multilingual Evaluation on MMLU-Pro

In Table 6, we compare Qolda’s performance on the English (EN), Kazakh (KK), and Russian (RU) subsets of the MMLU-Pro benchmark under different web search configurations. English questions consistently achieve the highest accuracy, followed by Russian and Kazakh. Query optimization using Gemini-3-Flash was the most effective retrieval strategy for complex reasoning tasks, achieving a peak average accuracy of 64.43% (average across the three languages) in think mode of the Qolda evaluator.

A detailed analysis of the English subset shows that web search integration improves performance in nothink mode of Qolda, increasing accuracy from 42.12% to 51.44% with Naïve RAG. In think mode, the improvement is smaller, from 65.00% to 68.01%. In contrast, web search integration has little or negative impact on the Kazakh and Russian subsets. This indicates that the current web indices or the model’s context processing are better optimized for English technical content than for localized translations in MMLU-Pro.

4.3. Statistical Significance of Accuracy Gains

To determine whether the observed accuracy gains reflect genuine performance improvements rather than random variation in benchmark sampling, we computed 95% bootstrap confidence intervals based on 10,000 resamples with replacement for each configuration and benchmark. The resulting intervals are reported in Table 7.

On KazMMLU and KazCulture, the Naïve RAG confidence intervals do not overlap with those of the Zero-Shot baseline in either reasoning mode, confirming that the improvements are statistically robust. For example, on KazMMLU in think mode, the Zero-Shot interval of [66.66, 67.86] is fully separated from the Naïve RAG interval of [82.47, 83.42], representing a gap of more than 15 points that is not attributable to sampling. On KazCulture, the smaller test set of 1334 questions results in wider intervals (approximately ±2.5 points), yet the Zero-Shot interval of [46.85, 52.10] remains well below the Naïve RAG interval of [63.12, 68.22] in nothink mode. On MMLU-Pro, results vary by reasoning mode: the Zero-Shot-to-Naïve-RAG gap is significant in nothink mode (37.66% [37.18, 38.16] vs. 41.23% [40.74, 41.73]), but the intervals overlap in think mode (62.11% [61.61, 62.61] vs. 62.71% [62.21, 63.21]). This indicates that web search adds little value to advanced reasoning tasks when the model already reasons extensively.

The confidence intervals for the Gemini-3-Flash-over-Naïve-RAG margin, averaging about 3 percentage points, are also informative. For instance, on KazMMLU in think mode, Naïve RAG achieves 82.94% [82.47, 83.42], while Gemini-3-Flash refinement achieves 84.46% [84.00, 84.93]. The intervals are narrow and do not overlap, so the 1.5-point difference is unlikely to be a sampling artifact. However, the practical value of this improvement should be considered in light of the up to 4 s of additional latency and the reliance on a proprietary API.

4.4. Computational Overhead and Query Structural Analysis

Integrating web search improves accuracy on the KazMMLU and KazCulture benchmarks, but it also introduces computational overhead. Moving from a Zero-Shot baseline to retrieval-augmented pipelines greatly increases the average input word count for the evaluator model, because including multiple search snippets raises token-processing demands during inference. Table 8 shows that integrating RAG architectures substantially expands the input context. In the Zero-Shot baseline, the average input size, representing the raw benchmark question, did not exceed 22 words. When web search is enabled, the total word count processed by the evaluator model (Qolda) increases roughly eight- to ninefold, with Naïve RAG averaging 185.05 words and Qolda-SFT reaching 202.87 words. This growth is mainly due to concatenating search snippets, which increases the context size the model must process to generate a response.

Table 9 highlights the transformation of the original benchmark questions into refined search queries through keyword extraction and intent distillation. In the Naïve RAG configuration, the search query is identical to the baseline question (averaging 12.86 words), since the raw benchmark questions were used as the search queries. In contrast, the query refiner models generally produce more concise, semantically dense queries. For instance, Gemini-3-Flash and Qolda-SFT reduced the average query length to 7.21 and 6.39 words, respectively.

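Although shorter search queries are efficient, the resulting increase in retrieved context does not always lead to higher accuracy. For example, Qolda-SFT shortens queries to 6.39 words on average yet yields a larger evaluator input than Naïve RAG (202.87 vs. 185.05 words) without a clear accuracy advantage. This suggests that while distillation-based query optimization (SFT) can generate compact queries that retrieve more text, the additional context may introduce noise or exceed the model’s effective attention span. The unrefined Naïve RAG baseline therefore remains the more cost-effective option at comparable accuracy.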

Table 10 presents a qualitative comparison of the query transformation strategies for the KazCulture benchmark. The original Naïve RAG prompt is a full interrogative sentence in Kazakh about traditional jewelry. GPT-5-Nano and Qolda refiners generally retain a sentence-like structure or add descriptive terms. In contrast, Gemini-3-Flash and Qolda-SFT models more effectively condense the prompt into semantically dense keywords. The Qolda-SFT model, distilled from Gemini, generates the most concise search string (“flat gold silver embossed bracelet”), removing grammatical fillers while retaining essential cultural entities for web retrieval. This result is consistent with Table 9, confirming that the SFT process enabled the smaller model to replicate the teacher model’s efficient, keyword-focused search behavior.

A central challenge in retrieval-augmented evaluation is determining whether accuracy gains arise from reasoning over relevant context or from direct exposure to answers in retrieved snippets. To address this, we used Gemini-3.1-Flash-Lite as an LLM-as-a-judge to classify retrieved snippets for 200 sampled questions from each benchmark language subset, totaling 1200 questions. After excluding 111 queries with empty retrievals, 1089 queries remained. Of these, 411 (37.7%) were Explicit, 212 (19.5%) Supportive, 437 (40.1%) Irrelevant, and 29 (2.7%) Misleading.
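The reported category shares follow directly from the counts in the preceding paragraph, as this short check shows. The variable names are ours, and the numbers are those stated in the text.

```python
# Counts reported in the text for the LLM-as-a-judge snippet classification.
counts = {"Explicit": 411, "Supportive": 212, "Irrelevant": 437, "Misleading": 29}
sampled, empty_retrievals = 1200, 111

retained = sampled - empty_retrievals  # queries with at least one snippet
assert retained == sum(counts.values()) == 1089

# Share of each category among retained queries, in percent (one decimal).
shares = {name: round(100 * n / retained, 1) for name, n in counts.items()}
print(shares)
# {'Explicit': 37.7, 'Supportive': 19.5, 'Irrelevant': 40.1, 'Misleading': 2.7}
```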

As shown in Table 11, retrieval quality significantly affects accuracy. However, Qolda continues to perform well above chance even with imperfect retrieval. Explicit snippets yield the highest accuracy (87.1% in nothink mode and 88.1% in think mode), though these results are below 100%, indicating that the model must still select the correct option even when the answer is present. Supportive snippets result in 51.4% (nothink) and 67.9% (think), both well above the 25% random-chance baseline for four-option multiple choice, which demonstrates genuine reasoning over partial evidence. Accuracy with Irrelevant snippets drops to 34.6% (nothink) and 55.1% (think), suggesting the model can partially compensate for unhelpful context using parametric knowledge, particularly in think mode. Misleading snippets produce the lowest accuracy (24.1% and 41.4%), indicating some vulnerability to adversarial retrieval, though the small sample size (29 queries) should be considered.

This breakdown leads to two main conclusions. First, the 37.7% Explicit rate sets an upper limit on the extent to which Naïve RAG improvements result from direct answer lookup. The above-chance accuracy with Supportive and Irrelevant snippets confirms that performance gains are not solely due to lookup. Second, the high proportion of Irrelevant retrievals (about 40%) explains why further query refinement, such as with Gemini-3-Flash, provides only marginal improvements. The primary bottleneck is not query phrasing but the limited coverage of Kazakh content in Google’s index.

The category-level breakdown reveals three recurring failure modes when Naïve RAG produces incorrect answers. Retrieval-grounded hallucinations correspond to the Irrelevant cell, the largest error group (286 incorrect responses in nothink mode and 196 in think mode), where the model confidently generates an incorrect answer based on parametric knowledge that the retrieved snippets neither support nor contradict. Snippet-induced errors correspond to the Misleading cell, where retrieved text directly contradicts the correct answer and the model follows the snippet. A third mode, over-reasoning on cultural content, is evident in the drop from 65.66% (nothink) to 62.59% (think) for Naïve RAG on KazCulture in Table 5. Extended reasoning sometimes dismisses a correct surface fact in favor of a more generic interpretation. These patterns suggest that selective retrieval is a promising direction for future work. For example, a lightweight prompt classifier could route culturally grounded questions to nothink-mode Naïve RAG, advanced reasoning questions to think-mode Zero-Shot, and general-knowledge questions to think-mode Naïve RAG.

4.5. Inference Time Analysis

Table 12 presents the total processing time per question for each pipeline configuration, divided into query generation, web search, and inference. In nothink mode, the Zero-Shot baseline completes each question in 0.07 s, reflecting local inference without retrieval. Naïve RAG increases this to 1.27 s, mainly due to the web search component (1.17 s). Query-Refined RAG introduces a query generation step, raising the total to 1.43 s, 2.58 s, and 5.03 s for Qolda, GPT-5-Nano, and Gemini-3-Flash as query refiners. The higher overhead for Gemini-3-Flash (3.86 s for query generation) results from API round-trip latency for the remote model, unlike the locally deployed Qolda refiner.

In think mode, Naïve RAG (10.89 s) is slightly slower than the Zero-Shot baseline (10.28 s), with a marginal difference of 0.61 s. This difference results from two factors: the web search adds about 1.11 s of retrieval overhead, while the inference stage is faster for Naïve RAG (9.78 s) than for Zero-Shot (10.28 s). This suggests that retrieved context reduces the model’s reasoning burden and enables faster convergence compared to relying only on parametric knowledge. The standard deviations in Table 12 clarify these gaps. Retrieval-stage variability remains low, with standard deviations below 0.7 s across all configurations. In contrast, think-mode inference varies by approximately ±3.8 s due to differences in reasoning token lengths. The 0.61 s Naïve-RAG-over-Zero-Shot gap in think mode falls well within one standard deviation of inference-stage variability, aligning with the statistically non-significant McNemar’s test result (see Table 13) for MMLU-Pro in think mode.
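The think-mode latency figures quoted above decompose as follows. The symbols T and t are our own shorthand for total and per-stage mean times, and all numbers are taken from the paragraph.

```latex
\begin{align*}
T_{\text{RAG}} &= t_{\text{search}} + t_{\text{inference}} = 1.11 + 9.78 = 10.89\ \text{s} \\
\Delta T &= T_{\text{RAG}} - T_{\text{Zero-Shot}} = 10.89 - 10.28 = 0.61\ \text{s} \\
         &= \underbrace{1.11}_{\text{retrieval overhead}}
          - \underbrace{(10.28 - 9.78)}_{\text{inference saving}}
          = 1.11 - 0.50
\end{align*}
```

In other words, roughly 0.5 s of the 1.11 s retrieval cost is recovered through shorter reasoning in the inference stage.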

4.6. Optimal RAG Pipeline and Comparison with Qwen3-32B and Gemma-3-27b-it

Figure 6 summarizes the trade-offs among pipeline variants by plotting average accuracy against average processing time per question. The figure highlights three key points. First, the Pareto frontiers for the nothink and think modes are structurally similar: Zero-Shot occupies the low-cost, low-accuracy, low-latency corner; Naïve RAG sits at the most balanced midpoint; and Gemini-3-Flash refinement sits at the opposite end from Zero-Shot, achieving the highest accuracy but with significantly more cost and latency. Second, GPT-5-Nano is strictly Pareto-dominated by Gemini-3-Flash in both modes, offering lower accuracy at similar latency and cost, making it an unsuitable choice for this workload. Third, the Qolda-SFT refiner has nearly equivalent performance to the plain Qolda refiner.

Qolda-SFT’s main advantage is operational independence rather than absolute accuracy. In terms of accuracy, Qolda-SFT and Naïve RAG perform similarly, with neither consistently outperforming the other. Nonetheless, Qolda-SFT is well-suited for privacy-sensitive deployments where sending Kazakh-language queries to third-party APIs is not acceptable, and for cost-sensitive scenarios where Gemini-3-Flash’s per-query fees are prohibitive at scale. Naïve RAG, Qolda-SFT refinement, and Gemini-3-Flash refinement are therefore complementary, each addressing different priorities in the trade-off between accuracy, latency, and cost.

The evaluation of the different retrieval strategies demonstrates that the Naïve RAG approach offers the most practical balance of performance, inference speed, and architectural simplicity for the Qolda model. Although query refinement via Gemini-3-Flash achieves the highest absolute accuracy, its margin over unrefined Naïve RAG is modest, about 3 percentage points on average. Furthermore, using the base Qolda model for self-refinement or the specialized Qolda-SFT distillation provided no significant performance advantage over simply submitting the raw prompt. Although query refinement produces more concise search strings, its effect on the total input word count and the resulting context size is small across all RAG configurations. In terms of latency, Naïve RAG adds approximately 1 s of retrieval overhead per question. This is more time-efficient than Query-Refined RAG pipelines that use external refiners such as Gemini-3-Flash, which can add up to 4 s of additional latency. Thus, among the evaluated options, direct integration of web search is the most practical choice, as it avoids the latency of an additional model inference step while retaining most of the retrieval-augmented performance gains.

Table 13 summarizes the paired results of McNemar’s test comparing Naïve RAG and the Zero-Shot baseline across three benchmarks and both reasoning modes. Based on these results, Naïve RAG significantly outperforms Zero-Shot in five out of six benchmark–mode combinations (p < 0.0001), with the highest test statistics for KazMMLU in nothink (χ² = 2937.28) and think (χ² = 1465.41) modes. This indicates that accuracy gains on Kazakh-language benchmarks are robust to sampling variation. The exception is MMLU-Pro in think mode, where p = 0.0978 shows no significant difference. This aligns with earlier findings that web search provides limited benefit for advanced reasoning tasks when the model already reasons extensively. The contingency matrix for this case shows 8403 questions where Zero-Shot was correct, and Naïve RAG was wrong, and 8620 where the reverse was true, resulting in a near-balanced disagreement and a marginal p-value. In contrast, for MMLU-Pro in nothink mode, the test yields χ² = 234.75 at p < 0.0001, highlighting the greater benefit of retrieved context when parametric reasoning is not used. Overall, these results confirm that Naïve RAG provides consistent, sampling-independent improvements on Kazakh-language tasks in both reasoning modes, while its benefit on MMLU-Pro depends on the reasoning mode.
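The MMLU-Pro think-mode result can be reproduced from the two discordant counts quoted above (8403 and 8620). The sketch below is a standard-library illustration rather than the study's code; the function name and the `exact_below` threshold of 25 discordant pairs are assumptions, since the text does not state the cutoff used for switching to the exact test.

```python
import math

def mcnemar(b, c, exact_below=25):
    """McNemar's test on discordant counts b and c.

    Returns (statistic, p_value, method). The switch-over threshold
    `exact_below` is an assumed value, not one taken from the paper.
    """
    n = b + c
    if n == 0:
        return 0.0, 1.0, "none"
    if n < exact_below:
        # Exact two-sided binomial test with success probability 0.5.
        k = min(b, c)
        tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
        return float(k), min(1.0, 2 * tail), "exact"
    # Chi-square approximation (1 degree of freedom) with continuity correction.
    stat = (abs(b - c) - 1) ** 2 / n
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value, "chi2"

stat, p_value, method = mcnemar(8403, 8620)
print(f"{method}: statistic = {stat:.2f}, p = {p_value:.4f}")
```

With these counts the continuity-corrected statistic is about 2.74, and the resulting p-value rounds to the reported 0.0978, which suggests the table values are consistent with this form of the test.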

This study examines two architectural choices: whether to retrieve at all and whether to refine the query before retrieval. The difference between Zero-Shot and Naïve RAG (for example, +16.18 points on KazCulture in nothink mode) reflects the combined impact of web retrieval and context assembly. In contrast, the difference between Naïve RAG and the best Query-Refined RAG configuration (about +3 points on average with Gemini-3-Flash) shows the added benefit of query refinement alone. This analysis leads to two key findings. First, retrieval and context assembly account for most of the observed improvement, contributing about five times as much as query refinement. Second, Table 11 indicates that about 40% of Naïve RAG retrievals are already classified as Irrelevant, which limits the potential of any query-refinement strategy. A refiner can rephrase a query but cannot generate target-language content that is missing from the web index. Therefore, the limited gains from query refinement are not due to the refiner models themselves but are a structural feature of retrieval in low-resource languages. Among the evaluated configurations, Naïve RAG provides a favorable balance of accuracy, latency, and cost; hence, we recommend it as the default out of the specific options we considered. Hybrid sparse and dense retrieval, reranking, relevance filtering, and cross-lingual retrieval were not assessed and may yield additional improvements.

To evaluate the impact of Naïve RAG integration, we generated confusion matrices for the KazMMLU and KazCulture benchmarks (see Figure 7). The results show that Naïve RAG and Zero-Shot each address errors missed by the other. On KazMMLU (think mode), Naïve RAG correctly answered 27.0% (6190 items) that Zero-Shot missed, but lost 11.4% (2600 items) that Zero-Shot had answered correctly, likely due to retrieval noise. A similar trend appears on KazCulture: in nothink mode, Naïve RAG gained 25.4% (339 items) while losing 9.2% (123 items), and in think mode, gained 32.9% (439 items) while losing 16.9% (226 items). The positive net difference across all settings indicates that web search integration is broadly beneficial, particularly for culturally specific content where the model’s parametric knowledge is limited.
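The net effect implied by these gained and lost counts can be checked with a few lines of arithmetic. The helper name is ours, and the KazCulture test-set size of 1334 questions is the figure given in Section 4.3.

```python
def net_change(gained, lost, total):
    """Percentage-point accuracy change implied by discordant counts."""
    return 100 * (gained - lost) / total

KAZCULTURE_SIZE = 1334  # test-set size reported in Section 4.3

nothink = net_change(339, 123, KAZCULTURE_SIZE)
think = net_change(439, 226, KAZCULTURE_SIZE)
print(f"nothink: {nothink:+.2f} points, think: {think:+.2f} points")
```

The nothink value of about +16.2 points agrees, up to rounding, with the 49.48% to 65.66% improvement reported for KazCulture, so the confusion-matrix counts and the accuracy table are mutually consistent.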

Having identified Naïve RAG as the most favorable integration strategy among the evaluated pipelines, we compare it with larger models. Table 5 shows that, when enhanced with web search and operating in think mode, the 4-billion-parameter Qolda model consistently outperforms much larger open-source models, including Qwen3-32B and Gemma-3-27b-it. We exclude MMLU-Pro from this comparison because web search has limited effect on its Kazakh and Russian subsets (see Table 6, MMLU-Pro KK and RU columns). On the KazMMLU and KazCulture benchmarks, Qolda in Naïve RAG think mode achieved an average accuracy of 76.00%, higher than Qwen3-32B (64.72%) and Gemma-3-27b-it (60.24%). The difference is especially evident on the KazCulture benchmark, where Naïve RAG (65.66%) surpasses the 32B model (40.14%) by over 25 percentage points. This comparison is conducted at the system level rather than the parameter level. Qwen3-32B and Gemma-3-27b-it are evaluated in their standard Zero-Shot configurations, while Qolda is assessed with Naïve RAG. These results show that a well-designed SLM-plus-retrieval pipeline can be a practical alternative to much larger models in default inference mode. This approach provides a realistic benchmark for deployment decisions, especially since adding retrieval to a 32B-class model would require additional infrastructure. We do not claim that retrieval-augmented Qolda is inherently stronger than Qwen3-32B or Gemma-3-27b-it on a parameter-for-parameter basis. The retrieval-quality breakdown in Table 11 indicates that 37.7% of Naïve RAG queries return explicit snippets containing the answer, so part of Qolda’s advantage comes from direct evidence exposure. However, the above-chance accuracy with Supportive and Irrelevant snippets shows that pure lookup is not the only factor. 
On the Kazakh-language and Kazakh-cultural benchmarks evaluated here, a retrieval-augmented 4B-parameter model matches or exceeds the accuracy of Zero-Shot 27B and 32B open-source models, indicating that external knowledge retrieval can narrow the parameter gap in this specific setting. We do not claim that this pattern generalizes to arbitrary task domains; Section 5 discusses a language-asymmetric counterexample on MMLU-Pro.

5. Discussion

The experimental results show that using raw benchmark questions as search queries yields performance comparable to that of refined queries generated by advanced LLMs. For localized academic benchmarks, maintaining the full linguistic context of the original question appears to be about as effective for retrieval as keyword-based optimization. This study demonstrates that web search integration enhances the performance of SLMs in non-English languages, such as Kazakh, helping bridge knowledge gaps in smaller, localized models.

The Qolda model demonstrates strong performance on the KazCulture benchmark in the system-level comparison. In Zero-Shot mode, Qolda surpasses the Zero-Shot accuracy of larger open-source models such as Qwen3-32B and Gemma-3-27b-it on KazCulture, with this performance gap further widening when web search is additionally integrated. For KazMMLU, real-time retrieval enables the 4B-parameter Qolda to outperform the 27B and 32B baselines on the same Kazakh-language subset. However, two caveats apply here. First, the larger baselines did not use any retrieval. A fair comparison with identical Naïve RAG retrieval for all models would likely reduce or eliminate the observed gap. Second, 37.7% of Naïve RAG queries return snippets containing the explicit answer. Therefore, much of the retrieval-augmented advantage comes from direct evidence rather than improved generalization. Thus, the specific claim of this work is as follows: retrieval-augmented deployment of a 4B-parameter Kazakh-centric SLM is competitive with the default Zero-Shot deployment of 27B–32B open-source baselines on Kazakh-language benchmarks. This finding is relevant for resource-constrained Kazakh-language AI deployments but does not imply parameter-for-parameter model superiority.

The effectiveness of web search varies significantly by task type and language. Table 6 demonstrates a clear language asymmetry on MMLU-Pro: moving from Zero-Shot to Naïve RAG increases accuracy on the English subset from 42.12% to 51.44% in nothink mode and from 65.00% to 68.01% in think mode. In contrast, the effect on the Kazakh and Russian subsets is minimal or slightly negative. In nothink mode, Kazakh drops from 33.36% to 33.01%, and Russian rises only from 37.50% to 39.24%. This pattern is consistent across all query-refinement variants and reasoning modes. We hypothesize two mechanisms that may jointly explain this asymmetry. First, MMLU-Pro emphasizes advanced STEM, law, business, and health topics, where English-language expert content dominates the web [39]. Under this hypothesis, Kazakh and Russian translations of MMLU-Pro often generate queries whose answers are primarily in English sources that Serper is unlikely to retrieve when the query is in Kazakh or Russian. Second, machine translation of MMLU-Pro into Kazakh and Russian often introduces terminological shifts and unnatural phrasing [40]. This may reduce retrieval precision for terms not found in natural Kazakh or Russian web content, though the model’s parametric reasoning is largely unaffected.

A logical next step is to implement a cross-lingual retrieval pipeline for advanced reasoning questions. This would involve translating Kazakh or Russian queries into English before submitting them to the search API, retrieving English-language snippets, and then either providing them directly to the evaluator (since Qolda supports English input) or translating the snippets back into the original language. If this approach narrows the MMLU-Pro gap for Kazakh and Russian, it would confirm that the main limitation is the availability of indexed content in the target language rather than the retrieval-augmented pipeline itself. This is particularly promising for deploying Kazakh-centric SLMs on advanced technical topics, where local web indices are unlikely to match English resources in the near term. Beyond cross-lingual retrieval, it is important to note that this study focused primarily on culture-related and school-level questions. While KazMMLU and MMLU-Pro include STEM and Engineering categories, the impact of web search on more specialized or complex tasks requires further investigation.
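A minimal sketch of the proposed cross-lingual flow is shown below. It was not implemented or evaluated in this study: the function name and the three injected callables (`to_english`, `search`, `back_translate`) are hypothetical placeholders for a translation model and a search client.

```python
from typing import Callable, List, Optional

def cross_lingual_retrieve(
    question: str,
    to_english: Callable[[str], str],
    search: Callable[[str], List[str]],
    back_translate: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Translate the query to English, search, and optionally translate snippets back."""
    english_query = to_english(question)
    snippets = search(english_query)
    if back_translate is None:
        # Option 1: pass English snippets straight to the evaluator.
        return snippets
    # Option 2: return snippets in the original language of the question.
    return [back_translate(s) for s in snippets]

# Stub components (hypothetical) that only demonstrate the data flow.
snippets = cross_lingual_retrieve(
    "сұрақ",
    to_english=lambda q: "question",
    search=lambda q: [f"snippet about {q}"],
)
print(snippets)  # ['snippet about question']
```

Injecting the components as callables keeps the two variants (English snippets versus back-translated snippets) comparable under otherwise identical conditions.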

While there are clear performance gains in cultural and high-school contexts, the main drawback of this approach is its operational cost. Each external retrieval incurs a search API fee, which can be significant at scale. Web search integration also introduces measurable latency: Naïve RAG adds about 1 s per query, while Query-Refined RAG with remote query refiners such as Gemini-3-Flash can increase processing time by up to 4 s per query. Providing retrieved context in think mode reduces the model’s reasoning-stage time, partially offsetting retrieval overhead. These results suggest that web search integration is valuable for specific localized domains, but it is not a universal solution. Not all question types are suitable for web-enhanced prompting, so a selective approach is necessary. A dedicated prompt classifier could determine which prompts benefit from web search and which should bypass it.

Several additional directions remain open for future work. First, we evaluated performance using answer accuracy as the sole metric. This approach is appropriate for multiple-choice benchmarks with clear ground-truth labels and aligns with the protocols of the original KazMMLU, KazCulture, and MMLU-Pro papers. However, it leaves several related questions unaddressed. Calibration, or the alignment between the model’s confidence and its actual correctness, is not assessed. Retrieval-augmented pipelines can affect calibration, leading to overconfident predictions when the retrieved context is incorrect. Robustness to adversarial or noisy retrieval is also not systematically tested. While the Misleading-snippet category in Table 11 provides an initial signal (accuracy of 24.1% in nothink mode and 41.4% in think mode under Misleading context), a targeted adversarial study that adds distractor snippets or paraphrased incorrect answers would offer a more thorough evaluation. Additionally, the snippet-quality classification in Table 11 is based on a single LLM judge (Gemini-3.1-Flash-Lite). Because single-judge evaluations cannot measure inter-judge agreement and may introduce systematic biases, especially for non-English content and borderline Supportive versus Irrelevant cases, the reported breakdown of 37.7% Explicit, 19.5% Supportive, 40.1% Irrelevant, and 2.7% Misleading should be considered indicative rather than definitive. Broader findings, such as above-chance accuracy for Supportive and Irrelevant snippets and the roughly 40% off-topic rate, are likely robust to moderate judge noise. However, using a multi-judge ensemble or a human-validated sub-sample would provide stronger evidence. Finally, uncertainty due to prompt changes, such as option shuffling or paraphrasing, is not measured. 
We therefore present our accuracy findings as evidence of the empirical viability of web-search-enhanced SLMs in Kazakh, while highlighting calibration, adversarial robustness, and uncertainty estimation as important areas for future research.
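
As a rough illustration of the calibration measurement mentioned above, the following sketch computes the expected calibration error (ECE) with equal-width confidence bins. It is not part of the evaluation reported in this study; the function name `expected_calibration_error`, the bin count, and the toy inputs are assumptions made purely for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    # Each bin holds [sum of confidences, number correct, count].
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += conf
        bins[idx][1] += int(ok)
        bins[idx][2] += 1

    total = len(confidences)
    ece = 0.0
    for conf_sum, n_correct, count in bins:
        if count == 0:
            continue
        gap = abs(n_correct / count - conf_sum / count)
        ece += (count / total) * gap
    return ece


# Toy example (hypothetical values, not results from this study).
toy_conf = [0.95, 0.95, 0.65, 0.65]
toy_correct = [True, False, True, True]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
print(f"ECE = {toy_ece:.3f}")
```

Comparing this quantity between the Zero-Shot and retrieval-augmented pipelines would indicate whether retrieved context shifts the model toward overconfidence. Doing so requires a per-question confidence score, which the accuracy-only protocol used here does not record.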

A second limitation is our exclusive focus on live web search via the Serper API. We do not assess dense retrieval, sparse lexical retrieval, or hybrid approaches that combine dense and sparse signals using Reciprocal Rank Fusion [27,28] or adaptive weighting. While these methods have shown strong results in English and other high-resource languages, their effectiveness in Kazakh depends on the availability of high-quality embedding models and indexed corpora, both of which are still under active development in the Kazakh NLP community. Web search offers the practical benefit of relying on third-party indexes, making it an attractive default for resource-constrained languages. However, a controlled comparison with dense and hybrid retrieval on a curated Kazakh corpus would provide a more comprehensive understanding of the retrieval design space for Kazakh-centric SLMs. We plan to address this comparison in future work, along with adaptive routing strategies such as Adaptive-RAG [25] that select a retrieval paradigm per query based on predicted complexity.
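
For readers unfamiliar with the hybrid approach mentioned above, the following is a minimal sketch of Reciprocal Rank Fusion, which scores each document as the sum of 1 / (k + rank) over the input rankings. It was not evaluated in this study. The function name, the document identifiers, and the two example rankings are hypothetical, and k = 60 is the constant commonly used following Cormack et al. [28].

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from a dense retriever and a sparse (lexical) retriever.
dense_ranking = ["doc_b", "doc_a", "doc_c"]
sparse_ranking = ["doc_a", "doc_d", "doc_b"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Because the fusion uses only rank positions, it needs no score normalization across retrievers, which is one reason it is a common baseline for hybrid retrieval.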

6. Conclusions

This study investigated the impact of real-time web search integration on SLM performance in the Kazakh-language context. By evaluating the 4B-parameter Qolda model on the KazMMLU, KazCulture, and MMLU-Pro benchmarks, we showed that a retrieval-augmented deployment of this model narrows its deployment-level accuracy gap with the Zero-Shot baselines of substantially larger open-source models on Kazakh-language tasks. We also evaluated specialized query refinement, including supervised fine-tuning (SFT) distillation from Gemini-3-Flash, on the hypothesis that it would outperform unrefined queries. Nevertheless, the Naïve RAG approach proved the most practical, consistently delivering strong accuracy gains with less architectural complexity and lower inference latency. On the matched Kazakh-language subset (KazMMLU-KK, KazMMLU-RU, and KazCulture), in reasoning-enabled mode, the augmented Qolda model achieved an average accuracy of 76.00%, exceeding the Zero-Shot accuracy of larger open-source models, namely Qwen3-32B (64.72%) and Gemma-3-27b-it (60.24%), evaluated on the same subsets. It is important to note that the larger models were not equipped with any retrieval augmentation. Web search integration had a notable impact on localized benchmarks such as KazCulture, yielding an improvement of 16.18 percentage points over the Zero-Shot baseline in nothink mode, from 49.48% to 65.66%. However, accuracy on KazCulture was lower in think mode than in nothink mode, which suggests that extended reasoning chains can sometimes introduce noise when addressing facts with specific localized nuances. On MMLU-Pro, retrieval benefits were concentrated in the English subset and were limited or negative for Kazakh and Russian, indicating that gains from web search are domain- and language-dependent rather than universal.

We clarify that our comparison with the 27B–32B baselines is at the system level, not the parameter level. The larger models were evaluated in their default Zero-Shot configuration without retrieval. A retrieval-matched comparison, in which all models use identical retrieval, would likely reduce the observed performance gap. Our retrieval-quality analysis also indicates that a significant portion of the retrieval-augmented advantage comes from direct answer exposure through explicit-answer snippets.

The accuracy gains result in a measurable increase in latency. Naïve RAG adds about 1 s per query for web retrieval, and remote query refiners can introduce up to 4 s of additional overhead. However, retrieved context reduces model reasoning time in think mode, which partially offsets this latency. Web search integration increases input word counts nearly nine times compared to the baseline. Despite this overhead, our results show that retrieval-augmented SLMs offer a practical alternative to large-scale models for Kazakh cultural and educational tasks. These findings also encourage further research into selective and cross-lingual retrieval strategies. Future research should aim to improve the relevance of retrieved snippets, thereby reducing computational costs while maintaining high linguistic fidelity.

Author Contributions

Conceptualization, A.M. and H.A.V.; methodology, A.M. and N.M.; software, N.M.; validation, A.M.; investigation, A.M. and N.M.; resources, H.A.V.; data curation, A.M.; writing—original draft preparation, A.M. and N.M.; writing—review and editing, H.A.V.; visualization, N.M.; supervision, H.A.V.; project administration, H.A.V.; funding acquisition, H.A.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by the Committee of Science of the Ministry of Science and Higher Education of the Republic of Kazakhstan (Grant No. BR24993001).

Data Availability Statement

The data presented in this study are available on the Hugging Face platform. These data were derived from the following resources available in the public domain: KazCulture (https://huggingface.co/datasets/issai/KazCulture (accessed on 18 January 2026)), KazMMLU (https://huggingface.co/datasets/MBZUAI/KazMMLU (accessed on 19 January 2026)), MMLU-Pro (https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro), and MMLU-Pro Kazakh/Russian (https://huggingface.co/datasets/issai/MMLU-Pro_Kazakh_Russian (accessed on 20 January 2026)).

Acknowledgments

The authors would like to thank Madina Mansurova for supervising the grant that facilitated this research. The authors would like to acknowledge the use of Gemini-3-Pro (Google DeepMind) as a support tool during the preparation of this manuscript. Gemini-3-Pro was used solely to assist with debugging and refining the Python 3.12 code for confusion matrix visualization and Pareto chart creation. All research content, methodology, findings, and conclusions remain entirely the authors’ own work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Zhang, Q.; Liu, Z.; Pan, S. The Rise of Small Language Models. IEEE Intell. Syst. 2025, 40, 30–37.
2. Nguyen, C.V.; Shen, X.; Aponte, R.; Xia, Y.; Basu, S.; Hu, Z.; Chen, J.; Parmar, M.; Kunapuli, S.; Barrow, J.; et al. A Survey on Small Language Models. In Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing—Natural Language Processing in the Generative AI Era, Varna, Bulgaria, 8–10 September 2025; INCOMA Ltd.: Shoumen, Bulgaria, 2025; pp. 807–821. Available online: https://aclanthology.org/2025.ranlp-1.93/ (accessed on 20 February 2026).
3. Wang, F.; Zhang, Z.; Zhang, X.; Wu, Z.; Mo, T.; Lu, Q.; Wang, W.; Li, R.; Xu, J.; Tang, X.; et al. A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness. ACM Trans. Intell. Syst. Technol. 2025, 16, 145.
4. Belcak, P.; Heinrich, G.; Diao, S.; Fu, Y.; Dong, X.; Muralidharan, S.; Lin, Y.C.; Molchanov, P. Small Language Models are the Future of Agentic AI. arXiv 2025, arXiv:2506.02153.
5. Bharadwaj, A.; Jain, K. RAG-Assisted Small Language Models for Domain-Level Reasoning. In Proceedings of the 2025 Eighth International Conference on Image Information Processing (ICIIP), Solan, India, 27–29 November 2025; pp. 231–236.
6. Liu, S.; Yu, Z.; Huang, F.; Bulbulia, Y.; Bergen, A.; Liut, M. Can Small Language Models With Retrieval-Augmented Generation Replace Large Language Models When Learning Computer Science? In ITiCSE 2024: Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1; Association for Computing Machinery: New York, NY, USA, 2024; pp. 388–393.
7. Xiong, H.; Bian, J.; Li, Y.; Li, X.; Du, M.; Wang, S.; Yin, D.; Helal, S. When Search Engine Services Meet Large Language Models: Visions and Challenges. IEEE Trans. Serv. Comput. 2024, 17, 4558–4577.
8. Zhu, Y.; Yuan, H.; Wang, S.; Liu, J.; Liu, W.; Deng, C.; Chen, H.; Liu, Z.; Dou, Z.; Wen, J.R. Large Language Models for Information Retrieval: A Survey. ACM Trans. Inf. Syst. 2025, 44, 12.
9. Institute of Smart Systems and Artificial Intelligence. Kazakh Large Language Model (ISSAI KAZ-LLM). 2024. Available online: https://huggingface.co/collections/issai/issai-kazllm-10-6732d58c81bcaf177442c362 (accessed on 15 January 2026).
10. Koto, F.; Joshi, R.; Mukhituly, N.; Wang, Y.; Xie, Z.; Pal, R.; Orel, D.; Mullah, P.; Turmakhan, D.; Goloburda, M.; et al. Sherkala-Chat: Building a State-of-the-Art LLM for Kazakh in a Moderately Resourced Setting. arXiv 2025, arXiv:2503.01493.
11. Kessikbayeva, G.; Cicekli, I. Rule Based Morphological Analyzer of Kazakh Language. In Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM, Baltimore, MD, USA, 27 June 2014; Çetinoğlu, Ö., Heinz, J., Maletti, A., Riggle, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 46–54.
12. Arystanbekov, B.; Nurimanov, A.; Maxutov, A.; Albrekht, V.; Kuzdeuov, A.; Varol, H.A. Qolda: A Small Vision–Language Model for the Kazakh Language. IEEE Access 2026, 14, 46392–46414.
13. Togmanov, M.; Mukhituly, N.; Turmakhan, D.; Mansurov, J.; Goloburda, M.; Sakip, A.; Xie, Z.; Wang, Y.; Syzdykov, B.; Laiyk, N.; et al. KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; pp. 14403–14416.
14. Umbet, S.; Murzakhmetov, S.; Sagyndyk, B.; Yakunin, K.; Akishev, T.; Zubitski, P. KazBench-KK: A Cultural-Knowledge Benchmark for Kazakh. In Proceedings of the Fourth Workshop on NLP Applications to Field Linguistics, Vienna, Austria, 1 August 2025; pp. 38–57. Available online: https://aclanthology.org/2025.fieldmatters-1.4/ (accessed on 20 February 2026).
15. Yeshpanov, R.; Efimov, P.; Boytsov, L.; Shalkarbayuli, A.; Braslavski, P. KazQAD: Kazakh Open-Domain Question Answering Dataset. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 9645–9656. Available online: https://aclanthology.org/2024.lrec-main.843/ (accessed on 20 February 2026).
16. Goloburda, M.; Laiyk, N.; Turmakhan, D.; Wang, Y.; Togmanov, M.; Mansurov, J.; Sametov, A.; Mukhituly, N.; Wang, M.; Orel, D.; et al. Qorǵau: Evaluating Safety in Kazakh-Russian Bilingual Contexts. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 27 July–1 August 2025; pp. 9765–9784.
17. Maxutov, A.; Arystanbekov, B.; Makhataeva, Z.; Yergen, A.; Taizhanov, N.; Nauryzbaikyzy, G.; Varol, H.A. Introducing Cultural Knowledge in Language Models: KazCulture Dataset for Kazakh Culture. IEEE Access 2026, 14, 44027–44042.
18. Vu, T.; Iyyer, M.; Wang, X.; Constant, N.; Wei, J.; Wei, J.; Tar, C.; Sung, Y.H.; Zhou, D.; Le, Q.; et al. FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 13697–13720.
19. Peng, B.; Galley, M.; He, P.; Cheng, H.; Xie, Y.; Hu, Y.; Huang, Q.; Liden, L.; Yu, Z.; Chen, W.; et al. Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback. arXiv 2023, arXiv:2302.12813.
20. Cheung, T.H.; Lam, K.M. FactLLaMA: Optimizing Instruction-Following Language Models with External Knowledge for Automated Fact-Checking. In Proceedings of the 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Taipei, Taiwan, 31 October–3 November 2023; pp. 846–853.
21. Xie, W.; Liang, X.; Liu, Y.; Ni, K.; Cheng, H.; Hu, Z. WeKnow-RAG: An Adaptive Approach for Retrieval-Augmented Generation Integrating Web Search and Knowledge Graphs. arXiv 2024, arXiv:2408.07611.
22. Liu, X.; Lai, H.; Yu, H.; Xu, Y.; Zeng, A.; Du, Z.; Zhang, P.; Dong, Y.; Tang, J. WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences. In KDD ’23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2023; pp. 4549–4560.
23. Wang, S.; Khramtsova, E.; Zhuang, S.; Zuccon, G. FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation. In SIGIR ’24: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval; Association for Computing Machinery: New York, NY, USA, 2024; pp. 763–773.
24. Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; Hajishirzi, H. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv 2023, arXiv:2310.11511.
25. Jeong, S.; Baek, J.; Cho, S.; Hwang, S.J.; Park, J. Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 16–21 June 2024; pp. 7036–7050.
26. Yan, S.Q.; Gu, J.C.; Zhu, Y.; Ling, Z.H. Corrective Retrieval Augmented Generation. arXiv 2024, arXiv:2401.15884.
27. Gao, L.; Ma, X.; Lin, J.; Callan, J. Precise Zero-Shot Dense Retrieval without Relevance Labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 1762–1777.
28. Cormack, G.V.; Clarke, C.L.A.; Buettcher, S. Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. In SIGIR ’09: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval; Association for Computing Machinery: New York, NY, USA, 2009; pp. 758–759.
29. Mansurova, A.; Tleubayeva, A.; Nugumanova, A.; Shomanov, A.; Seker, S.E. A Systematic Evaluation of Large Language Models and Retrieval-Augmented Generation for the Task of Kazakh Question Answering. Information 2025, 16, 943.
30. Tleubayeva, A.; Mansurova, A.; Aubakirov, S.; Tabuldin, A.; Shomanov, A.; Makhambetova, Z. Multilingual QA-RAG: Evaluating LLMs’ Contradiction Handling in English and Kazakh. In Proceedings of the 2025 IEEE/ACIS 29th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), Busan, Republic of Korea, 25–27 June 2025; pp. 322–327.
31. Serper. Serper.dev: The Fastest Google Search API. 2026. Available online: https://serper.dev/ (accessed on 24 February 2026).
32. SearchAPI. SearchAPI: Real-time SERP API for Google and other Search Engines. 2026. Available online: https://www.searchapi.io/ (accessed on 24 February 2026).
33. Brave Software. Brave Search API: Privacy-focused Web Search. 2026. Available online: https://brave.com/search/api/ (accessed on 24 February 2026).
34. LangSearch. LangSearch Documentation. 2026. Available online: https://docs.langsearch.com/ (accessed on 24 February 2026).
35. Perplexity AI. Perplexity AI Documentation. 2026. Available online: https://docs.perplexity.ai/ (accessed on 24 February 2026).
36. Google Cloud. Vertex AI Search and Conversation. 2026. Available online: https://cloud.google.com/use-cases/site-search (accessed on 24 February 2026).
37. Wang, Y.; Ma, X.; Zhang, G.; Ni, Y.; Chandra, A.; Guo, S.; Ren, W.; Arulraj, A.; He, X.; Jiang, Z.; et al. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2024; Volume 37, pp. 95266–95290.
38. Institute of Smart Systems and Artificial Intelligence (ISSAI). MMLU-Pro Kazakh Russian Dataset. 2026. Available online: https://huggingface.co/datasets/issai/MMLU-Pro_Kazakh_Russian (accessed on 24 February 2026).
39. Joshi, P.; Santy, S.; Budhiraja, A.; Bali, K.; Choudhury, M. The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 6282–6293.
40. Plaza, I.; Melero, N.; del Pozo, C.; Conde, J.; Reviriego, P.; Mayor-Rocher, M.; Grandury, M. Spanish and LLM Benchmarks: Is MMLU Lost in Translation? arXiv 2024, arXiv:2406.17789.

Figure 1. Our methodology for evaluating web search integration in a Kazakh-centric SLM. Pipeline (A) illustrates a standard SLM or LLM interaction. Pipeline (B) shows the Naïve RAG approach. An alternative approach is presented in Pipeline (C), where the input prompt is refined as a search query.

Figure 2. Excerpt from a sample Serper API response in the Kazakh language in JSON format. Ellipsis inside a red rectangle denotes a hidden part. The URLs in the displayed response (https://vk.com/wall-103226120_100318, https://muslim.kz/article/atqa-minu-balaga-payidaly) were retrieved via the Serper API on 20 January 2026.

Figure 3. Prompt instruction used for the query refiner model within the Query-Refined RAG pipeline.

Figure 4. Prompt instruction used for the answer-selection step in the multiple-choice evaluation pipeline. The retrieved web search snippets are concatenated into a single string and inserted into the Context field as the passage; when no retrieval is performed (e.g., in the Zero-Shot setting), the Context field is left empty.

Figure 5. Prompt instruction given to the judge LLM (Gemini-3.1-Flash-Lite) to assess the quality of retrieved search snippets in the Naïve RAG pipeline.

Figure 6. Pareto chart of the cost–accuracy–latency trade-off across pipelines.

Figure 7. Confusion matrix illustrating the impact of Naïve RAG (web search) on the KazMMLU (A,B) and KazCulture (C,D) benchmarks compared to the Zero-Shot baseline. Panels represent the performance of the Qolda evaluator in (A,C) nothink mode and (B,D) think mode.

Table 1. Comparison of web search API providers. The response times were taken from the respective documentation pages.

Search Service	Search Engine	Response Time	Cost per 1K Queries
Serper API [31]	Google	~2 s	$0.50–$1.00 *
SearchAPI [32]	Multiple	~2 s	$2.00–$4.00 *
Brave API [33]	Brave	~1 s	$5.00
LangSearch [34]	LangSearch	~1 s	$5.00
Perplexity [35]	Perplexity	~1 s	$5.00
Google Vertex AI Search [36]	Google	~1 s	$4.00

* Pricing varies based on selected subscription tier.

Table 2. Summary of the benchmarking datasets. In this study, only the test sets were used.

Benchmark	Questions	Languages	Domains
MMLU-Pro [37,38]	12,032 (in each lang.)	English, Russian, Kazakh	Advanced Reasoning in STEM, Law, Business, and Health
KazMMLU [13]	22,889	Kazakh, Russian	STEM, Social Sciences, Humanities, and Regional Knowledge
KazCulture [17]	1334	Kazakh	Kazakh Traditions, History, Cuisine, and National Games

Table 3. Examples of questions and the corresponding options from the benchmarking datasets with their English translations. In the options, the correct one is highlighted in bold.

Original Question (Kazakh)	Translation (English)
MMLU-Pro
Динамoметр ваттметрінің жүретін катушкасы тізбегіндегі кедергі қандай бoлуы керек? (A) Төмен. (B) Өте төмен. (C) Жoғары. (D) Дерлік нөл.	The resistance in the circuit of the moving coil of a dynamometer wattmeter should be (A) Low. (B) Very low. (C) High. (D) Almost zero.
KazMMLU
Тарихта бoлған адам екені дәлелденіп, Ақтөбе oблысында тарихи ескерткіші қoйылған батыр (A) Ер Қoсай. (B) Қамбар батыр. (C) Ер Тарғын. (D) Қoбыланды батыр.	A batyr (hero) whose historical existence has been proven and whose historical monument has been erected in the Aqtöbe region (A) Er Qosai. (B) Qambar batyr. (C) Er Tarğyn. (D) Qobylandy batyr.
KazCulture
Сүт тағамдарын қазақтар қалай бір сөзбен атайды? (A) Қымыз. (B) Қаймақ. (C) Сүт. (D) Ақ.	What single word do Kazakhs use for dairy products? (A) Qymyz (Kumys). (B) Qaimaq (Dairy cream). (C) Milk. (D) White.

Table 4. Performance comparison of Kazakh-centered SLMs across multilingual and regional benchmarks. All the values are reported as accuracy percentages (%).

Model	Mode	Average	MMLU-Pro EN	MMLU-Pro KK	MMLU-Pro RU	KazMMLU KK	KazMMLU RU	KazCulture
KazLLM (8B)	-	39.39	37.24	25.98	27.96	52.60	55.15	37.41
Sherkala (8B)	-	39.68	34.58	27.03	30.72	55.01	53.01	37.75
Qolda (4B)	nothink	45.21	42.12	33.36	37.50	51.93	56.85	49.48
Qolda (4B)	think	61.24	65.00	58.39	62.95	67.10	67.38	46.63

Table 5. Performance comparison on Kazakh-language benchmarks (KazMMLU and KazCulture) across the Zero-Shot, Naïve RAG, and Query-Refined RAG pipelines. All values are accuracy (%). Deltas in parentheses indicate improvement (+) or decrease (−) relative to the baseline of the same mode. The Average column is the mean over KazMMLU-KK, KazMMLU-RU, and KazCulture. ref. denotes the query refiner model and eval. denotes the evaluator model.

Method	Mode	Average	KazMMLU KK	KazMMLU RU	KazCulture
Baseline (Qolda Zero-Shot)	nothink	52.75	51.93	56.85	49.48
Baseline (Qolda Zero-Shot)	think	60.37	67.10	67.38	46.63
Naïve RAG Qolda	nothink	71.98 (+19.23)	71.91 (+20.00)	78.36 (+21.51)	65.66 (+16.18)
Naïve RAG Qolda	think	76.00 (+15.63)	80.95 (+13.85)	84.45 (+17.07)	62.59 (+15.96)
Query Refiner Model in Query-Refined RAG
GPT-5-Nano ref. Qolda eval.	nothink	66.01 (+13.26)	66.45 (+14.52)	75.27 (+18.42)	56.30 (+6.82)
GPT-5-Nano ref. Qolda eval.	think	71.14 (+10.77)	77.44 (+10.34)	81.87 (+14.49)	54.12 (+7.49)
Gemini-3-Flash ref. Qolda eval.	nothink	74.76 (+22.01)	74.73 (+22.80)	78.95 (+22.10)	70.61 (+21.13)
Gemini-3-Flash ref. Qolda eval.	think	79.46 (+19.09)	83.67 (+16.57)	85.06 (+17.68)	69.64 (+23.01)
Qolda ref. Qolda eval.	nothink	68.23 (+15.48)	69.20 (+17.27)	72.06 (+15.21)	63.42 (+13.94)
Qolda ref. Qolda eval.	think	73.16 (+12.79)	79.20 (+12.10)	79.27 (+11.89)	61.02 (+14.39)
Qolda-SFT ref. Qolda eval.	nothink	71.42 (+18.67)	72.42 (+20.49)	75.49 (+18.64)	66.34 (+16.86)
Qolda-SFT ref. Qolda eval.	think	76.19 (+15.82)	81.75 (+14.65)	82.19 (+14.81)	64.62 (+17.99)
Open-Source LLM Baselines
Qwen3-32B	nothink	60.28 (+7.53)	67.81 (+15.88)	72.88 (+16.03)	40.14 (−9.34)
Qwen3-32B	think	64.72 (+4.35)	75.48 (+8.38)	79.39 (+12.01)	39.28 (−7.35)
Gemma-3-27b-it	-	60.24 (+7.49)	64.89 (+12.96)	68.69 (+11.84)	47.13 (−2.35)

Table 6. Performance comparison on MMLU-Pro across the Zero-Shot, Naïve RAG, and Query-Refined RAG pipelines, evaluated in English (EN), Kazakh (KK), and Russian (RU). All values are accuracy (%). Deltas in parentheses indicate improvement (+) or decrease (−) relative to the baseline of the same mode. The Average column is the mean over EN, KK, and RU. ref. denotes the query refiner model and eval. denotes the evaluator model.

Method	Mode	Average	MMLU-Pro EN	MMLU-Pro KK	MMLU-Pro RU
Baseline (Qolda Zero-Shot)	nothink	37.66	42.12	33.36	37.50
Baseline (Qolda Zero-Shot)	think	62.11	65.00	58.39	62.95
Naïve RAG Qolda	nothink	41.23 (+3.57)	51.44 (+9.32)	33.01 (−0.35)	39.24 (+1.74)
Naïve RAG Qolda	think	62.71 (+0.60)	68.01 (+3.01)	57.53 (−0.86)	62.60 (−0.35)
Query Refiner Model in Query-Refined RAG
GPT-5-Nano ref. Qolda eval.	nothink	43.51 (+5.85)	50.52 (+8.40)	37.50 (+4.14)	42.51 (+5.01)
GPT-5-Nano ref. Qolda eval.	think	63.26 (+1.15)	67.66 (+2.66)	58.85 (+0.46)	63.27 (+0.32)
Gemini-3-Flash ref. Qolda eval.	nothink	43.81 (+6.15)	53.89 (+11.77)	35.11 (+1.75)	42.42 (+4.92)
Gemini-3-Flash ref. Qolda eval.	think	64.43 (+2.32)	70.50 (+5.50)	58.63 (+0.24)	64.17 (+1.22)
Qolda ref. Qolda eval.	nothink	40.38 (+2.72)	48.97 (+6.85)	33.22 (−0.14)	38.96 (+1.46)
Qolda ref. Qolda eval.	think	61.75 (−0.36)	66.21 (+1.21)	57.15 (−1.24)	61.88 (−1.07)
Qolda-SFT ref. Qolda eval.	nothink	40.93 (+3.27)	49.94 (+7.82)	33.20 (−0.16)	39.65 (+2.15)
Qolda-SFT ref. Qolda eval.	think	62.24 (+0.13)	67.28 (+2.28)	57.22 (−1.17)	62.23 (−0.72)

Table 7. Accuracy with 95% confidence intervals across methods and reasoning modes on MMLU-Pro, KazMMLU, and KazCulture. All values are reported as percentages (%). ref. denotes the query refiner model; eval. denotes the evaluator model.

Method	Mode	MMLU-Pro	KazMMLU	KazCulture
Baseline (Qolda Zero-Shot)	nothink	37.66 [37.18, 38.16]	54.72 [54.08, 55.36]	49.48 [46.85, 52.10]
Baseline (Qolda Zero-Shot)	think	62.11 [61.61, 62.61]	67.26 [66.66, 67.86]	46.63 [43.93, 49.25]
Naïve RAG Qolda	nothink	41.23 [40.74, 41.73]	75.58 [75.02, 76.14]	65.67 [63.12, 68.22]
Naïve RAG Qolda	think	62.71 [62.21, 63.21]	82.94 [82.47, 83.42]	62.59 [59.97, 65.22]
Query Refiner Model in Query-Refined RAG
GPT-5-Nano ref. Qolda eval.	nothink	43.51 [42.99, 44.01]	71.47 [70.87, 72.05]	56.30 [53.60, 58.92]
GPT-5-Nano ref. Qolda eval.	think	63.26 [62.76, 63.75]	79.96 [79.44, 80.48]	54.12 [51.42, 56.75]
Gemini-3-Flash ref. Qolda eval.	nothink	43.81 [43.28, 44.32]	77.13 [76.58, 77.67]	70.61 [68.14, 73.01]
Gemini-3-Flash ref. Qolda eval.	think	64.43 [63.93, 64.92]	84.46 [84.00, 84.93]	69.64 [67.17, 72.11]
Qolda ref. Qolda eval.	nothink	40.38 [39.87, 40.88]	70.82 [70.24, 71.41]	63.42 [60.79, 66.04]
Qolda ref. Qolda eval.	think	61.75 [61.25, 62.25]	79.24 [78.71, 79.76]	61.02 [58.40, 63.57]
Qolda-SFT ref. Qolda eval.	nothink	40.93 [40.42, 41.43]	74.17 [73.59, 74.75]	66.34 [63.79, 68.82]
Qolda-SFT ref. Qolda eval.	think	62.24 [61.73, 62.74]	82.00 [81.50, 82.51]	64.62 [62.07, 67.24]
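
The caption of Table 7 does not restate how the intervals were obtained. As a sanity check only, the sketch below computes a normal-approximation (Wald) interval for one cell. The choice of the Wald interval is an assumption and may differ from the procedure actually used, and the function name `wald_interval` is hypothetical. The inputs are taken from the paper: 537 + 123 = 660 correct Zero-Shot nothink answers on KazCulture (Table 13) out of 1334 questions (Table 2).

```python
import math
from typing import Tuple


def wald_interval(
    n_correct: int, n_total: int, z: float = 1.96
) -> Tuple[float, float]:
    """Normal-approximation 95% CI for an accuracy, returned in percent."""
    if n_total <= 0 or not 0 <= n_correct <= n_total:
        raise ValueError("need n_total > 0 and 0 <= n_correct <= n_total")
    p = n_correct / n_total
    half_width = z * math.sqrt(p * (1.0 - p) / n_total)
    # Clip to the valid probability range before converting to percent.
    low = max(0.0, p - half_width)
    high = min(1.0, p + half_width)
    return 100.0 * low, 100.0 * high


# KazCulture, Zero-Shot baseline, nothink mode.
low, high = wald_interval(660, 1334)
print(f"{low:.2f} to {high:.2f}")
```

This gives an interval of roughly 46.8% to 52.2%, close to the reported [46.85, 52.10]. The small difference is consistent with a different interval method, such as bootstrap resampling, having been used for the table.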

Table 8. Average input size per question in word count. Deltas in parentheses indicate the increase (+) in word count relative to the baseline. ref. denotes the query refiner model and eval. denotes the evaluator model.

Method	Average	MMLU-Pro	KazMMLU	KazCulture
Baseline (Qolda Zero-Shot)	12.86	21.99	9.67	6.92
Naïve RAG Qolda	185.05 (+172.19)	178.12 (+156.13)	194.86 (+185.19)	182.17 (+175.25)
Query Refiner Model in Query-Refined RAG
GPT-5-Nano ref. Qolda eval.	180.75 (+167.89)	218.28 (+196.29)	191.36 (+181.69)	132.61 (+125.69)
Gemini-3-Flash ref. Qolda eval.	191.42 (+178.56)	186.12 (+164.13)	204.79 (+195.12)	183.36 (+176.44)
Qolda ref. Qolda eval.	189.32 (+176.46)	185.10 (+163.11)	204.97 (+195.30)	177.88 (+170.96)
Qolda-SFT ref. Qolda eval.	202.87 (+190.01)	200.45 (+178.46)	211.99 (+202.32)	196.16 (+189.24)

Table 9. Average search query size per question in word counts. Deltas in parentheses indicate the change (+ or −) relative to Naïve RAG. ref. denotes the query refiner model and eval. denotes the evaluator model.

Method	Average	MMLU-Pro	KazMMLU	KazCulture
Naïve RAG Qolda	12.86	21.99	9.67	6.92
Query Refiner Model in Query-Refined RAG
GPT-5-Nano ref. Qolda eval.	13.20 (+0.34)	16.84 (−5.15)	12.02 (+2.35)	10.73 (+3.81)
Gemini-3-Flash ref. Qolda eval.	7.21 (−5.65)	9.12 (−12.87)	6.95 (−2.72)	5.56 (−1.36)
Qolda ref. Qolda eval.	9.00 (−3.86)	12.36 (−9.63)	8.38 (−1.29)	6.26 (−0.66)
Qolda-SFT ref. Qolda eval.	6.39 (−6.47)	8.13 (−13.86)	5.73 (−3.94)	5.31 (−1.61)

Table 10. Qualitative comparison of the search query optimization strategies across various refiner models for the KazCulture benchmark. English translations are provided in italics. ref. denotes the query refiner model and eval. denotes the evaluator model.

Method	Query Transformation (KazCulture)
Naïve RAG Qolda	Жалпақ алтын, күміс бетіндегі бедерлі ернеуге түрлі түсті тастар oрнатылып жасалған білезік қалай аталады?
Naïve RAG Qolda	What is the name of a bracelet made with colorful stones set into an embossed rim on its flat gold or silver surface?
Query Refiner Model in Query-Refined RAG
GPT-5-Nano ref. Qolda eval.	бедерлі білезік атауы түрлі түсті тастар oрнатылған білезік
GPT-5-Nano ref. Qolda eval.	Embossed bracelet name bracelet with inset colorful stones
Gemini-3-Flash ref. Qolda eval.	бедерлі ернеуге түрлі түсті тастар oрнатылған білезік қалай аталады
Gemini-3-Flash ref. Qolda eval.	What is the name of a bracelet with colorful stones set into an embossed rim
Qolda ref. Qolda eval.	Түрлі түсті тастары бар бедерлі алтын білезік қалай аталады?
Qolda ref. Qolda eval.	What is the name of an embossed gold bracelet with colorful stones?
Qolda-SFT ref. Qolda eval.	жалпақ алтын күміс бедерлі білезік
Qolda-SFT ref. Qolda eval.	Flat gold silver embossed bracelet

Table 11. Web search snippet quality vs. answer correctness for the Naïve RAG pipeline under nothink and think modes. Snippets are categorized as Explicit, Supportive, Irrelevant, or Misleading based on their relationship to the true answer. Of 1200 queries, 111 returned no snippets and are excluded, yielding 1089 evaluated queries. Values are accuracy percentages (%) with raw counts in parentheses.

Mode	Result	Total	Explicit	Supportive	Irrelevant	Misleading
Total Count		1089	411	212	437	29
nothink	Correct	57.4 (625)	87.1 (358)	51.4 (109)	34.6 (151)	24.1 (7)
nothink	Incorrect	42.6 (464)	12.9 (53)	48.6 (103)	65.4 (286)	75.9 (22)
think	Correct	69.7 (759)	88.1 (362)	67.9 (144)	55.1 (241)	41.4 (12)
think	Incorrect	30.3 (330)	11.9 (49)	32.1 (68)	44.9 (196)	58.6 (17)

Table 12. Average end-to-end processing time per question, decomposed into query generation, web search, and inference stages. Values are seconds (s), reported as mean ± standard deviation over 100 queries sampled from the benchmark questions. ref. denotes the query refiner model; eval. denotes the evaluator model.

Method	Mode	Total	Query Generation	Web Search	Inference
Baseline (Qolda Zero-Shot)	nothink	$0.07 \pm 0.20$	0.00	0.00	$0.07 \pm 0.20$
Baseline (Qolda Zero-Shot)	think	$10.28 \pm 3.92$	0.00	0.00	$10.28 \pm 3.92$
Naïve RAG Qolda	nothink	$1.27 \pm 0.63$	0.00	$1.17 \pm 0.64$	$0.10 \pm 0.14$
Naïve RAG Qolda	think	$10.89 \pm 3.89$	0.00	$1.11 \pm 0.58$	$9.78 \pm 3.87$
Query Refiner Model in Query-Refined RAG
GPT-5-Nano ref. Qolda eval.	nothink	$2.58 \pm 0.90$	$1.36 \pm 0.89$	$1.12 \pm 0.52$	$0.10 \pm 0.17$
GPT-5-Nano ref. Qolda eval.	think	$11.48 \pm 3.80$	$1.37 \pm 0.71$	$1.09 \pm 0.49$	$9.02 \pm 3.80$
Gemini-3-Flash ref. Qolda eval.	nothink	$5.03 \pm 1.35$	$3.86 \pm 1.33$	$1.07 \pm 0.52$	$0.10 \pm 0.10$
Gemini-3-Flash ref. Qolda eval.	think	$14.38 \pm 3.91$	$4.08 \pm 1.27$	$1.11 \pm 0.56$	$9.19 \pm 3.81$
Qolda ref. Qolda eval.	nothink	$1.43 \pm 0.64$	$0.24 \pm 0.41$	$1.10 \pm 0.57$	$0.10 \pm 0.14$
Qolda ref. Qolda eval.	think	$11.51 \pm 3.82$	$0.23 \pm 0.39$	$1.13 \pm 0.61$	$10.16 \pm 3.81$
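
As a reading aid for Table 12, the sketch below checks that each reported total matches the sum of its stage means (up to rounding) and derives the mean overhead of each pipeline relative to the Zero-Shot baseline in think mode. It uses only the mean values from the table; the variable and function names are illustrative assumptions, and standard deviations are ignored.

```python
# Mean times in seconds from Table 12 (think mode):
# (total, query generation, web search, inference).
# Names and layout are hypothetical; only the numbers come from the table.
think_rows = {
    "Baseline (Qolda Zero-Shot)": (10.28, 0.00, 0.00, 10.28),
    "Naive RAG Qolda": (10.89, 0.00, 1.11, 9.78),
    "GPT-5-Nano ref. Qolda eval.": (11.48, 1.37, 1.09, 9.02),
    "Gemini-3-Flash ref. Qolda eval.": (14.38, 4.08, 1.11, 9.19),
    "Qolda ref. Qolda eval.": (11.51, 0.23, 1.13, 10.16),
}

def stage_sum(row):
    """Sum of the three stage means (query generation + web search + inference)."""
    return row[1] + row[2] + row[3]

def overhead_vs_baseline(rows, baseline="Baseline (Qolda Zero-Shot)"):
    """Difference in mean total time relative to the baseline, in seconds."""
    base_total = rows[baseline][0]
    return {name: round(row[0] - base_total, 2)
            for name, row in rows.items() if name != baseline}

for name, row in think_rows.items():
    # Totals and stage sums may differ by about 0.01 s because of rounding.
    print(f"{name}: total={row[0]:.2f}, stage sum={stage_sum(row):.2f}")

print(overhead_vs_baseline(think_rows))
```

Under these numbers, the think-mode overhead of Naïve RAG over the baseline is smaller than its web search time because mean inference time drops once retrieved context is available.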

Table 13. Results of McNemar’s test on the Zero-Shot Baseline and Naïve RAG pipelines. The column names in the contingency matrix denote which of the pipelines gave the correct answer. The phrase “RAG Win” denotes that Naïve RAG significantly outperformed the baseline with $p < 0.05$.

Benchmark	Mode	Both	Baseline	RAG	Neither	$\chi^{2}$	p-Value	Result
MMLU-Pro	nothink	10,711	2884	4172	18,329	234.75	<0.0001	RAG Win
MMLU-Pro	think	14,017	8403	8620	5056	2.74	0.0978	Tie
KazMMLU	nothink	11,035	1491	6265	4098	2937.28	<0.0001	RAG Win
KazMMLU	think	12,795	2600	6190	1304	1465.41	<0.0001	RAG Win
KazCulture	nothink	537	123	339	335	100.05	<0.0001	RAG Win
KazCulture	think	396	226	439	273	67.58	<0.0001	RAG Win
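
McNemar’s test depends only on the two discordant cells, that is, the questions answered correctly by exactly one pipeline. The tabulated statistics appear consistent with the continuity-corrected form $\chi^{2} = (|b - c| - 1)^{2} / (b + c)$, where $b$ and $c$ are the Baseline and RAG columns; this is an inference from the reported numbers, not a statement of the authors' exact procedure. A minimal sketch using only the Python standard library (the function name is an illustrative assumption):

```python
import math

def mcnemar(baseline_only: int, rag_only: int) -> tuple[float, float]:
    """Continuity-corrected McNemar statistic and p-value (1 degree of freedom).

    baseline_only: questions only the Zero-Shot baseline answered correctly.
    rag_only: questions only Naive RAG answered correctly.
    """
    n = baseline_only + rag_only
    chi2 = (abs(baseline_only - rag_only) - 1) ** 2 / n
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Discordant counts copied from Table 13.
chi2_nothink, p_nothink = mcnemar(2884, 4172)  # MMLU-Pro, nothink
chi2_think, p_think = mcnemar(8403, 8620)      # MMLU-Pro, think

print(f"MMLU-Pro nothink: chi2={chi2_nothink:.2f}, p={p_nothink:.4g}")
print(f"MMLU-Pro think:   chi2={chi2_think:.2f}, p={p_think:.4f}")
```

The Both and Neither columns do not enter the statistic. This is why MMLU-Pro in think mode is a tie even though more than 17,000 questions changed outcome: the changes are split almost evenly between the two directions.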

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Maxutov, A.; Medeu, N.; Varol, H.A. Web Search-Enhanced Small Language Models: A Case Study for a Kazakh-Centric Language Model. Mach. Learn. Knowl. Extr. 2026, 8, 128. https://doi.org/10.3390/make8050128

