Web Search-Enhanced Small Language Models: A Case Study for a Kazakh-Centric Language Model
Abstract
1. Introduction
- We present the first systematic empirical study of real-time web search integration for a Kazakh-centric SLM. Our evaluation covers three benchmark families: KazMMLU, KazCulture, and MMLU-Pro. We compare several query-handling strategies, including sending raw questions directly to the search engine and rewriting queries with specialized refinement models.
- We distinguish the impact of retrieval from that of query reshaping through controlled comparisons, and conduct a retrieval-quality analysis of the direct-question pipeline to explain why retrieval itself, rather than query reshaping, drives most of the observed accuracy gains.
- We benchmark the retrieval-augmented SLM against larger, state-of-the-art (SOTA) open-source models to assess whether external knowledge retrieval can bridge the parameter gap.
2. Related Works
2.1. Web Search Integration with Language Models
2.2. Retrieval-Augmented Generation for Small Language Models
2.3. Retrieval-Augmented Generation in Kazakh-Language Contexts
3. Methodology
3.1. Web Search Service Comparison and Selection
3.2. The Benchmarking Datasets
3.3. Model Selection and Preliminary Benchmarking
3.4. Web Search Integration and Search Query Optimization
Algorithm 1. Web search-enhanced inference for Kazakh-centric SLM: Zero-Shot, Naïve RAG, and Query-Refined RAG pipelines
Require: Input Prompt P, Workflow Strategy
Ensure: Final Model Response R
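The three workflows named in Algorithm 1 can be illustrated with a minimal sketch. It assumes only what the algorithm header and the contribution list state: an input prompt P, a workflow strategy, and a final response R, with Naïve RAG sending the raw question to the search engine and Query-Refined RAG rewriting the query first. The function name `run_pipeline`, the strategy strings, the callables `generate`, `search`, and `refine`, and the prompt template are all hypothetical placeholders, not the authors' implementation.

```python
from typing import Callable, List, Optional


def run_pipeline(
    prompt: str,
    strategy: str,
    generate: Callable[[str], str],
    search: Optional[Callable[[str], List[str]]] = None,
    refine: Optional[Callable[[str], str]] = None,
) -> str:
    """Return a response R for prompt P under one of three strategies.

    All callables are hypothetical stand-ins: `generate` wraps the SLM,
    `search` wraps a web search API and returns text snippets, and
    `refine` wraps a query-refinement model.
    """
    if strategy == "zero_shot":
        # No retrieval: the model answers from its own parameters.
        return generate(prompt)

    if search is None:
        raise ValueError("RAG strategies need a search callable")

    if strategy == "naive_rag":
        # The raw question is sent to the search engine unchanged.
        query = prompt
    elif strategy == "query_refined_rag":
        if refine is None:
            raise ValueError("query_refined_rag needs a refine callable")
        # A refiner model rewrites the question into a search query.
        query = refine(prompt)
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    snippets = search(query)
    context = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, start=1))
    # Assumed prompt template: retrieved snippets first, then the question.
    augmented = f"Search results:\n{context}\n\nQuestion:\n{prompt}"
    return generate(augmented)
```

In this sketch the two RAG variants differ only in how the search query is produced, which mirrors the controlled comparison between retrieval and query reshaping described in the contributions.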
3.5. Fine-Tuning
3.6. Evaluation
4. Results
4.1. Comparative Performance of RAG Pipelines
4.2. Multilingual Evaluation on MMLU-Pro
4.3. Statistical Significance of Accuracy Gains
4.4. Computational Overhead and Query Structural Analysis
4.5. Inference Time Analysis
4.6. Optimal RAG Pipeline and Comparison with Qwen3-32B and Gemma-3-27b-it
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Zhang, Q.; Liu, Z.; Pan, S. The Rise of Small Language Models. IEEE Intell. Syst. 2025, 40, 30–37.
- Nguyen, C.V.; Shen, X.; Aponte, R.; Xia, Y.; Basu, S.; Hu, Z.; Chen, J.; Parmar, M.; Kunapuli, S.; Barrow, J.; et al. A Survey on Small Language Models. In Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing—Natural Language Processing in the Generative AI Era, Varna, Bulgaria, 8–10 September 2025; INCOMA Ltd.: Shoumen, Bulgaria, 2025; pp. 807–821. Available online: https://aclanthology.org/2025.ranlp-1.93/ (accessed on 20 February 2026).
- Wang, F.; Zhang, Z.; Zhang, X.; Wu, Z.; Mo, T.; Lu, Q.; Wang, W.; Li, R.; Xu, J.; Tang, X.; et al. A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness. ACM Trans. Intell. Syst. Technol. 2025, 16, 145.
- Belcak, P.; Heinrich, G.; Diao, S.; Fu, Y.; Dong, X.; Muralidharan, S.; Lin, Y.C.; Molchanov, P. Small Language Models are the Future of Agentic AI. arXiv 2025, arXiv:2506.02153.
- Bharadwaj, A.; Jain, K. RAG-Assisted Small Language Models for Domain-Level Reasoning. In Proceedings of the 2025 Eighth International Conference on Image Information Processing (ICIIP), Solan, India, 27–29 November 2025; pp. 231–236.
- Liu, S.; Yu, Z.; Huang, F.; Bulbulia, Y.; Bergen, A.; Liut, M. Can Small Language Models With Retrieval-Augmented Generation Replace Large Language Models When Learning Computer Science? In ITiCSE 2024: Proceedings of the 2024 on Innovation and Technology in Computer Science Education V. 1; Association for Computing Machinery: New York, NY, USA, 2024; pp. 388–393.
- Xiong, H.; Bian, J.; Li, Y.; Li, X.; Du, M.; Wang, S.; Yin, D.; Helal, S. When Search Engine Services Meet Large Language Models: Visions and Challenges. IEEE Trans. Serv. Comput. 2024, 17, 4558–4577.
- Zhu, Y.; Yuan, H.; Wang, S.; Liu, J.; Liu, W.; Deng, C.; Chen, H.; Liu, Z.; Dou, Z.; Wen, J.R. Large Language Models for Information Retrieval: A Survey. ACM Trans. Inf. Syst. 2025, 44, 12.
- Institute of Smart Systems and Artificial Intelligence. Kazakh Large Language Model (ISSAI KAZ-LLM). 2024. Available online: https://huggingface.co/collections/issai/issai-kazllm-10-6732d58c81bcaf177442c362 (accessed on 15 January 2026).
- Koto, F.; Joshi, R.; Mukhituly, N.; Wang, Y.; Xie, Z.; Pal, R.; Orel, D.; Mullah, P.; Turmakhan, D.; Goloburda, M.; et al. Sherkala-Chat: Building a State-of-the-Art LLM for Kazakh in a Moderately Resourced Setting. arXiv 2025, arXiv:2503.01493.
- Kessikbayeva, G.; Cicekli, I. Rule Based Morphological Analyzer of Kazakh Language. In Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM, Baltimore, MD, USA, 27 June 2014; Çetinoğlu, Ö., Heinz, J., Maletti, A., Riggle, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 46–54.
- Arystanbekov, B.; Nurimanov, A.; Maxutov, A.; Albrekht, V.; Kuzdeuov, A.; Varol, H.A. Qolda: A Small Vision–Language Model for the Kazakh Language. IEEE Access 2026, 14, 46392–46414.
- Togmanov, M.; Mukhituly, N.; Turmakhan, D.; Mansurov, J.; Goloburda, M.; Sakip, A.; Xie, Z.; Wang, Y.; Syzdykov, B.; Laiyk, N.; et al. KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; pp. 14403–14416.
- Umbet, S.; Murzakhmetov, S.; Sagyndyk, B.; Yakunin, K.; Akishev, T.; Zubitski, P. KazBench-KK: A Cultural-Knowledge Benchmark for Kazakh. In Proceedings of the Fourth Workshop on NLP Applications to Field Linguistics, Vienna, Austria, 1 August 2025; pp. 38–57. Available online: https://aclanthology.org/2025.fieldmatters-1.4/ (accessed on 20 February 2026).
- Yeshpanov, R.; Efimov, P.; Boytsov, L.; Shalkarbayuli, A.; Braslavski, P. KazQAD: Kazakh Open-Domain Question Answering Dataset. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 9645–9656. Available online: https://aclanthology.org/2024.lrec-main.843/ (accessed on 20 February 2026).
- Goloburda, M.; Laiyk, N.; Turmakhan, D.; Wang, Y.; Togmanov, M.; Mansurov, J.; Sametov, A.; Mukhituly, N.; Wang, M.; Orel, D.; et al. Qorǵau: Evaluating Safety in Kazakh-Russian Bilingual Contexts. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 27 July–1 August 2025; pp. 9765–9784.
- Maxutov, A.; Arystanbekov, B.; Makhataeva, Z.; Yergen, A.; Taizhanov, N.; Nauryzbaikyzy, G.; Varol, H.A. Introducing Cultural Knowledge in Language Models: KazCulture Dataset for Kazakh Culture. IEEE Access 2026, 14, 44027–44042.
- Vu, T.; Iyyer, M.; Wang, X.; Constant, N.; Wei, J.; Wei, J.; Tar, C.; Sung, Y.H.; Zhou, D.; Le, Q.; et al. FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 13697–13720.
- Peng, B.; Galley, M.; He, P.; Cheng, H.; Xie, Y.; Hu, Y.; Huang, Q.; Liden, L.; Yu, Z.; Chen, W.; et al. Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback. arXiv 2023, arXiv:2302.12813.
- Cheung, T.H.; Lam, K.M. FactLLaMA: Optimizing Instruction-Following Language Models with External Knowledge for Automated Fact-Checking. In Proceedings of the 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Taipei, Taiwan, 31 October–3 November 2023; pp. 846–853.
- Xie, W.; Liang, X.; Liu, Y.; Ni, K.; Cheng, H.; Hu, Z. WeKnow-RAG: An Adaptive Approach for Retrieval-Augmented Generation Integrating Web Search and Knowledge Graphs. arXiv 2024, arXiv:2408.07611.
- Liu, X.; Lai, H.; Yu, H.; Xu, Y.; Zeng, A.; Du, Z.; Zhang, P.; Dong, Y.; Tang, J. WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences. In KDD ’23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2023; pp. 4549–4560.
- Wang, S.; Khramtsova, E.; Zhuang, S.; Zuccon, G. FeB4RAG: Evaluating Federated Search in the Context of Retrieval Augmented Generation. In SIGIR ’24: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval; Association for Computing Machinery: New York, NY, USA, 2024; pp. 763–773.
- Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; Hajishirzi, H. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv 2023, arXiv:2310.11511.
- Jeong, S.; Baek, J.; Cho, S.; Hwang, S.J.; Park, J. Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 16–21 June 2024; pp. 7036–7050.
- Yan, S.Q.; Gu, J.C.; Zhu, Y.; Ling, Z.H. Corrective Retrieval Augmented Generation. arXiv 2024, arXiv:2401.15884.
- Gao, L.; Ma, X.; Lin, J.; Callan, J. Precise Zero-Shot Dense Retrieval without Relevance Labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 1762–1777.
- Cormack, G.V.; Clarke, C.L.A.; Buettcher, S. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In SIGIR ’09: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval; Association for Computing Machinery: New York, NY, USA, 2009; pp. 758–759.
- Mansurova, A.; Tleubayeva, A.; Nugumanova, A.; Shomanov, A.; Seker, S.E. A Systematic Evaluation of Large Language Models and Retrieval-Augmented Generation for the Task of Kazakh Question Answering. Information 2025, 16, 943.
- Tleubayeva, A.; Mansurova, A.; Aubakirov, S.; Tabuldin, A.; Shomanov, A.; Makhambetova, Z. Multilingual QA-RAG: Evaluating LLMs’ Contradiction Handling in English and Kazakh. In Proceedings of the 2025 IEEE/ACIS 29th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), Busan, Republic of Korea, 25–27 June 2025; pp. 322–327.
- Serper. Serper.dev: The Fastest Google Search API. 2026. Available online: https://serper.dev/ (accessed on 24 February 2026).
- SearchAPI. SearchAPI: Real-time SERP API for Google and other Search Engines. 2026. Available online: https://www.searchapi.io/ (accessed on 24 February 2026).
- Brave Software. Brave Search API: Privacy-focused Web Search. 2026. Available online: https://brave.com/search/api/ (accessed on 24 February 2026).
- LangSearch. LangSearch Documentation. 2026. Available online: https://docs.langsearch.com/ (accessed on 24 February 2026).
- Perplexity AI. Perplexity AI Documentation. 2026. Available online: https://docs.perplexity.ai/ (accessed on 24 February 2026).
- Google Cloud. Vertex AI Search and Conversation. 2026. Available online: https://cloud.google.com/use-cases/site-search (accessed on 24 February 2026).
- Wang, Y.; Ma, X.; Zhang, G.; Ni, Y.; Chandra, A.; Guo, S.; Ren, W.; Arulraj, A.; He, X.; Jiang, Z.; et al. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2024; Volume 37, pp. 95266–95290.
- Institute of Smart Systems and Artificial Intelligence (ISSAI). MMLU-Pro Kazakh Russian Dataset. 2026. Available online: https://huggingface.co/datasets/issai/MMLU-Pro_Kazakh_Russian (accessed on 24 February 2026).
- Joshi, P.; Santy, S.; Budhiraja, A.; Bali, K.; Choudhury, M. The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 6282–6293.
- Plaza, I.; Melero, N.; del Pozo, C.; Conde, J.; Reviriego, P.; Mayor-Rocher, M.; Grandury, M. Spanish and LLM Benchmarks: Is MMLU Lost in Translation? arXiv 2024, arXiv:2406.17789.







| Search Service | Search Engine | Response Time | Cost per 1K Queries |
|---|---|---|---|
| Serper API [31] | Google | ~2 s | $0.50–$1.00 * |
| SearchAPI [32] | Multiple | ~2 s | $2.00–$4.00 * |
| Brave API [33] | Brave | ~1 s | $5.00 |
| LangSearch [34] | LangSearch | ~1 s | $5.00 |
| Perplexity [35] | Perplexity | ~1 s | $5.00 |
| Google Vertex AI Search [36] | Google | ~1 s | $4.00 |
| Benchmark | Questions | Languages | Domains |
|---|---|---|---|
| MMLU-Pro [37,38] | 12,032 (in each lang.) | English, Russian, Kazakh | Advanced Reasoning in STEM, Law, Business, and Health |
| KazMMLU [13] | 22,889 | Kazakh, Russian | STEM, Social Sciences, Humanities, and Regional Knowledge |
| KazCulture [17] | 1334 | Kazakh | Kazakh Traditions, History, Cuisine, and National Games |
| Original Question (Kazakh) | Translation (English) |
|---|---|
| MMLU-Pro | |
| Динамометр ваттметрінің жүретін катушкасы тізбегіндегі кедергі қандай болуы керек? | The resistance in the circuit of the moving coil of a dynamometer wattmeter should be |
| KazMMLU | |
| Тарихта болған адам екені дәлелденіп, Ақтөбе облысында тарихи ескерткіші қойылған батыр | A batyr (hero) whose historical existence has been proven and whose historical monument has been erected in the Aqtöbe region |
| KazCulture | |
| Сүт тағамдарын қазақтар қалай бір сөзбен атайды? | What do Kazakhs call dairy products in one word? |
| Model | Mode | Average | MMLU-Pro (EN) | MMLU-Pro (KK) | MMLU-Pro (RU) | KazMMLU (KK) | KazMMLU (RU) | KazCulture |
|---|---|---|---|---|---|---|---|---|
| KazLLM (8B) | - | 39.39 | 37.24 | 25.98 | 27.96 | 52.60 | 55.15 | 37.41 |
| Sherkala (8B) | - | 39.68 | 34.58 | 27.03 | 30.72 | 55.01 | 53.01 | 37.75 |
| Qolda (4B) | nothink | 45.21 | 42.12 | 33.36 | 37.50 | 51.93 | 56.85 | 49.48 |
| Qolda (4B) | think | 61.24 | 65.00 | 58.39 | 62.95 | 67.10 | 67.38 | 46.63 |
| Method | Mode | Average | KazMMLU (KK) | KazMMLU (RU) | KazCulture |
|---|---|---|---|---|---|
| Baseline (Qolda Zero-Shot) | nothink | 52.75 | 51.93 | 56.85 | 49.48 |
| Baseline (Qolda Zero-Shot) | think | 60.37 | 67.10 | 67.38 | 46.63 |
| Naïve RAG Qolda | nothink | 71.98 (+19.23) | 71.91 (+20.00) | 78.36 (+21.51) | 65.66 (+16.18) |
| Naïve RAG Qolda | think | 76.00 (+15.63) | 80.95 (+13.85) | 84.45 (+17.07) | 62.59 (+15.96) |
| Query Refiner Model in Query-Refined RAG | | | | | |
| GPT-5-Nano ref. Qolda eval. | nothink | 66.01 (+13.26) | 66.45 (+14.52) | 75.27 (+18.42) | 56.30 (+6.82) |
| GPT-5-Nano ref. Qolda eval. | think | 71.14 (+10.77) | 77.44 (+10.34) | 81.87 (+14.49) | 54.12 (+7.49) |
| Gemini-3-Flash ref. Qolda eval. | nothink | 74.76 (+22.01) | 74.73 (+22.80) | 78.95 (+22.10) | 70.61 (+21.13) |
| Gemini-3-Flash ref. Qolda eval. | think | 79.46 (+19.09) | 83.67 (+16.57) | 85.06 (+17.68) | 69.64 (+23.01) |
| Qolda ref. Qolda eval. | nothink | 68.23 (+15.48) | 69.20 (+17.27) | 72.06 (+15.21) | 63.42 (+13.94) |
| Qolda ref. Qolda eval. | think | 73.16 (+12.79) | 79.20 (+12.10) | 79.27 (+11.89) | 61.02 (+14.39) |
| Qolda-SFT ref. Qolda eval. | nothink | 71.42 (+18.67) | 72.42 (+20.49) | 75.49 (+18.64) | 66.34 (+16.86) |
| Qolda-SFT ref. Qolda eval. | think | 76.19 (+15.82) | 81.75 (+14.65) | 82.19 (+14.81) | 64.62 (+17.99) |
| Open-Source LLM Baselines | | | | | |
| Qwen3-32B | nothink | 60.28 (+7.53) | 67.81 (+15.88) | 72.88 (+16.03) | 40.14 (−9.34) |
| Qwen3-32B | think | 64.72 (+4.35) | 75.48 (+8.38) | 79.39 (+12.01) | 39.28 (−7.35) |
| Gemma-3-27b-it | - | 60.24 (+7.49) | 64.89 (+12.96) | 68.69 (+11.84) | 47.13 (−2.35) |
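The Average column and the parenthesized gains in the table above follow simple arithmetic: the average is the unweighted mean of the KazMMLU (KK), KazMMLU (RU), and KazCulture accuracies, and each gain is the difference from the zero-shot baseline in the same mode. The short sketch below reproduces the Gemini-3-Flash nothink row from the reported values; the helper names `mean_accuracy` and `gain` are illustrative only.

```python
def mean_accuracy(scores):
    """Unweighted mean of per-benchmark accuracies, rounded to 2 decimals."""
    return round(sum(scores) / len(scores), 2)


def gain(score, baseline):
    """Accuracy difference (in percentage points) against the baseline."""
    return round(score - baseline, 2)


# Column order: KazMMLU (KK), KazMMLU (RU), KazCulture; values from the table.
baseline_nothink = [51.93, 56.85, 49.48]
gemini_nothink = [74.73, 78.95, 70.61]

base_avg = mean_accuracy(baseline_nothink)
rag_avg = mean_accuracy(gemini_nothink)
per_column_gain = [gain(s, b) for s, b in zip(gemini_nothink, baseline_nothink)]

print(base_avg, rag_avg, gain(rag_avg, base_avg))  # 52.75 74.76 22.01
print(per_column_gain)  # [22.8, 22.1, 21.13]
```

Because the mean is unweighted, the 1334-question KazCulture set counts as much toward the Average column as each of the far larger KazMMLU splits.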
| Method | Mode | Average | MMLU-Pro (EN) | MMLU-Pro (KK) | MMLU-Pro (RU) |
|---|---|---|---|---|---|
| Baseline (Qolda Zero-Shot) | nothink | 37.66 | 42.12 | 33.36 | 37.50 |
| Baseline (Qolda Zero-Shot) | think | 62.11 | 65.00 | 58.39 | 62.95 |
| Naïve RAG Qolda | nothink | 41.23 (+3.57) | 51.44 (+9.32) | 33.01 (−0.35) | 39.24 (+1.74) |
| Naïve RAG Qolda | think | 62.71 (+0.60) | 68.01 (+3.01) | 57.53 (−0.86) | 62.60 (−0.35) |
| Query Refiner Model in Query-Refined RAG | | | | | |
| GPT-5-Nano ref. Qolda eval. | nothink | 43.51 (+5.85) | 50.52 (+8.40) | 37.50 (+4.14) | 42.51 (+5.01) |
| GPT-5-Nano ref. Qolda eval. | think | 63.26 (+1.15) | 67.66 (+2.66) | 58.85 (+0.46) | 63.27 (+0.32) |
| Gemini-3-Flash ref. Qolda eval. | nothink | 43.81 (+6.15) | 53.89 (+11.77) | 35.11 (+1.75) | 42.42 (+4.92) |
| Gemini-3-Flash ref. Qolda eval. | think | 64.43 (+2.32) | 70.50 (+5.50) | 58.63 (+0.24) | 64.17 (+1.22) |
| Qolda ref. Qolda eval. | nothink | 40.38 (+2.72) | 48.97 (+6.85) | 33.22 (−0.14) | 38.96 (+1.46) |
| Qolda ref. Qolda eval. | think | 61.75 (−0.36) | 66.21 (+1.21) | 57.15 (−1.24) | 61.88 (−1.07) |
| Qolda-SFT ref. Qolda eval. | nothink | 40.93 (+3.27) | 49.94 (+7.82) | 33.20 (−0.16) | 39.65 (+2.15) |
| Qolda-SFT ref. Qolda eval. | think | 62.24 (+0.13) | 67.28 (+2.28) | 57.22 (−1.17) | 62.23 (−0.72) |
| Method | Mode | MMLU-Pro | KazMMLU | KazCulture |
|---|---|---|---|---|
| Baseline (Qolda Zero-Shot) | nothink | 37.66 [37.18, 38.16] | 54.72 [54.08, 55.36] | 49.48 [46.85, 52.10] |
| Baseline (Qolda Zero-Shot) | think | 62.11 [61.61, 62.61] | 67.26 [66.66, 67.86] | 46.63 [43.93, 49.25] |
| Naïve RAG Qolda | nothink | 41.23 [40.74, 41.73] | 75.58 [75.02, 76.14] | 65.67 [63.12, 68.22] |
| Naïve RAG Qolda | think | 62.71 [62.21, 63.21] | 82.94 [82.47, 83.42] | 62.59 [59.97, 65.22] |
| Query Refiner Model in Query-Refined RAG | | | | |
| GPT-5-Nano ref. Qolda eval. | nothink | 43.51 [42.99, 44.01] | 71.47 [70.87, 72.05] | 56.30 [53.60, 58.92] |
| GPT-5-Nano ref. Qolda eval. | think | 63.26 [62.76, 63.75] | 79.96 [79.44, 80.48] | 54.12 [51.42, 56.75] |
| Gemini-3-Flash ref. Qolda eval. | nothink | 43.81 [43.28, 44.32] | 77.13 [76.58, 77.67] | 70.61 [68.14, 73.01] |
| Gemini-3-Flash ref. Qolda eval. | think | 64.43 [63.93, 64.92] | 84.46 [84.00, 84.93] | 69.64 [67.17, 72.11] |
| Qolda ref. Qolda eval. | nothink | 40.38 [39.87, 40.88] | 70.82 [70.24, 71.41] | 63.42 [60.79, 66.04] |
| Qolda ref. Qolda eval. | think | 61.75 [61.25, 62.25] | 79.24 [78.71, 79.76] | 61.02 [58.40, 63.57] |
| Qolda-SFT ref. Qolda eval. | nothink | 40.93 [40.42, 41.43] | 74.17 [73.59, 74.75] | 66.34 [63.79, 68.82] |
| Qolda-SFT ref. Qolda eval. | think | 62.24 [61.73, 62.74] | 82.00 [81.50, 82.51] | 64.62 [62.07, 67.24] |
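The bracketed ranges in the table above read as confidence intervals around each accuracy. The procedure used to obtain them is not visible in this section, so the following is only a sketch of one common choice, a percentile bootstrap over per-question correctness. The function name, the 1000 resamples, the 95% level, and the seed are all assumptions; the example input (660 correct out of 1334) is the zero-shot nothink KazCulture count implied by the reported 49.48% accuracy.

```python
import random


def bootstrap_accuracy_ci(n_correct, n_total, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for accuracy (in %), resampling questions.

    Assumed procedure: the source does not state how its intervals
    were computed, so this is an illustration, not a reproduction.
    """
    rng = random.Random(seed)
    # One 0/1 outcome per question.
    outcomes = [1] * n_correct + [0] * (n_total - n_correct)
    accuracies = []
    for _ in range(n_resamples):
        resample = rng.choices(outcomes, k=n_total)  # sample with replacement
        accuracies.append(100.0 * sum(resample) / n_total)
    accuracies.sort()
    lower_idx = int(n_resamples * alpha / 2)
    upper_idx = n_resamples - 1 - lower_idx
    point = 100.0 * n_correct / n_total
    return point, accuracies[lower_idx], accuracies[upper_idx]


point, lower, upper = bootstrap_accuracy_ci(660, 1334)
print(f"{point:.2f} [{lower:.2f}, {upper:.2f}]")
```

With only 1334 questions, an interval of this kind spans roughly five percentage points, which is the same order as the KazCulture ranges in the table; the much larger MMLU-Pro and KazMMLU sets yield intervals about one point wide.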
| Method | Average | MMLU-Pro | KazMMLU | KazCulture |
|---|---|---|---|---|
| Baseline (Qolda Zero-Shot) | 12.86 | 21.99 | 9.67 | 6.92 |
| Naïve RAG Qolda | 185.05 (+172.19) | 178.12 (+156.13) | 194.86 (+185.19) | 182.17 (+175.25) |
| Query Refiner Model in Query-Refined RAG | | | | |
| GPT-5-Nano ref. Qolda eval. | 180.75 (+167.89) | 218.28 (+196.29) | 191.36 (+181.69) | 132.61 (+125.69) |
| Gemini-3-Flash ref. Qolda eval. | 191.42 (+178.56) | 186.12 (+164.13) | 204.79 (+195.12) | 183.36 (+176.44) |
| Qolda ref. Qolda eval. | 189.32 (+176.46) | 185.10 (+163.11) | 204.97 (+195.30) | 177.88 (+170.96) |
| Qolda-SFT ref. Qolda eval. | 202.87 (+190.01) | 200.45 (+178.46) | 211.99 (+202.32) | 196.16 (+189.24) |
| Method | Average | MMLU-Pro | KazMMLU | KazCulture |
|---|---|---|---|---|
| Naïve RAG Qolda | 12.86 | 21.99 | 9.67 | 6.92 |
| Query Refiner Model in Query-Refined RAG | | | | |
| GPT-5-Nano ref. Qolda eval. | 13.20 (+0.34) | 16.84 (−5.15) | 12.02 (+2.35) | 10.73 (+3.81) |
| Gemini-3-Flash ref. Qolda eval. | 7.21 (−5.65) | 9.12 (−12.87) | 6.95 (−2.72) | 5.56 (−1.36) |
| Qolda ref. Qolda eval. | 9.00 (−3.86) | 12.36 (−9.63) | 8.38 (−1.29) | 6.26 (−0.66) |
| Qolda-SFT ref. Qolda eval. | 6.39 (−6.47) | 8.13 (−13.86) | 5.73 (−3.94) | 5.31 (−1.61) |
| Method | Query Transformation (KazCulture) | English Translation |
|---|---|---|
| Naïve RAG Qolda | Жалпақ алтын, күміс бетіндегі бедерлі ернеуге түрлі түсті тастар орнатылып жасалған білезік қалай аталады? | What is the name of a bracelet made with colorful stones set into an embossed rim on its flat gold or silver surface? |
| Query Refiner Model in Query-Refined RAG | | |
| GPT-5-Nano ref. Qolda eval. | бедерлі білезік атауы түрлі түсті тастар орнатылған білезік | Embossed bracelet name bracelet with inset colorful stones |
| Gemini-3-Flash ref. Qolda eval. | бедерлі ернеуге түрлі түсті тастар орнатылған білезік қалай аталады | What is the name of a bracelet with colorful stones set into an embossed rim |
| Qolda ref. Qolda eval. | Түрлі түсті тастары бар бедерлі алтын білезік қалай аталады? | What is the name of an embossed gold bracelet with colorful stones? |
| Qolda-SFT ref. Qolda eval. | жалпақ алтын күміс бедерлі білезік | Flat gold silver embossed bracelet |
| Mode | Result | Total | Explicit Snippets | Supportive Snippets | Irrelevant Snippets | Misleading Snippets |
|---|---|---|---|---|---|---|
| Total Count | | 1089 | 411 | 212 | 437 | 29 |
| nothink | Correct | 57.4 (625) | 87.1 (358) | 51.4 (109) | 34.6 (151) | 24.1 (7) |
| nothink | Incorrect | 42.6 (464) | 12.9 (53) | 48.6 (103) | 65.4 (286) | 75.9 (22) |
| think | Correct | 69.7 (759) | 88.1 (362) | 67.9 (144) | 55.1 (241) | 41.4 (12) |
| think | Incorrect | 30.3 (330) | 11.9 (49) | 32.1 (68) | 44.9 (196) | 58.6 (17) |
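Each percentage in the retrieval-quality table above is the share of correct (or incorrect) answers within one snippet category, with the raw count in parentheses. The sketch below recomputes the "Correct" rows from the reported counts; the dictionary layout and function names are illustrative choices.

```python
# Number of questions per snippet category (the "Total Count" row).
totals = {"Explicit": 411, "Supportive": 212, "Irrelevant": 437, "Misleading": 29}

# Correct answers per category and mode, taken from the table.
correct = {
    "nothink": {"Explicit": 358, "Supportive": 109, "Irrelevant": 151, "Misleading": 7},
    "think": {"Explicit": 362, "Supportive": 144, "Irrelevant": 241, "Misleading": 12},
}


def accuracy_by_category(correct_counts, category_totals):
    """Accuracy (%) conditioned on the category of the retrieved snippet."""
    return {
        name: round(100.0 * correct_counts[name] / category_totals[name], 1)
        for name in category_totals
    }


def overall_accuracy(correct_counts, category_totals):
    """Accuracy (%) over all analysed questions, regardless of category."""
    return round(
        100.0 * sum(correct_counts.values()) / sum(category_totals.values()), 1
    )


for mode, counts in correct.items():
    print(mode, overall_accuracy(counts, totals), accuracy_by_category(counts, totals))
```

Reading the counts this way shows that explicit snippets yield nearly the same accuracy in both modes (87.1% versus 88.1%), while the think mode recovers many more answers when snippets are irrelevant (55.1% versus 34.6%).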
| Method | Mode | Total | Query Generation | Web Search | Inference |
|---|---|---|---|---|---|
| Baseline (Qolda Zero-Shot) | nothink | | 0.00 | 0.00 | |
| Baseline (Qolda Zero-Shot) | think | | 0.00 | 0.00 | |
| Naïve RAG Qolda | nothink | | 0.00 | | |
| Naïve RAG Qolda | think | | 0.00 | | |
| Query Refiner Model in Query-Refined RAG | | | | | |
| GPT-5-Nano ref. Qolda eval. | nothink | | | | |
| GPT-5-Nano ref. Qolda eval. | think | | | | |
| Gemini-3-Flash ref. Qolda eval. | nothink | | | | |
| Gemini-3-Flash ref. Qolda eval. | think | | | | |
| Qolda ref. Qolda eval. | nothink | | | | |
| Qolda ref. Qolda eval. | think | | | | |
| Benchmark | Mode | Both Correct | Baseline Only Correct | RAG Only Correct | Neither Correct | χ² | p-Value | Result |
|---|---|---|---|---|---|---|---|---|
| MMLU-Pro | nothink | 10,711 | 2884 | 4172 | 18,329 | 234.75 | <0.0001 | RAG Win |
| MMLU-Pro | think | 14,017 | 8403 | 8620 | 5056 | 2.74 | 0.0978 | Tie |
| KazMMLU | nothink | 11,035 | 1491 | 6265 | 4098 | 2937.28 | <0.0001 | RAG Win |
| KazMMLU | think | 12,795 | 2600 | 6190 | 1304 | 1465.41 | <0.0001 | RAG Win |
| KazCulture | nothink | 537 | 123 | 339 | 335 | 100.05 | <0.0001 | RAG Win |
| KazCulture | think | 396 | 226 | 439 | 273 | 67.58 | <0.0001 | RAG Win |
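The test statistics in the significance table above (for example 234.75 and 2.74 for MMLU-Pro) are numerically consistent with McNemar's test with continuity correction applied to the two discordant cells, that is, questions answered correctly only by the baseline and only by the RAG pipeline. The sketch below recomputes them with the standard library. The 0.05 significance threshold and the "Baseline Win" label are assumptions; only "RAG Win" and "Tie" appear in the table.

```python
import math


def mcnemar_test(baseline_only, rag_only):
    """McNemar chi-square (continuity-corrected) and its p-value, 1 d.o.f."""
    b, c = baseline_only, rag_only
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


def verdict(baseline_only, rag_only, p_value, alpha=0.05):
    """Label the comparison; alpha = 0.05 is an assumed threshold."""
    if p_value >= alpha:
        return "Tie"
    return "RAG Win" if rag_only > baseline_only else "Baseline Win"


# Discordant counts for MMLU-Pro, taken from the table.
for mode, b, c in [("nothink", 2884, 4172), ("think", 8403, 8620)]:
    stat, p = mcnemar_test(b, c)
    print(mode, round(stat, 2), round(p, 4), verdict(b, c, p))
```

Only the discordant cells enter the statistic, so the large "both" and "neither" counts do not affect significance. This is why the think-mode MMLU-Pro comparison is a tie even though more than 17,000 questions changed outcome between the two systems.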
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Maxutov, A.; Medeu, N.; Varol, H.A. Web Search-Enhanced Small Language Models: A Case Study for a Kazakh-Centric Language Model. Mach. Learn. Knowl. Extr. 2026, 8, 128. https://doi.org/10.3390/make8050128

