This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
Article

Extreme Multi-Label Text Classification for Less-Represented Languages and Low-Resource Environments: Advances and Lessons Learned

1 Department of Knowledge Technologies, Jožef Stefan Institute, Jamova Cesta 39, 1000 Ljubljana, Slovenia
2 Jožef Stefan International Postgraduate School, Jamova Cesta 39, 1000 Ljubljana, Slovenia
3 Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, 1000 Ljubljana, Slovenia
4 School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4NS, UK
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2025, 7(4), 142; https://doi.org/10.3390/make7040142
Submission received: 9 October 2025 / Revised: 1 November 2025 / Accepted: 7 November 2025 / Published: 11 November 2025
(This article belongs to the Topic Applications of NLP, AI, and ML in Software Engineering)

Abstract

Amid ongoing efforts to develop extremely large, multimodal models, there is increasing interest in efficient Small Language Models (SLMs) that can operate without reliance on large data-centre infrastructure. However, recent SLMs (e.g., LLaMA or Phi) with up to three billion parameters are predominantly trained in high-resource languages, such as English, which limits their applicability to industries that require robust NLP solutions for less-represented languages and low-resource settings, particularly those requiring low latency and adaptability to evolving label spaces. This paper examines a retrieval-based approach to multi-label text classification (MLC) for a media monitoring dataset, with a particular focus on less-represented languages, such as Slovene. This dataset presents an extreme MLC challenge, with instances labelled using up to twelve thousand categories. The proposed method, which combines retrieval with computationally efficient prediction, effectively addresses challenges related to multilinguality, resource constraints, and frequent label changes. We adopt a model-agnostic approach that does not rely on a specific model architecture or language selection. Our results demonstrate that techniques from the extreme multi-label text classification (XMC) domain outperform traditional Transformer-based encoder models, particularly in handling dynamic label spaces without requiring continuous fine-tuning. Additionally, we highlight the effectiveness of this approach in scenarios involving rare labels, where baseline models struggle with generalisation.
Keywords: multi-label text classification; multilingual text classification; retrieval; low-resource environments; less-represented languages; media monitoring; news topics
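
To make the general idea of retrieval-based multi-label classification concrete, the following is a minimal sketch and not the method evaluated in the article. It assumes a toy bag-of-words embedding in place of a real multilingual encoder, and the class name `RetrievalLabeler`, the parameters `k` and `threshold`, and the example texts and labels are all hypothetical. Labels are predicted by similarity-weighted voting over the nearest stored examples, so a new label becomes available as soon as an example carrying it is indexed, without fine-tuning a model.

```python
import math
from collections import Counter, defaultdict


def embed(text):
    """Toy bag-of-words vector (assumption: a real system would use a text encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class RetrievalLabeler:
    """Hypothetical k-nearest-neighbour label voter over an in-memory index."""

    def __init__(self):
        self.index = []  # list of (vector, set of labels)

    def add(self, text, labels):
        # Indexing a labelled example is the only "training" step.
        self.index.append((embed(text), set(labels)))

    def predict(self, text, k=3, threshold=0.5):
        query = embed(text)
        neighbours = sorted(
            ((cosine(query, vec), labels) for vec, labels in self.index),
            key=lambda pair: pair[0],
            reverse=True,
        )[:k]
        votes = defaultdict(float)
        total = 0.0
        for similarity, labels in neighbours:
            total += similarity
            for label in labels:
                votes[label] += similarity
        if total == 0.0:
            return set()
        # Keep labels whose share of the similarity mass reaches the threshold.
        return {label for label, v in votes.items() if v / total >= threshold}


labeler = RetrievalLabeler()
labeler.add("central bank raises interest rates", ["economy", "banking"])
labeler.add("bank profits rise after rates increase", ["economy", "banking"])
labeler.add("football club wins national league title", ["sport"])
predicted = labeler.predict("interest rates and bank lending", k=2)

# A previously unseen label is usable immediately after indexing one example.
labeler.add("new lending rules for bank mortgages", ["regulation"])
predicted_new = labeler.predict("bank lending rules", k=1)
```

In this sketch the label space lives entirely in the index, which is why adding or retiring labels only requires updating stored examples; an approximate nearest-neighbour index would replace the linear scan at realistic scale.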

Share and Cite

MDPI and ACS Style

Ivačič, N.; Škrlj, B.; Koloski, B.; Pollak, S.; Lavrač, N.; Purver, M. Extreme Multi-Label Text Classification for Less-Represented Languages and Low-Resource Environments: Advances and Lessons Learned. Mach. Learn. Knowl. Extr. 2025, 7, 142. https://doi.org/10.3390/make7040142
