Extreme Multi-Label Text Classification for Less-Represented Languages and Low-Resource Environments: Advances and Lessons Learned
Abstract
1. Introduction
- RQ1
- Can we outperform the traditional one-versus-all (OVA) Transformer Encoder (TE) classification baselines in a multilingual setting?
- RQ2
- How effective is the method without label description embeddings?
- RQ3
- Can pre-trained multilingual IR models be used effectively without fine-tuning?
- RQ4
- Can we improve the method by incorporating additional contrastive learning with hard negative mining?
- The study demonstrates that IR models can outperform end-to-end fine-tuned models on specific datasets with an extremely large number of labels, offering a computationally efficient way to scale to an arbitrary number of labels.
- The experiments demonstrate that a significant gap remains between end-to-end fine-tuned models and IR models on less-represented, non-English tasks, making IR models the more suitable choice for these tasks in the XMC setting.
- When evaluating the modified RAE-XMC method on a novel, distinct dataset in a multilingual setting, we demonstrate that contrastive learning with precomputed hard negatives is beneficial (a minimal hard-negative mining sketch follows this list).
- Additionally, we demonstrate the effectiveness of the approach without relying on label-space embeddings, while also handling longer texts than those typically considered in traditional XMC tasks.
- Finally, by comparing the performance of the modified RAE-XMC in less-represented, non-English languages, we demonstrate that the method can be easily adapted to multilingual settings without significant loss in performance, as shown for Slovene.
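The precomputed hard negatives mentioned above can be illustrated with a short sketch: a pre-trained multilingual encoder embeds the training documents, a FAISS index retrieves each document's nearest neighbours, and neighbours that share no gold label are kept as hard negatives. The function name, the choice of BAAI/bge-m3, and k are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of precomputing hard negatives for contrastive fine-tuning.
# The encoder choice, k, and data layout are illustrative assumptions.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def mine_hard_negatives(texts, label_sets, model_name="BAAI/bge-m3", k=10):
    """label_sets[i] is the set of gold labels of texts[i]."""
    model = SentenceTransformer(model_name)
    emb = model.encode(texts, normalize_embeddings=True, convert_to_numpy=True)
    emb = emb.astype(np.float32)
    index = faiss.IndexFlatIP(emb.shape[1])        # inner product == cosine on unit vectors
    index.add(emb)
    _, neighbours = index.search(emb, k + 1)       # +1: each document retrieves itself
    hard_negatives = []
    for i, row in enumerate(neighbours):
        # A hard negative is a highly similar document that shares no gold label.
        hard_negatives.append([j for j in row
                               if j != i and not (label_sets[i] & label_sets[j])][:k])
    return hard_negatives
```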
2. Related Work
3. Data Description
3.1. Data Preparation
- Named Entities (NE): mostly Persons, Organisations, Products, and Brands, and only rarely Locations.
- General Concepts (GC): a set of terms representing a concept, for example, E-Mobility, Health insurance, Climate change, and Industrial waste.
3.2. Data Analysis
4. Methodology
Algorithm 1: Modified RAE-XMC Inference
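As a rough illustration of the inference summarised in Algorithm 1, the sketch below embeds the input document, retrieves its top-k nearest training documents from the knowledge memory, and aggregates their label vectors weighted by similarity, optionally interpolating with a label-embedding memory via λ as in the original RAE-XMC framework (Appendix C). The variable names, softmax weighting, and thresholding are assumptions made for illustration, not the authors' code.

```python
# Minimal sketch of retrieval-augmented XMC inference in the spirit of Algorithm 1.
# All names and the exact weighting scheme are illustrative assumptions.
import numpy as np

def rae_xmc_predict(query_emb, doc_embs, doc_labels, label_embs=None,
                    top_k=16, lam=1.0, threshold=0.1):
    """query_emb: (d,), doc_embs: (N, d), doc_labels: (N, L) binary,
    label_embs: (L, d) or None. All embeddings are L2-normalised."""
    sims = doc_embs @ query_emb                          # cosine similarity to the knowledge memory
    top = np.argsort(-sims)[:top_k]                      # top-k most similar training documents
    weights = np.exp(sims[top]) / np.exp(sims[top]).sum()  # softmax over retrieved documents
    scores = lam * (weights @ doc_labels[top])           # propagate labels from neighbours
    if label_embs is not None:                           # optional label-embedding memory
        scores += (1.0 - lam) * (label_embs @ query_emb)
    return np.where(scores >= threshold)[0]              # predicted label indices
```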
5. Experimental Setting
Inference Efficiency (Latency & Memory)
6. Results
6.1. Classification Performance Analysis
- The higher label diversity of the EURLEX57K dataset, which makes correct predictions harder;
6.2. Factor Attribution Analysis
7. Conclusions: Summary of Advances and Lessons Learned
8. Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Label Distribution Metrics
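This appendix reports label cardinality, density, and diversity for each dataset. Under the standard definitions (mean number of labels per sample; cardinality divided by the number of labels; number of distinct label sets), they can be computed from a binary label matrix as in the sketch below; the exact formulas used in the article may differ slightly.

```python
# Sketch of the conventional multi-label distribution metrics reported in the tables:
# cardinality, density, and diversity, computed from a binary indicator matrix Y.
import numpy as np

def label_distribution_metrics(Y):
    """Y: (n_samples, n_labels) binary indicator matrix."""
    per_sample = Y.sum(axis=1)
    cardinality = per_sample.mean()
    density = cardinality / Y.shape[1]
    diversity = len({tuple(row) for row in Y.astype(int)})  # number of distinct label sets
    return cardinality, per_sample.std(), density, diversity
```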
Appendix B. Serbian and Macedonian Language Preliminary Research
Appendix B.1. Data Analysis
| Dataset | Samples | Labels | Cardinality | Density | Diversity |
|---|---|---|---|---|---|
| NewsMon | 1,068,261 | 12,305 | 3.392 ± 4.017 | 0.00028 | 267,564 |
| NewsMoninitial | 1,068,261 | 1960 | 2.213 ± 2.316 | 0.00114 | 183,361 |
| NewsMonsr+dups | 88,074 | 2149 | 2.835 ± 3.181 | 0.00132 | 20,272 |
| NewsMonsr | 83,947 | 2128 | 2.826 ± 3.190 | 0.00133 | 19,809 |
| NewsMonmk+dups | 20,451 | 505 | 4.821 ± 3.863 | 0.00955 | 3022 |
| NewsMonmk | 12,133 | 494 | 4.341 ± 3.773 | 0.00879 | 2556 |

| Model | Method | μF1 | μP | μR | F1 | P | R | Acc |
|---|---|---|---|---|---|---|---|---|
| SVM | OVA-TFIDF | 69.15 | 76.45 | 63.13 | 33.43 | 39.26 | 30.97 | 44.80 |
| LogReg | OVA-TFIDF | 68.70 | 77.54 | 61.66 | 32.61 | 39.64 | 29.51 | 44.67 |
| BGE-M3 | zshot | 63.71 | 61.93 | 65.59 | 33.72 | 33.48 | 36.40 | 45.04 |
| BGE-M3 | RAE-XMC | 68.86 | 76.85 | 62.38 | 34.40 | 39.41 | 32.70 | 47.92 |
| FT-BGEsl | zshot | 65.65 | 65.07 | 66.23 | 35.19 | 35.73 | 37.22 | 46.33 |
| FT-BGEsl | RAE-XMC | 68.62 | 82.64 | 58.66 | 33.50 | 39.69 | 31.05 | 48.98 |
Appendix B.2. Preliminary Experimental Results
| Model | Method | μF1 | μP | μR | F1 | P | R | Acc |
|---|---|---|---|---|---|---|---|---|
| SVM | OVA-TFIDF | 81.72 | 87.67 | 76.53 | 46.07 | 49.96 | 44.14 | 49.43 |
| LogReg | OVA-TFIDF | 80.98 | 88.38 | 74.72 | 44.23 | 49.46 | 41.64 | 47.71 |
| BGE-M3 | zshot | 76.72 | 74.62 | 78.95 | 44.67 | 44.28 | 46.80 | 47.14 |
| BGE-M3 | RAE-XMC | 80.45 | 85.81 | 75.72 | 43.39 | 47.33 | 41.92 | 51.14 |
| FT-BGEsl | zshot | 80.29 | 79.48 | 81.11 | 47.95 | 47.74 | 49.34 | 49.02 |
| FT-BGEsl | RAE-XMC | 82.68 | 90.56 | 76.06 | 45.17 | 50.13 | 42.97 | 51.63 |
| Dataset | ΔμF1 HN | ΔμF1 RAE | ΔμF1 Int. | ΔμF1 Total | ΔAcc HN | ΔAcc RAE | ΔAcc Int. | ΔAcc Total |
|---|---|---|---|---|---|---|---|---|
| NewsMonsr | +1.94 | +5.15 | −2.18 | +4.90 | +1.29 | +2.88 | −0.23 | +3.95 |
| NewsMonmk | +3.56 | +3.72 | −1.33 | +5.96 | +1.88 | +4.00 | −1.39 | +4.49 |
Appendix C. The Original RAE-XMC Framework

Appendix D. Hyperparameters
Appendix D.1. SVM Hyperparameters
- Radial basis function (RBF) kernel,
- Regularisation parameter (C) was set to 10,000 for NewsMonsl and 10 for EURLEX57K.
- Gamma coefficient for RBF was set to 0.001 for NewsMonsl and 1.0 for EURLEX57K.
- Maximum term document frequency was set to 0.8 for NewsMonsl and 1.0 for EURLEX57K.
Appendix D.2. Logistic Regression Hyperparameters
- L2 penalty,
- Tolerance for stopping criteria: 0.0001,
- Inverse of regularisation strength (C) set to 1000 for NewsMonsl and 10 for EURLEX57K.
- Maximum term document frequency was set to 0.8 for NewsMonsl and EURLEX57K.
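Both TF-IDF baselines fit one binary classifier per label (one-versus-all). A minimal scikit-learn sketch with the NewsMonsl settings listed above might look as follows; data loading, label binarisation, and the EURLEX57K variants (which only change C, gamma, and max_df) are left out or assumed.

```python
# Sketch of the two OVA-TFIDF baselines with the NewsMonsl hyperparameters above.
# Data loading and label binarisation (Y_train as a binary matrix) are assumed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

svm_ova = make_pipeline(
    TfidfVectorizer(max_df=0.8),
    OneVsRestClassifier(SVC(kernel="rbf", C=10_000, gamma=0.001), n_jobs=-1),
)
logreg_ova = make_pipeline(
    TfidfVectorizer(max_df=0.8),
    OneVsRestClassifier(LogisticRegression(penalty="l2", C=1000, tol=1e-4), n_jobs=-1),
)
# svm_ova.fit(train_texts, Y_train); preds = svm_ova.predict(test_texts)
```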
Appendix D.3. XLM-RoBERTa-Base Fine-Tuning Hyperparameters
- AdamW optimiser with a learning rate of 3 × 10.
- Weight decay set to 0.01 for regularisation.
- Training for a maximum of 30 epochs.
- Batch size of 16.
- Maximum length of 512 sub-word tokens.
- Best model selection based on the validation set micro F1-score.
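A hedged sketch of this fine-tuning setup with Hugging Face Transformers is shown below; the dataset objects, the label list, the compute_metrics function, and the exact learning-rate exponent are assumptions rather than values taken from the article.

```python
# Sketch of multi-label fine-tuning of XLM-RoBERTa-base with the settings above
# (weight decay 0.01, batch size 16, up to 30 epochs, 512 sub-word tokens,
# best model selected by validation micro F1).
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(label_list),                  # label_list: assumed list of all labels
    problem_type="multi_label_classification",   # sigmoid outputs with BCE loss
)

args = TrainingArguments(
    output_dir="xlmr-xmc",
    learning_rate=3e-5,                          # AdamW; the exponent is assumed
    weight_decay=0.01,
    num_train_epochs=30,
    per_device_train_batch_size=16,
    eval_strategy="epoch",                       # `evaluation_strategy` in older Transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="micro_f1",            # compute_metrics must return this key
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds,   # tokenised to max 512 tokens
                  compute_metrics=compute_metrics)               # assumed micro-F1 function
trainer.train()
```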
Appendix D.4. BGE-M3 Fine-Tuning Hyperparameters
- AdamW optimiser with a learning rate of 1 × 10.
- Train group size of 4.
- Batch size of 2.
- Maximum length of 4096 sub-word tokens for query and passage.
- Temperature 0.02.
- 20 epochs.
- Precision fp16.
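These settings correspond to a contrastive (InfoNCE-style) retrieval objective: each query is paired with a group of four passages (one positive plus negatives, e.g. the precomputed hard negatives), scored against all in-batch passages, and scaled by the temperature. The sketch below illustrates that objective only; it is not the actual BGE-M3 fine-tuning code.

```python
# Sketch of an InfoNCE-style training step with a train group of 4 passages per
# query (1 positive + 3 negatives) and temperature 0.02, as listed above.
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, passage_emb, temperature=0.02, group_size=4):
    """query_emb: (B, d); passage_emb: (B * group_size, d); for each query the
    first passage of its group is the positive."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature                       # (B, B * group_size)
    targets = torch.arange(q.size(0), device=q.device) * group_size
    return F.cross_entropy(logits, targets)              # in-batch + hard negatives
```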
Appendix D.5. Retrieval Model Hyperparameter Search
| Model | Dataset | Top-k | Threshold | λ |
|---|---|---|---|---|
| BGE-M3 | EURLEX57K | 10 | 0.031 | 0.998 |
| BGE-M3 | NewsMon | 16 | 0.091 | 0.999 |
| FT-BGE | EURLEX57K | 69 | 0.031 | 1.000 |
| FT-BGE | NewsMon | 13 | 0.098 | 0.816 |
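Values like those in the table above could be obtained by maximising validation micro F1 over the retrieval-model inference hyperparameters, for example with Optuna. The sketch below is illustrative only: the parameter names, ranges, and the predict_validation helper are assumptions.

```python
# Sketch of a validation-set search over the retrieval-model inference
# hyperparameters (top-k and the two weighting values reported above).
import optuna
from sklearn.metrics import f1_score

def objective(trial):
    top_k = trial.suggest_int("top_k", 1, 100)
    threshold = trial.suggest_float("threshold", 0.0, 1.0)
    lam = trial.suggest_float("lambda", 0.0, 1.0)
    Y_pred = predict_validation(top_k=top_k, threshold=threshold, lam=lam)  # assumed helper
    return f1_score(Y_val, Y_pred, average="micro", zero_division=0)        # Y_val assumed

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=200)
print(study.best_params)
```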
Appendix E. Evaluation Metrics
Appendix E.1. Example-Based Evaluation Metrics
Appendix E.2. Label-Based Evaluation Metrics
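The result tables report micro-averaged scores (μF1, μP, μR) alongside unadorned F1/P/R and an Acc column. A conventional scikit-learn computation of label-based micro averages, example-based (sample-averaged) scores, and subset accuracy is sketched below; which definition each table column actually uses is specified in this appendix, and the mapping of the dictionary keys to the column names is only an assumption.

```python
# Sketch of common multi-label metrics: label-based micro averages,
# example-based (sample-averaged) scores, and subset accuracy.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def multilabel_scores(Y_true, Y_pred):
    micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(
        Y_true, Y_pred, average="micro", zero_division=0)       # label-based
    ex_p, ex_r, ex_f1, _ = precision_recall_fscore_support(
        Y_true, Y_pred, average="samples", zero_division=0)     # example-based
    subset_acc = accuracy_score(Y_true, Y_pred)                 # exact-match ratio
    return {"muF1": micro_f1, "muP": micro_p, "muR": micro_r,
            "F1": ex_f1, "P": ex_p, "R": ex_r, "Acc": subset_acc}
```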
References
| Keyword Expression | Explanation |
|---|---|
| tesla OR teslo OR tesli OR tesle -"nikol* tesl*" | Match the Tesla car brand only if the person phrase "Nikola Tesla" is not present in the text |
| mercedes* -formul* -"max verstap*" -F1 … | Match the Mercedes car brand only if Formula 1 terms and phrases are absent |
| nlb* -"lig* nlb" -"nlb lig*" | Match all forms of NLB terms, but only if the NLB-sponsored league phrases are not present (NLB is an acronym for the company Nova Ljubljanska Banka) |
| gradnj* -cest* -avtocest* -obvoznic* | Match the term construction only if the terms road, highway, and ring road are absent |
| “сенад* сoфтић*” OR “senad* softić*” | Match the person-phrase with various inflections and in various scripts |
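The expressions above combine wildcarded positive terms with negated terms or phrases. A minimal sketch of how such an expression could be evaluated against a document is shown below; the actual annotation tooling and its exact query semantics are not described here, so the wildcard-to-regex mapping is an assumption.

```python
# Sketch of evaluating a keyword expression such as
#   tesla OR teslo OR tesli OR tesle -"nikol* tesl*"
# against a document: at least one positive term must match and no negated
# term or phrase may match. The matching semantics are assumptions.
import re

def _to_regex(term):
    # "*" matches any word continuation; phrases keep their internal spaces.
    return re.compile(r"\b" + re.escape(term).replace(r"\*", r"\w*") + r"\b",
                      re.IGNORECASE)

def matches(document, positives, negatives):
    if not any(_to_regex(t).search(document) for t in positives):
        return False
    return not any(_to_regex(t).search(document) for t in negatives)

# matches(text, positives=["tesla", "teslo", "tesli", "tesle"],
#         negatives=["nikol* tesl*"])
```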
| Dataset | Samples | Labels | Cardinality | Density | Diversity |
|---|---|---|---|---|---|
| NewsMon | 1,068,261 | 12,305 | 3.392 ± 4.017 | 0.00028 | 267,564 |
| NewsMoninitial | 1,068,261 | 1960 | 2.213 ± 2.316 | 0.00114 | 183,361 |
| NewsMonsl+dups | 62,049 | 4052 | 3.034 ± 3.492 | 0.00075 | 17,307 |
| NewsMonsl | 50,784 | 3231 | 2.995 ± 3.463 | 0.00093 | 15,809 |
| EURLEX57K | 57,000 | 4271 | 5.069 ± 1.701 | 0.00119 | 34,982 |
| Model | Tokens | Dim | Params [M] | Mem [MB] | ML | EN |
|---|---|---|---|---|---|---|
| GTE-mb | 8192 | 768 | 305 | 582 | 6 | 31 |
| Jina-v3 | 8194 | 1024 | 572 | 1092 | 8 | 7 |
| BGE-M3 | 8194 | 1024 | 568 | 2167 | 5 | 89 |
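The zero-shot ("zshot") rows in the following tables rely purely on embedding similarity between a document and the label texts. A minimal sketch of that idea with one of the encoders above is given below; the label descriptions, the 0.5 threshold, and loading BGE-M3 through sentence-transformers are illustrative assumptions.

```python
# Sketch of zero-shot label scoring by embedding similarity: embed the document
# and the label names/descriptions, keep labels whose cosine similarity clears
# a threshold. Model choice, label texts, and threshold are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")                    # any of the encoders above
label_texts = ["E-Mobility", "Health insurance", "Climate change"]  # assumed label descriptions
label_emb = model.encode(label_texts, normalize_embeddings=True)

def zero_shot_labels(document, threshold=0.5):
    doc_emb = model.encode([document], normalize_embeddings=True)[0]
    sims = label_emb @ doc_emb                                # cosine similarity (unit vectors)
    return [label_texts[i] for i in np.argsort(-sims) if sims[i] >= threshold]
```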
| Model | Frequency | μF1 | μP | μR | F1 | P | R | Acc |
|---|---|---|---|---|---|---|---|---|
| GTE-mb | All | 57.44 | 55.21 | 59.86 | 27.02 | 27.38 | 29.47 | 40.02 |
| Jina-v3 | All | 58.55 | 56.27 | 61.03 | 27.58 | 27.87 | 30.11 | 40.94 |
| BGE-M3 | All | 59.04 | 55.84 | 62.62 | 28.28 | 28.12 | 31.18 | 42.58 |
| GTE-mb | Frequent | 73.99 | 76.30 | 71.81 | 70.28 | 74.18 | 67.86 | 62.79 |
| Jina-v3 | Frequent | 74.93 | 77.23 | 72.78 | 71.20 | 74.61 | 69.00 | 64.51 |
| BGE-M3 | Frequent | 76.41 | 78.41 | 74.51 | 72.63 | 75.67 | 70.63 | 65.77 |
| GTE-mb | Rare | 50.59 | 72.85 | 38.75 | 13.89 | 14.75 | 13.54 | 30.75 |
| Jina-v3 | Rare | 50.68 | 72.24 | 39.03 | 13.90 | 14.71 | 13.73 | 30.54 |
| BGE-M3 | Rare | 52.59 | 71.70 | 41.53 | 14.72 | 15.52 | 14.45 | 33.12 |
| Model | Frequency | μF1 | μP | μR | F1 | P | R | Acc |
|---|---|---|---|---|---|---|---|---|
| GTE-mb | All | 45.32 | 45.47 | 45.17 | 14.73 | 15.92 | 15.40 | 12.39 |
| Jina-v3 | All | 60.67 | 60.75 | 60.58 | 22.12 | 23.44 | 23.01 | 18.89 |
| BGE-M3 | All | 66.52 | 66.67 | 66.37 | 25.65 | 26.74 | 26.76 | 23.12 |
| GTE-mb | Frequent | 57.00 | 57.71 | 56.30 | 52.95 | 54.20 | 52.19 | 24.14 |
| Jina-v3 | Frequent | 72.45 | 72.76 | 72.14 | 68.95 | 69.30 | 68.86 | 35.65 |
| BGE-M3 | Frequent | 77.45 | 78.00 | 76.91 | 74.51 | 75.34 | 73.97 | 40.68 |
| GTE-mb | Rare | 18.29 | 38.41 | 12.00 | 3.71 | 4.30 | 3.49 | 9.52 |
| Jina-v3 | Rare | 27.25 | 41.32 | 20.32 | 6.08 | 6.68 | 6.02 | 15.63 |
| BGE-M3 | Rare | 34.14 | 47.32 | 26.70 | 8.10 | 8.81 | 8.00 | 20.60 |
| Metric | Value |
|---|---|
| Model steady-state GPU memory | 1147.4 MB |
| Peak GPU memory @ batch size | 2645.0 MB |
| Max allocated GPU memory @ batch size | 7578.7 MB |
| Peak host process memory | 2940.6 MB |
| Knowledge memory (host) | 748.5 MB |
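Figures like those above can be collected with standard counters: torch.cuda for steady-state and peak device memory, and psutil for host process memory. The sketch below is illustrative (the run_inference_batch helper is assumed); it is not the authors' benchmarking script.

```python
# Sketch of measuring steady-state and peak GPU memory plus host process memory
# around one inference batch. run_inference_batch is an assumed helper.
import psutil
import torch

def gpu_mb():
    return torch.cuda.memory_allocated() / 2**20

torch.cuda.reset_peak_memory_stats()
steady_state = gpu_mb()                              # model loaded, before inference
_ = run_inference_batch()                            # assumed helper running one batch
peak = torch.cuda.max_memory_allocated() / 2**20     # peak GPU memory at this batch size
host = psutil.Process().memory_info().rss / 2**20    # host process memory (approximate)
print(f"steady={steady_state:.1f} MB, peak={peak:.1f} MB, host={host:.1f} MB")
```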
| Model | Method | μF1 | μP | μR | F1 | P | R | Acc |
|---|---|---|---|---|---|---|---|---|
| SVM | OVA-TFIDF | 68.13 | 80.26 | 59.19 | 28.41 | 34.00 | 26.26 | 44.88 |
| LogReg | OVA-TFIDF | 67.47 | 83.34 | 56.68 | 26.76 | 33.39 | 24.21 | 45.22 |
| XLMR | baseline | 53.50 | 85.21 | 39.17 | 4.70 | 6.81 | 3.99 | 38.02 |
| BGE-M3 | zshot | 59.04 | 55.84 | 62.62 | 28.28 | 28.12 | 31.18 | 42.58 |
| BGE-M3 | ML-KNN | 42.94 | 77.94 | 29.63 | 5.92 | 9.77 | 4.78 | 29.63 |
| BGE-M3 | RAE-XMC | 65.58 | 76.11 | 57.61 | 28.18 | 32.61 | 26.88 | 44.49 |
| FT-BGE | zshot | 68.93 | 67.20 | 70.75 | 31.21 | 31.71 | 33.47 | 51.38 |
| FT-BGE | ML-KNN | 62.18 | 86.48 | 48.54 | 9.30 | 13.47 | 7.89 | 46.67 |
| FT-BGE | RAE-XMC | 73.67 | 86.07 | 64.39 | 29.24 | 34.64 | 27.27 | 56.12 |
| Model | Method | μF1 | μP | μR | F1 | P | R | Acc |
|---|---|---|---|---|---|---|---|---|
| SVM | OVA-TFIDF | 73.21 | 82.92 | 65.54 | 25.65 | 30.98 | 23.58 | 25.58 |
| LogReg | OVA-TFIDF | 71.61 | 82.07 | 63.51 | 23.84 | 29.38 | 21.63 | 22.34 |
| XLMR | baseline | 75.98 | 93.18 | 64.14 | 13.95 | 18.83 | 12.19 | 24.25 |
| BGE-M3 | zshot | 66.52 | 66.67 | 66.37 | 25.65 | 26.74 | 26.76 | 23.12 |
| BGE-M3 | ML-KNN | 57.68 | 81.49 | 44.63 | 9.63 | 14.50 | 8.03 | 11.77 |
| BGE-M3 | RAE-XMC | 69.72 | 74.63 | 65.41 | 25.73 | 28.69 | 25.11 | 23.47 |
| FT-BGE | zshot | 68.37 | 68.40 | 68.33 | 26.81 | 27.95 | 28.00 | 25.58 |
| FT-BGE | ML-KNN | 59.96 | 82.54 | 47.08 | 10.55 | 15.51 | 8.92 | 12.84 |
| FT-BGE | RAE-XMC | 69.99 | 72.35 | 67.77 | 27.12 | 29.42 | 27.25 | 25.91 |
| Model | Method | μF1 | μP | μR | F1 | P | R | Acc |
|---|---|---|---|---|---|---|---|---|
| SVM | OVA-TFIDF | 81.40 | 88.43 | 75.40 | 77.41 | 86.29 | 70.87 | 67.35 |
| LogReg | OVA-TFIDF | 81.55 | 90.54 | 74.18 | 77.23 | 88.48 | 69.34 | 67.79 |
| XLMR | baseline | 82.65 | 90.24 | 76.29 | 77.24 | 86.99 | 70.98 | 70.26 |
| BGE-M3 | zshot | 76.41 | 78.41 | 74.51 | 72.63 | 75.67 | 70.63 | 65.77 |
| BGE-M3 | ML-KNN | 66.15 | 85.25 | 54.05 | 56.36 | 81.71 | 45.97 | 50.18 |
| BGE-M3 | RAE-XMC | 78.96 | 87.44 | 71.97 | 74.20 | 85.08 | 67.13 | 66.24 |
| FT-BGE | zshot | 86.71 | 87.39 | 86.05 | 83.42 | 84.89 | 82.60 | 78.09 |
| FT-BGE | ML-KNN | 86.22 | 91.56 | 81.47 | 81.62 | 89.18 | 76.55 | 76.36 |
| FT-BGE | RAE-XMC | 88.31 | 93.59 | 83.59 | 84.51 | 91.35 | 79.40 | 79.64 |
| Model | Method | μF1 | μP | μR | F1 | P | R | Acc |
|---|---|---|---|---|---|---|---|---|
| SVM | OVA-TFIDF | 82.26 | 87.24 | 77.82 | 79.74 | 85.40 | 75.62 | 46.56 |
| LogReg | OVA-TFIDF | 80.63 | 85.81 | 76.05 | 77.88 | 83.99 | 73.33 | 43.10 |
| XLMR | baseline | 90.18 | 94.15 | 86.54 | 87.56 | 93.39 | 83.95 | 62.79 |
| BGE-M3 | zshot | 77.45 | 78.00 | 76.91 | 74.51 | 75.34 | 73.97 | 40.68 |
| BGE-M3 | ML-KNN | 72.46 | 84.55 | 63.39 | 65.64 | 80.44 | 58.57 | 29.04 |
| BGE-M3 | RAE-XMC | 80.02 | 83.06 | 77.20 | 76.87 | 80.46 | 74.11 | 43.30 |
| FT-BGE | zshot | 79.02 | 79.53 | 78.52 | 76.35 | 76.99 | 75.93 | 43.59 |
| FT-BGE | ML-KNN | 74.29 | 85.31 | 65.79 | 68.49 | 81.49 | 61.66 | 31.65 |
| FT-BGE | RAE-XMC | 80.05 | 81.71 | 78.46 | 77.40 | 79.32 | 75.85 | 44.86 |
| Model | Method | μF1 | μP | μR | F1 | P | R | Acc |
|---|---|---|---|---|---|---|---|---|
| SVM | OVA-TFIDF | 49.12 | 90.15 | 33.76 | 11.81 | 12.65 | 11.42 | 27.51 |
| LogReg | OVA-TFIDF | 45.06 | 92.51 | 29.79 | 10.55 | 11.32 | 10.18 | 24.24 |
| XLMR | baseline | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| BGE-M3 | zshot | 52.59 | 71.70 | 41.53 | 14.72 | 15.52 | 14.45 | 33.12 |
| BGE-M3 | ML-KNN | 0.00 | 0.00 | 0.00 | 0.06 | 0.06 | 0.06 | 0.00 |
| BGE-M3 | RAE-XMC | 50.20 | 90.58 | 34.72 | 12.53 | 13.45 | 12.09 | 27.96 |
| FT-BGE | zshot | 55.00 | 73.66 | 43.89 | 15.06 | 15.64 | 15.15 | 34.84 |
| FT-BGE | ML-KNN | 0.00 | 0.00 | 0.00 | 0.06 | 0.06 | 0.06 | 0.00 |
| FT-BGE | RAE-XMC | 48.30 | 92.89 | 32.64 | 11.93 | 12.88 | 11.46 | 26.02 |
| Model | Method | μF1 | μP | μR | F1 | P | R | Acc |
|---|---|---|---|---|---|---|---|---|
| SVM | OVA-TFIDF | 26.57 | 72.19 | 16.28 | 4.95 | 5.43 | 4.78 | 14.07 |
| LogReg | OVA-TFIDF | 30.19 | 78.28 | 18.70 | 5.57 | 6.15 | 5.33 | 18.76 |
| XLMR | baseline | 0.52 | 0.00 | 0.26 | 0.05 | 0.08 | 0.04 | 0.28 |
| BGE-M3 | zshot | 34.14 | 47.32 | 26.70 | 8.10 | 8.81 | 8.00 | 20.60 |
| BGE-M3 | ML-KNN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| BGE-M3 | RAE-XMC | 33.38 | 57.87 | 23.46 | 7.27 | 7.94 | 7.07 | 18.04 |
| FT-BGE | zshot | 36.21 | 50.19 | 28.32 | 8.40 | 9.19 | 8.24 | 22.59 |
| FT-BGE | ML-KNN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| FT-BGE | RAE-XMC | 36.48 | 57.58 | 26.70 | 7.97 | 8.81 | 7.78 | 21.88 |
| Dataset/Split | ΔμF1 HN | ΔμF1 RAE | ΔμF1 Int. | ΔμF1 Total | ΔAcc HN | ΔAcc RAE | ΔAcc Int. | ΔAcc Total |
|---|---|---|---|---|---|---|---|---|
| NewsMonsl (All) | +9.89 | +6.54 | −1.80 | +14.63 | +8.80 | +1.91 | +2.83 | +13.54 |
| NewsMonsl (Frequent) | +10.30 | +2.55 | −0.95 | +11.90 | +12.32 | +0.47 | +1.08 | +13.87 |
| NewsMonsl (Rare) | +2.41 | −2.39 | −4.31 | −4.29 | +1.72 | −5.16 | −3.66 | −7.10 |
| EURLEX57K (All) | +1.85 | +3.20 | −1.58 | +3.47 | +2.46 | +0.35 | −0.02 | +2.79 |
| EURLEX57K (Frequent) | +1.57 | +2.58 | −1.55 | +2.60 | +2.91 | +2.63 | −1.35 | +4.18 |
| EURLEX57K (Rare) | +2.07 | −0.75 | +1.03 | +2.34 | +1.99 | −2.56 | +1.85 | +1.28 |